Measuring Thematic Fit with Distributional Feature Overlap

In this paper, we introduce a new distributional method for modeling predicate-argument thematic fit judgments. We use a syntax-based DSM to build a prototypical representation of verb-specific roles: for every verb, we extract the most salient second order contexts for each of its roles (i.e. the most salient dimensions of typical role fillers), and then we compute thematic fit as a weighted overlap between the top features of candidate fillers and role prototypes. Our experiments show that our method consistently outperforms a baseline re-implementing a state-of-the-art system, and achieves better or comparable results to those reported in the literature for the other unsupervised systems. Moreover, it provides an explicit representation of the features characterizing verb-specific semantic roles.


Introduction
Several psycholinguistic studies in the last two decades have brought extensive evidence that humans activate a rich array of event knowledge during sentence processing: verbs (e.g. arrest) activate expectations about their typical arguments (e.g. cop, thief) (McRae et al., 1998; Altmann and Kamide, 1999; Ferretti et al., 2001; McRae et al., 2005; Hare et al., 2009; Matsuki et al., 2011), and nouns activate other nouns typically co-occurring in the same events (Kamide et al., 2003; Bicknell et al., 2010). Subjects are able to determine the plausibility of a noun for a given argument role and quickly use this knowledge to anticipate upcoming linguistic input (McRae and Matsuki, 2009). This phenomenon is referred to in the literature as thematic fit. Thematic fit estimation has been extensively used in sentence comprehension studies on constraint-based models, mainly as a predictor variable for disambiguating between possible structural analyses. More generally, thematic fit is considered a key factor in a variety of studies concerned with structural ambiguity (Vandekerckhove et al., 2009).
Starting from the work of Erk et al. (2010), several distributional semantic methods have been proposed to compute the extent to which nouns fulfill the requirements of verb-specific thematic roles, and their performance has been evaluated against human-generated judgments (Baroni and Lenci, 2010; Lenci, 2011; Sayeed and Demberg, 2014; Sayeed et al., 2015; Greenberg et al., 2015a,b). Most research on thematic fit estimation has focused on count-based vector representations (as distinguished from prediction-based vectors). Indeed, in their comparison between high-dimensional explicit vectors and low-dimensional neural embeddings, Baroni et al. (2014) found that thematic fit estimation is the only benchmark on which prediction models lag behind state-of-the-art performance. This is consistent with the observation that "thematic fit modeling is particularly sensitive to linguistic detail and interpretability of the vector space".
The present work sets itself among the unsupervised approaches to thematic fit estimation. By relying on explicit and interpretable count-based vector representations, we propose a simple, cognitively-inspired, and efficient thematic fit model using information extracted from dependency-parsed corpora. The key features of our proposal are a) prototypical representations of verb-specific thematic roles, based on feature weighting and filtering of second order contexts (i.e. contexts that are salient for many of the typical fillers of a given verb-specific thematic role), and b) a similarity measure which computes the Weighted Overlap (WO) between prototypes and candidate fillers. 3

Related Work
Erk et al. (2010) were, to the best of our knowledge, the first authors to measure the correlation between human-elicited thematic fit ratings and the scores assigned by a syntax-based Distributional Semantic Model (DSM). More specifically, their gold standard consisted of the human judgments collected by McRae et al. (1998) and Padó (2007). The plausibility of each verb-filler pair was computed as the similarity between new candidate nouns and previously attested exemplars for each specific verb-role pairing (as already proposed in Erk (2007)). Baroni and Lenci (2010) evaluated their Distributional Memory (henceforth DM) 4 framework on the same datasets, adopting an approach to the task that has become dominant in the literature: for each verb role, they built a prototype vector by averaging the dependency-based vectors of its most typical fillers. The higher the similarity of a noun with a role prototype, the higher its plausibility as a filler for that role. Lenci (2011) later extended the model to account for the dynamic update of the expectations on an argument, depending on how another role is filled.
By using the same DM tensor, this study tested an additive and a multiplicative model (Mitchell and Lapata, 2010) to compose and update the expectations on the patient filler of the subject-verb-object triples of the Bicknell dataset (Bicknell et al., 2010).
The thematic fit models proposed by Sayeed and Demberg (2014) and Sayeed et al. (2015) are similar to Baroni and Lenci's, but their DSMs were built by using the roles assigned by the SENNA semantic role labeler (Collobert et al., 2011) to define the feature space. These authors argued that the prototype-based method with dependencies works well when applied to the agent and to the patient role (which are almost always syntactically realized as subjects and objects), but that it might be problematic to apply it to different roles, such as instruments and locations, as the construction of the prototype would have to rely on prepositional complements as typical fillers, and the meaning of prepositions can be ambiguous. Comparing their results with Baroni and Lenci (2010), the authors showed that their system outperforms the syntax-based model DepDM and almost matches the scores of the best performing TypeDM, which uses hand-crafted rules. Moreover, they were the first to evaluate thematic role plausibility for roles other than agent and patient, as they computed the scores also for the instruments and for the locations of the Ferretti datasets (Ferretti et al., 2001). Greenberg et al. (2015a,b) further developed the TypeDM and the role-based models, investigating the effects of verb polysemy on human thematic fit judgments and introducing a hierarchical agglomerative clustering algorithm into the prototype creation process. Their goal was to cluster together typical fillers into multiple prototypes, corresponding to different verb senses, and their results showed constant improvements of the performance of the DM-based model.
3 Code: https://github.com/esantus/Thematic Fit
4 In this paper, we will make reference to two different models of DM: DepDM and TypeDM. DepDM counts the frequency of dependency links between words (e.g. read, obj, book), while TypeDM uses the variety of surface forms that express the link between words, rather than the link itself.
Finally, Tilk et al. (2016) presented two neural network architectures for generating probability distributions over selectional preferences for each thematic role. Their models took advantage of supervised training on two role-labeled corpora to optimize the distributional representation for thematic fit modeling, and managed to obtain significant improvements over the other systems on almost all the evaluation datasets. They also evaluated their model on the task of composing and updating verb argument expectations, obtaining a performance comparable to Lenci (2011).

Methodology
As has been pointed out in the literature, most works on unsupervised thematic fit estimation vary in the method adopted for constructing the prototypes. The semantic role prototype is usually a vector, obtained by averaging the most typical fillers, and the plausibility of new fillers depends on their similarity to the prototype, assessed by means of vector cosine (the standard similarity measure for DSMs; see Turney and Pantel (2010)).
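The standard prototype approach described above can be illustrated with a minimal Python sketch. It assumes sparse vectors stored as dictionaries from context to weight; the function names (`centroid`, `cosine`) and the toy weights are hypothetical and only serve the illustration.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a list of sparse vectors (dicts mapping context -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for ctx, weight in vec.items():
            total[ctx] += weight
    return {ctx: weight / len(vectors) for ctx, weight in total.items()}


def cosine(x, y):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * y[ctx] for ctx, weight in x.items() if ctx in y)
    norm_x = math.sqrt(sum(weight * weight for weight in x.values()))
    norm_y = math.sqrt(sum(weight * weight for weight in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Toy typical-filler vectors with invented weights (not taken from any corpus).
fillers = [
    {"obj-1:slice": 2.0, "obj-1:devour": 1.0},
    {"obj-1:slice": 1.0, "obj-1:bake": 3.0},
]
prototype = centroid(fillers)
candidate = {"obj-1:slice": 1.5, "obj-1:bake": 1.0}
print(cosine(prototype, candidate))
```

Note that every dimension of both vectors contributes to the cosine, whether or not it is salient for the role.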
Its merits notwithstanding, we argue that this method is not optimal for characterizing roles. Distributional vectors are typically built as out-of-context representations, and they conflate different senses. By building the prototype as the centroid of a cluster of vectors and then measuring the thematic fit with vector cosine, the plausibility score is inevitably affected by many contexts that are irrelevant for the specific verb-argument combination. This is likely to be one of the main reasons behind the difficulties of modeling roles other than agent and patient with syntax-based DSMs. We claim that improving the prototype representation might lead to a better characterization of thematic roles, and to a better treatment of polysemy.
When a verb and an argument are composed, humans are intuitively able to select only the part of the potential meaning of the words that is relevant for the concept being expressed (e.g. in The player hit the ball, humans would certainly exclude from the meaning of ball semantic dimensions that are strictly related to its dancing sense). In other words, not all the features of the semantic representations are active, and the composition process makes some features more 'prominent', while moving others to the background. Although we are not aware of experimental works specifically dedicated to verb-argument composition, a similar idea has been supported in studies on conceptual combinations (Hampton, 1997, 2007): when a head and a modifier are combined, their interaction affects the saliency of the features in the original concepts. For example, in racing car, the most salient properties would be those related to SPEED, whereas in family car SPACE properties would probably be more prominent. Yeh and Barsalou (2006) used a property priming experiment to show how the concept features activated during language comprehension vary across the background situations described by the sentence they occur in. When concepts are combined in a sentence, the features that are relevant for the specific combination are activated and are then easier to verify for human subjects.
The same could be true for linguistically derived properties of lexical meaning: neuroimaging evidence points to the early activation of word association areas during property generation tasks, and Santos et al. (2011) showed that word associates are often among the properties generated for a given concept. Such findings suggest that, while we combine concepts, both embodied simulations and word distributions influence property salience.
Our model makes the following assumptions:
• the composition between a verb role representation and an argument shares the same cognitive mechanism underlying conceptual combinations;
• at least part of semantic representations is derived from, and/or mirrored in, linguistic data. Consistently, the process of selecting the relevant features of the concepts being composed corresponds to modifying the salience of the dimensions of distributional vectors;
• thematic fit computation is carried out on the basis of the activation and selection of salient features of a verb thematic role prototype and of the candidate argument filler vectors.
We rely on syntax-based DSMs, using dependency relations to approximate verb-specific roles and to identify their most typical fillers: for agents/patients, we extract the most frequent subjects/objects, for instruments we use the prepositional complements introduced by with, and for locations those introduced by either on, at or in.
Assuming that the linguistic features of distributional vectors correspond to the properties of conceptual composition processes, a candidate filler can be represented as a sorted distributional vector of the filler term, in which the most salient contexts occupy the top positions. Similarly, the abstract representation of a verb-specific role is a sorted prototype-vector, whose features derive from the sum of the most typical filler vectors for that verb-specific role.
Unlike Baroni and Lenci, the core and novel aspect of our proposal, described in the following subsections, is that we do not simply measure the correlation between all the features of candidate and prototype vectors (as vector cosine would do on unsorted vectors), but rather we rank and filter the features, computing the weighted overlap with a rank-based similarity measure inspired by APSyn, a recent proposal by Santus et al. (2016a,b,c) which has shown interesting results in synonymy detection and similarity estimation. As we will show in the next sections, the new metric assigns high scores to candidate fillers sharing many salient contexts with the verb-specific role prototype.

Typical Fillers
The first step of our method consists in identifying the typical fillers of a verb-specific role. Following Baroni and Lenci (2010), we weighted the raw co-occurrences between verbs, syntactic relations and fillers in the TypeDM tensor of DM with Positive Local Mutual Information (PLMI; Evert (2004)).
Given the co-occurrence count $O_{vrf}$ of the verb v, a syntactic relation r and the filler f, we computed the expected count $E_{vrf}$ under the assumption of statistical independence:
$$LMI(v, r, f) = O_{vrf} \log \frac{O_{vrf}}{E_{vrf}}$$
$$PLMI(v, r, f) = \max(0, LMI(v, r, f))$$
From the ranked list of (v,r,f) tuples, for each slot, we selected as typical fillers the top k lexemes with the highest PLMI scores (see examples in Table 1, Typical Fillers column). In our experiments, we report results for k = {10, 30, 50}.
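The selection of typical fillers can be sketched in Python as follows. This is an illustration under stated assumptions, not the released implementation: the expected count is computed here from the three marginals under full independence (E = O_v · O_r · O_f / total²), which is one possible reading of "statistical independence"; the function names and the toy counts are hypothetical.

```python
import math
from collections import Counter


def plmi_scores(counts):
    """Map (verb, relation, filler) -> PLMI, given observed tuple counts.

    Assumption: E_vrf = O_v * O_r * O_f / total^2 (full three-way independence).
    """
    total = sum(counts.values())
    verb_tot, rel_tot, filler_tot = Counter(), Counter(), Counter()
    for (verb, rel, filler), observed in counts.items():
        verb_tot[verb] += observed
        rel_tot[rel] += observed
        filler_tot[filler] += observed
    scores = {}
    for (verb, rel, filler), observed in counts.items():
        expected = verb_tot[verb] * rel_tot[rel] * filler_tot[filler] / total ** 2
        lmi = observed * math.log(observed / expected)
        scores[(verb, rel, filler)] = max(0.0, lmi)
    return scores


def typical_fillers(scores, verb, rel, k):
    """Return the k fillers of the (verb, rel) slot with the highest PLMI."""
    slot = [(f, s) for (v, r, f), s in scores.items() if v == verb and r == rel]
    slot.sort(key=lambda pair: pair[1], reverse=True)
    return [filler for filler, _ in slot[:k]]


# Invented toy counts, for illustration only.
counts = {
    ("eat", "obj", "pizza"): 50,
    ("eat", "obj", "apple"): 30,
    ("eat", "obj", "idea"): 1,
    ("discuss", "obj", "idea"): 40,
    ("discuss", "obj", "pizza"): 2,
    ("eat", "sbj", "kid"): 20,
    ("discuss", "sbj", "kid"): 5,
}
scores = plmi_scores(counts)
print(typical_fillers(scores, "eat", "obj", k=2))
```

Because LMI multiplies the log ratio by the observed count, frequent tuples are favored over rare ones with the same association strength.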

Role Prototype Vectors
To represent the typical fillers, the candidate fillers and the verb-specific role prototypes (which are obtained by summing their typical filler vectors), we built a syntax-based DSM. This includes relation:word contexts, like sbj:dog, obj:apple, etc. Contexts were weighted with Positive Pointwise Mutual Information (PPMI; Church and Hanks (1990), Bullinaria and Levy (2012), Levy et al. (2015)). Given a context c and a word w, the PPMI is defined as follows:
$$PMI(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{|D| \cdot count(w, c)}{count(w) \cdot count(c)}$$
$$PPMI(w, c) = \max(PMI(w, c), 0)$$
where w is the target word, c is the given context, P(w,c) is the probability of co-occurrence, and D is the collection of observed word-context pairs. 8
The context c of the prototype vector P representing a thematic role has a value corresponding to the sum of the values of c for each of the k typical fillers used to build P. The contexts of P are then sorted according to their weight. Desirably, the highest-ranking contexts for a role prototype will be those that are more strongly associated with many of its typical fillers. Such second order contexts correspond to the most salient features of the verb-specific thematic role, as they are salient for many role fillers (some examples are reported in Table 1, Top Second Order Contexts column).
In summary, we built centroid vectors for our verb-specific thematic roles by means of second order contexts, which are first order dependency-based contexts of the most typical fillers of a verb-specific role. Since we are interested only in the most salient contexts, we ranked the centroid contexts according to their PPMI score, and we took the resulting rank as a distributional characterization of the thematic roles.
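The construction of a ranked role prototype can be sketched as follows. This is a minimal illustration assuming word-context counts stored in a dictionary; the function names, the context strings and the counts are hypothetical, and the PMI is computed from counts in the usual way (count(w,c) · |D| / (count(w) · count(c))).

```python
import math
from collections import Counter, defaultdict


def ppmi_vectors(pair_counts):
    """Map word -> {context: PPMI}, given (word, context) -> count."""
    total = sum(pair_counts.values())
    word_tot, ctx_tot = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_tot[word] += n
        ctx_tot[ctx] += n
    vectors = defaultdict(dict)
    for (word, ctx), n in pair_counts.items():
        pmi = math.log(n * total / (word_tot[word] * ctx_tot[ctx]))
        if pmi > 0:  # keep only positive associations
            vectors[word][ctx] = pmi
    return dict(vectors)


def role_prototype(filler_vectors):
    """Sum the filler vectors and rank contexts by decreasing weight."""
    summed = defaultdict(float)
    for vec in filler_vectors:
        for ctx, weight in vec.items():
            summed[ctx] += weight
    return sorted(summed.items(), key=lambda item: item[1], reverse=True)


# Invented toy counts, for illustration only.
pair_counts = {
    ("apple", "obj-1:slice"): 8,
    ("apple", "sbj-1:fall"): 2,
    ("idea", "obj-1:discuss"): 9,
    ("idea", "sbj-1:fall"): 1,
    ("pizza", "obj-1:slice"): 6,
    ("pizza", "obj-1:devour"): 4,
}
vectors = ppmi_vectors(pair_counts)
# Prototype for the patient role of "eat", built from two typical fillers.
prototype = role_prototype([vectors["apple"], vectors["pizza"]])
for ctx, weight in prototype:
    print(ctx, round(weight, 3))
```

Contexts shared by several fillers accumulate weight in the sum, which is what pushes them towards the top of the ranking.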

Filtering the Contexts
Filtering the prototype dimensions according to syntactic criteria might be useful to improve our role representations. It is, indeed, reasonable to hypothesize that predicates co-occurring with the typical patients of a verb are more relevant for the characterization of its patient role than, say, prepositional complements, as they correspond to other actions that are typically performed on the same patients.
Imagine that apple, pizza, cake etc. are among the most salient fillers for the OBJ slot of to eat, and that OBJ-1:slice-v, OBJ-1:devour-v, SBJ:kid-n, INSTRUMENT:fork-n, LOCATION:table-n are some of the most salient contexts of the prototype. 9 Things that are typically sliced and/or devoured are more likely to be good fillers for the patient role of to eat than things that are simply located on a table or that are patients of actions performed by kids. To test this hypothesis, we evaluated the performance of the system in three different settings, each of which selects:
• only predicates in a subject/object relation (SO setting);
• only prepositional complements (PREP setting);
• both of them (ALL setting).
8 [...] performance we will not discuss it further. Santus et al. (2016c) previously showed that their rank-based measure performs worse on PLMI-weighted vectors, as they are biased towards frequent contexts.
9 Our DSM also makes use of inverse syntactic dependencies: target SYN-1 context means that target is linked to context by the dependency relation SYN (e.g. meal OBJ-1 devour means that meal is OBJ of devour).
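The three filtering settings (SO, PREP, ALL) can be sketched as a simple filter over ranked contexts. The sketch assumes contexts encoded as "relation:word" strings and assumes that the SO setting corresponds to the sbj/obj relations and the PREP setting to the at/in/on/with relations; the set names, the function name and the toy weights are hypothetical.

```python
# Assumed grouping of dependency relations into the two filtered settings.
SO_RELATIONS = {"sbj", "sbj-1", "obj", "obj-1"}
PREP_RELATIONS = {"at-1", "in-1", "on-1", "with-1"}


def filter_contexts(ranked_contexts, setting):
    """Keep only the contexts allowed by the setting, preserving their order.

    ranked_contexts: list of (context, weight) pairs, context = "relation:word".
    setting: "SO", "PREP" or "ALL".
    """
    if setting == "ALL":
        return list(ranked_contexts)
    allowed = {"SO": SO_RELATIONS, "PREP": PREP_RELATIONS}[setting]
    return [
        (ctx, weight)
        for ctx, weight in ranked_contexts
        if ctx.split(":", 1)[0] in allowed
    ]


# Invented ranked prototype, for illustration only.
ranked = [("obj-1:slice-v", 3.0), ("with-1:fork-n", 2.0), ("sbj-1:fall-v", 1.0)]
print(filter_contexts(ranked, "SO"))
print(filter_contexts(ranked, "PREP"))
```

Filtering before ranking-based comparison means that the ranks used later are relative to the retained contexts only.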

Computing the Thematic Fit
Our hypothesis is that fillers whose salience-ranked vector has a large overlap with the prototype representation should have a high thematic fit. Such overlap should take into account not only the number of shared features, but also their respective ranks in the salience-ranked vectors.
When the prototype has been computed and the candidate filler vector has also been sorted, we can measure the Weighted Overlap by adapting APSyn (Santus et al., 2016a,b,c) to our needs:
$$WO(x, y) = \sum_{f \in x_{[1:N]} \cap y_{[1:N]}} \frac{1}{avg(r_x(f), r_y(f))}$$
where for every feature f in the intersection between the top N features of the sorted vectors x and y, i.e. $x_{[1:N]}$ and $y_{[1:N]}$, we sum 1 divided by the average rank of the shared feature in x and y, $r_x(f)$ and $r_y(f)$ (N is a tunable parameter). This measure assigns the maximum score to vectors sharing exactly the same dimensions, in the same salience ranking. The lower the rank of a shared context in the sorted vector, the smaller its contribution to the thematic fit score. If the feature set intersection is empty, the score will be 0.
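A minimal Python sketch of the Weighted Overlap follows. It assumes that each vector is given as a list of unique features already sorted by decreasing salience, with ranks starting at 1; the function name and the toy features are hypothetical.

```python
def weighted_overlap(x_ranked, y_ranked, n):
    """Weighted Overlap between two salience-ranked feature lists.

    For every feature shared by the top-n features of both lists, add
    1 / (average of its two ranks). Ranks are 1-based.
    """
    rank_x = {feat: i for i, feat in enumerate(x_ranked[:n], start=1)}
    rank_y = {feat: i for i, feat in enumerate(y_ranked[:n], start=1)}
    shared = rank_x.keys() & rank_y.keys()
    return sum(1.0 / ((rank_x[f] + rank_y[f]) / 2.0) for f in shared)


# Toy ranked features, for illustration only.
prototype = ["obj-1:slice", "obj-1:devour", "obj-1:bake"]
good_filler = ["obj-1:devour", "obj-1:slice", "obj-1:order"]
bad_filler = ["obj-1:discuss", "obj-1:reject", "obj-1:embrace"]

print(weighted_overlap(prototype, good_filler, n=3))
print(weighted_overlap(prototype, bad_filler, n=3))
```

With identical rankings of length 3 the score is 1/1 + 1/2 + 1/3, the maximum for N = 3; swapping the first two features lowers each of their contributions to 1/1.5.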
Unlike cosine similarity, which conflates multiple senses, measuring the Weighted Overlap between prototype and candidate filler can improve the estimation of the thematic fit by favoring the appropriate word senses: for example, for a verb-argument pair like embrace-v-communism-n, communism-n is likely to intersect and to increase the saliency (through the average rank) only of the second-order features of embrace-v referring to its abstract sense.
[...] Table 2 for the coverage of each system for the datasets).
Metrics. Performance is evaluated as the Spearman correlation between the scores of the systems and the human plausibility judgments.
Fillers. In order to make our results more comparable with previous studies, the typical fillers for each verb role were extracted from the TypeDM tensor of the Distributional Memory framework (see Section 3.1). Those were the same fillers used by Baroni and Lenci (2010) and Greenberg et al. (2015b).
DSM. Distributional information is derived from the concatenation of two corpora: the British National Corpus (Leech, 1992) and ukWaC (Baroni et al., 2009). Both were parsed with MaltParser (Nivre and Hall, 2005). From this concatenation, we built a dependency-based DSM, weighted with PPMI, containing 20,145 targets (i.e. nouns and verbs with frequency above 1000) and 94,860 contexts. The syntactic relations taken into account were: sbj, sbj-1, obj, obj-1, at-1, in-1, on-1, with-1.
Settings. To test our hypotheses and verify the consistency of the system, we tested a large range of settings, varying:
• the number of fillers used to build the prototype, with the most typical values in the literature ranging between 10 and 50. We report the results for 10, 30 and 50 fillers;
• the types of the dependency relations used for calculating the overlap: we report results for the SO, PREP and ALL settings;
• the value of N, that is the number of top contexts that we take into account when computing the weighted overlap. Table 3 reports the scores for our best setting, while the performance for other values of N is discussed in Section 5.
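The evaluation metric used in these experiments, the Spearman correlation between system scores and human judgments, can be sketched in plain Python as the Pearson correlation of average ranks. The function names and the toy ratings are hypothetical; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation; undefined (division by zero) for constant input."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented human ratings and model scores for four verb-filler pairs.
human = [6.5, 1.2, 4.0, 5.1]
model = [0.9, 0.1, 0.5, 0.4]
print(spearman(human, model))
```

Since only ranks matter, the metric is insensitive to the scale of the system scores, which makes cosine and WO values directly comparable.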
Baseline and State of the Art. As a baseline, we use the thematic fit model by Baroni and Lenci (2010), with no ranking of the features of the prototypes and with vector cosine as a similarity metric. Results are reported for 10, 30 and 50 fillers. For reference, we also report the results of state-of-the-art models, both the unsupervised (Baroni and Lenci, 2010; Sayeed and Demberg, 2014; Greenberg et al., 2015b) and the supervised ones (Tilk et al., 2016).

Results
Table 3 describes the performance of the best setting (weight: PPMI; N=2000). In the first three rows, the table shows the scores obtained by our system varying the types of dependency contexts (i.e. ALL, SO, PREP) and the number of fillers considered for the prototype (i.e. 10, 30 and 50). The other rows respectively show i) the scores obtained by calculating the vector cosine between the role prototype vector (i.e. the vector obtained by summing the most typical fillers, with no salience ranking of the dimensions) and the candidate filler vector and ii) the scores reported in the literature for the best unsupervised and supervised models. At a glance, our best scores always outperform the reimplementation of Baroni and Lenci, and are mostly competitive with the state-of-the-art models. More precisely, for agents and patients the performance is close to the reported scores for DM when only predicates are used in the WO calculation, as hypothesized in Section 3.3. The neural network of Tilk and colleagues retains a significant advantage over our models only for the McRae dataset. Our system, however, shows remarkable improvements on the Ferretti datasets, and specifically on Ferretti-Instruments, when only complements are used (see Section 3.3), outperforming even the supervised and more complex model by Tilk et al. (2016), which has access to semantic role information.
Compared to the other unsupervised models, our system has a statistically significant advantage over Baroni and Lenci (2010) on the locations dataset and over Sayeed and Demberg (2014) on the locations and on the instruments dataset (p < 0.05). To the best of our knowledge, the result for the instruments is the best reported until now in the literature. This is particularly interesting because, as pointed out by Sayeed and Demberg (2014), instruments and locations are difficult to model for a dependency-based system, given the ambiguity of prepositional phrases (e.g. with does not only encode instruments, but it can also encode other roles, such as in I ate a pizza with Mark). We think this is the main reason behind the different trend observed for the Instruments datasets with respect to the number of the fillers (see Table 3 and Figure 1). Unlike all the other datasets, instrument prototypes built with more fillers tend to be noisier and therefore to pull down both the vector cosine and WO performance (this is partially true also for locations, where the performance of cosine, and of WO with a lower number of contexts, drops with more than 30 fillers: see Figure 1). Systems based on semantic role labeling have an advantage in this sense, as they do not have to deal with prepositional ambiguity.
Table 4: Average gold values, number of items listed for both metrics, and distribution of syntactic and lexical forms among the 35 best and worst correlated items for every measure in the given datasets.
Our results show that, by weighting and filtering the features of the role prototype, dependency-based approaches can be successful in modeling roles other than agent and patient, and can ultimately also cope with the ambiguity of prepositional phrases.
Settings. Apart from the above-mentioned exceptions, the best scores are obtained by building the prototypes with a higher number of fillers, typically 50, and calculating the WO only with a syntactically filtered set of contexts. More specifically, Padó and McRae benefit from the calculation of WO using only second order subject-object predicates (i.e. SO), while Ferretti-Instruments and Ferretti-Locations benefit from the exclusive use of prepositional complements (i.e. PREP). On the other hand, the opposite setting (i.e. SO for Ferretti-Instruments and Ferretti-Locations and PREP for Padó and McRae) leads to much lower scores, whereas the full vectors (i.e. ALL) tend to have a stable but not excellent performance on all datasets.
As briefly mentioned above, in our experiments, we tested both PPMI and PLMI as weighting measures. Table 3 only reports PPMI scores because it performs more regularly than PLMI, whose behaviour is often unpredictable.
A parameter that has an impact on the performance of our system is the value of N, which is the number of second order contexts that are considered when calculating the WO. We have noticed that the performance of WO is directly related to the growth of N: Figure 1 plots WO for the different values of N with every combination of dataset and number of fillers. For space reasons, the plot only contains the performance for the best type of second order contexts for each dataset (i.e. SO for Padó and McRae and PREP for Ferretti-Locations and Ferretti-Instruments). As can be seen in Figure 1, the scores of WO tend to grow with N in all datasets. Interestingly, they are largely above the competitive baseline in most cases, the only exceptions being Padó (where a large N is necessary to outperform the baseline) and Ferretti-Locations with 10 fillers (prepositional ambiguity might have caused the introduction of noisy fillers among the top ones).
Agent & Patient. In order to further evaluate our system, we have split the Padó and McRae datasets into agent and patient subsets. Figure 2 describes the performance of WO and of the vector cosine baseline while varying N and the number of fillers. The plot shows a clearly better performance of WO for the agent role (i.e. subject), especially when N is equal to or above 1000 (note that the value of N has little impact in the agent subset of the McRae dataset). Such advantage, however, is reduced for the patient role (i.e. object). This is particularly interesting because we do not observe large drops in performance for the vector cosine between agent and patient role (except for Padó, k = 10). The drop is particularly noticeable in Padó, a dataset which has several non-constraining verbs (especially for the patient role: a similar observation was also made by Tilk et al. (2016)). As the constraints on the typical fillers of such verbs are very loose, we hypothesize that it is more difficult to find a set of salient features that are shared by many typical fillers. Therefore, estimations based on the whole vectors turn out to be more reliable. This can be confirmed by looking at the worst correlated words reported in the Lexemes column of Table 4.

Error Analysis
We performed an error analysis to verify, for the best settings of WO in each dataset, the correlation between vector cosine and WO scores (see Table 5), and the peculiarities of the entries with the strongest and the weakest correlation (see Table 4).
We found that WO and vector cosine always have a high correlation (i.e. above 0.80), with the highest correlations reported for McRae and Ferretti-Instruments. Looking at Table 4 we can also observe that:
• the average gold value of the 35 most (4.65) and least (4.56) correlated items does not substantially differ from the average gold value calculated on the full datasets (4.31), meaning that the distribution of likely and unlikely fillers among the best and worst correlated items is similar to the one in the datasets (i.e. no bias can be identified);
• both measures have difficulties on the same test items (probably because of loose semantic constraints), but report their best performances on different pairs (see Overlap and Lexemes columns);
• syntactically, vector cosine correlates better with objects, while WO is more balanced between objects and subjects, often showing a preference for the latter (see the distribution in Syntax column).

Conclusions
In this paper, we have introduced an unsupervised distributional method for modeling predicate-argument thematic fit judgments which works purely on syntactic information. The method, inspired by cognitive and psycholinguistic findings, consists in: i) extracting and filtering the most salient second order contexts for each verb-specific role, i.e. the most salient semantic dimensions of typical verb-specific role fillers; and then ii) estimating the thematic fit as a weighted overlap between the top features of the candidate fillers and of the prototypes. Tested on some popular datasets of thematic fit judgments, our method consistently outperforms a baseline re-implementing the thematic fit model of Baroni and Lenci (2010) and proves to be competitive with state-of-the-art models. It even registered the best performance on the Ferretti-Instruments dataset and the second best on Ferretti-Locations, two datasets known to be particularly hard to model for dependency-based approaches.
Our method is simple, economical and efficient: it works purely on syntactic dependencies (so it does not require a role-labeled corpus) and achieves good results even with no supervised training. Finally, it offers linguistically and cognitively grounded insights on the process of prototype creation and contextual feature salience, preparing the ground for further speculations and optimizations. For example, future work might aim at identifying strategies for tuning the parameter N to account for the different degrees of selectivity of each verb-specific role. Another possible extension would be the inclusion of a mechanism for updating the role prototypes depending on how the other roles are filled, which would be the key for a more realistic and dynamic model of thematic fit expectations (Lenci, 2011).