SP-10K: A Large-scale Evaluation Set for Selectional Preference Acquisition

Selectional Preference (SP) is a commonly observed language phenomenon and has proved useful in many natural language processing tasks. To provide a better evaluation method for SP models, we introduce SP-10K, a large-scale evaluation set that provides human ratings for the plausibility of 10,000 SP pairs over five SP relations, covering 2,500 of the most frequent verbs, nouns, and adjectives in American English. Three representative SP acquisition methods based on pseudo-disambiguation are evaluated with SP-10K. To demonstrate the importance of our dataset, we investigate the relationship between SP-10K and the commonsense knowledge in ConceptNet5 and show the potential of using SP to represent commonsense knowledge. We also use the Winograd Schema Challenge to show that the proposed new SP relations are essential for the hard pronoun coreference resolution problem.


Introduction
Selectional Preference (SP) is a common phenomenon in human language that has been shown to be related to semantics (Wilks, 1975). Here by SP we mean that, given a word and a dependency relation, human beings have preferences for which words are likely to be connected. For instance, when seeing the verb 'sing', it is highly plausible that its object is 'a song', and when seeing the noun 'air', it is highly plausible that its modifier is 'fresh'.
SP has been shown to be useful in a variety of tasks including sense disambiguation (Resnik, 1997), semantic role classification (Zapirain et al., 2013), coreference clustering (Hobbs, 1978; Inoue et al., 2016; Heinzerling et al., 2017), and machine translation (Tang et al., 2016). Given the importance of SP, its automatic acquisition has become a well-known research subject in the NLP community. However, current SP acquisition models are limited by existing evaluation methods. We discuss two widely used evaluation methods: human-labeled evaluation sets and the pseudo-disambiguation task. First, the most straightforward way to evaluate SP models is by asking human annotators. McRae et al. (1998), Keller and Lapata (2003), and Padó et al. (2006) proposed human-labeled SP evaluation sets containing hundreds of SP pairs (statistics are shown in Table 1). However, these datasets are too small to cover the diversity of the SP task adequately. Moreover, they only considered one-hop relations, such as 'verb-object' and 'modifier-noun' pairs. Aside from these relations, we believe that higher-order dependency relations may also reflect meaningful commonsense knowledge. Consider the following two examples of hard pronoun resolution problems from the Winograd Schema Challenge (Levesque et al., 2011): • (A) The fish ate the worm. It was hungry.
• (B) The fish ate the worm. It was tasty.
In (A), we can resolve 'it' to 'the fish' because it is more plausible that the subject of the verb 'eat' is hungry. On the other hand, in (B), we can resolve 'it' to 'the worm' because it is more likely that the object of the verb 'eat' is tasty. The above examples reflect the preferences of two two-hop dependency relations: 'verb-object-modifier' and 'verb-subject-modifier', which have not been investigated in previous work.
Second, pseudo-disambiguation has been a popular alternative evaluation method for the SP acquisition task (Ritter et al., 2010; de Cruys, 2014). Under this setting, a model is trained with pairs from a training corpus as positive examples and randomly generated fake pairs as negative examples, and is then evaluated on its ability to distinguish positive from negative examples constructed in the same way from a test corpus. However, the pseudo-disambiguation task only evaluates how well a model fits the data, which could be biased. The problem is that changing the training and test corpora may lead to different conclusions. Thus, it is less robust than collecting SP pairs from expert annotators, as in McRae et al. (1998), Keller and Lapata (2003), and Padó et al. (2006), or even asking many ordinary people to vote for a commonsense agreement.
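The pseudo-disambiguation protocol described above can be sketched as follows. This is a minimal illustration, not the setup of any particular cited paper: `score` stands for an arbitrary SP model, and the corruption strategy (swapping in a uniformly random dependent) is one common choice among several.

```python
import random

def pseudo_disambiguation_accuracy(score, test_pairs, vocabulary, seed=0):
    """Evaluate an SP model with pseudo-disambiguation.

    `score(head, rel, dep)` is any plausibility function; each observed
    test pair is a positive example, and a negative example is created
    by swapping in a random dependent from the vocabulary.  Returns the
    fraction of positives scored above their corrupted counterparts.
    """
    rng = random.Random(seed)
    correct = 0
    for head, rel, dep in test_pairs:
        fake_dep = rng.choice([w for w in vocabulary if w != dep])
        if score(head, rel, dep) > score(head, rel, fake_dep):
            correct += 1
    return correct / len(test_pairs)
```

Note that this metric only rewards fitting the observed corpus: a model that memorizes corpus co-occurrences scores perfectly even when the corpus itself is biased, which is exactly the weakness discussed above.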
The problems of these methods motivate the creation of a large-scale human-labeled SP evaluation set based on crowdsourcing, which can be used as the ground truth for the SP acquisition task.
In this paper, we present SP-10K, which is unprecedented in both size and the number of SP relations. It contains 10,000 selectional triplets consisting of 2,500 frequent verbs, nouns, and adjectives in American English. Besides commonly used one-hop SP relations ('dobj', 'nsubj', and 'amod'), we introduce two novel two-hop SP relations ('dobj amod' and 'nsubj amod'). We first evaluate three representative SP acquisition methods using SP-10K and compare the capacity of state-of-the-art pseudo-disambiguation approaches. We then show the relationship between SP-10K and commonsense knowledge using ConceptNet5 (Speer and Havasi, 2012) to demonstrate the potential of using SP to represent commonsense knowledge. Finally, we use a subset of the Winograd Schema Challenge (Levesque et al., 2011) to show that the proposed two-hop SP relations are essential for hard pronoun coreference resolution. SP-10K is available at: https://github.com/HKUST-KnowComp/SP-10K.
First, similar to existing human-labeled SP evaluation sets (McRae et al., 1998; Keller and Lapata, 2003; Padó et al., 2006), SP-10K uses the plausibility of selectional pairs as the annotation. Hence, SP-10K is clearly defined. Second, compared to these existing evaluation sets, as shown in Table 1, SP-10K covers a larger number of relations and SP pairs, making it a more representative evaluation set. Finally, as discussed in Section 3.4, the annotation of SP-10K is consistent and reliable.

Selectional Relations
Traditionally, the study of SP has focused on three selectional relations: verb-subject, verb-object, and noun-adjective. As demonstrated in Section 1, some verbs have a preference for the properties of their subjects and objects. For example, it is plausible to say that the subject of 'eat' is hungry and the object of 'eat' is tasty, but not the other way around. To capture such preferences, we propose two novel two-hop dependency relations, 'dobj amod' and 'nsubj amod'. Examples of these relations are presented in Table 2. In total, SP-10K contains five SP relations.
Following previous approaches (McRae et al., 1998; Padó et al., 2006), for the 'dobj' and 'nsubj' relations, we take a verb as the head and a noun as the dependent. Similarly, for the 'dobj amod' and 'nsubj amod' relations, we take a verb as the head and an adjective as the dependent. Moreover, for the 'amod' relation, we take a noun as the head and an adjective as the dependent.

Candidate SP Pairs
The selected vocabulary consists of 2,500 verbs, nouns, and adjectives from the 5,000 most frequent words in the Corpus of Contemporary American English.
For each SP relation, we provide two types of SP pairs for our annotators to label: frequent pairs and random pairs. For each selectional relation, we first select the 500 most frequent heads. We then match each head with its two most frequently paired dependents, as well as two randomly selected dependents from our vocabulary. As such, we retrieve 2,000 pairs for each relation. Altogether, we retrieve 10,000 pairs for five selectional relations. These pairs are composed of 500 verbs, 1,343 nouns, and 657 adjectives. Examples of sampled pairs are presented in Table 2.
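The sampling scheme above (500 most frequent heads per relation, each matched with its two most frequent dependents and two random ones) can be sketched as follows. This is an illustrative reconstruction, not the authors' released script; the `cooccurrence` input format is an assumption.

```python
import random
from collections import Counter

def sample_candidate_pairs(cooccurrence, vocabulary, n_heads=500, seed=0):
    """Build frequent and random candidate pairs for one SP relation.

    `cooccurrence` maps (head, dependent) -> corpus count.  For each of
    the `n_heads` most frequent heads we keep its two most frequently
    paired dependents plus two dependents drawn at random, yielding four
    candidate pairs per head.
    """
    rng = random.Random(seed)
    head_counts = Counter()
    for (head, dep), count in cooccurrence.items():
        head_counts[head] += count
    pairs = []
    for head, _ in head_counts.most_common(n_heads):
        deps = Counter({d: c for (h, d), c in cooccurrence.items() if h == head})
        frequent = [d for d, _ in deps.most_common(2)]
        random_deps = rng.sample([w for w in vocabulary if w not in frequent], 2)
        pairs += [(head, d) for d in frequent + random_deps]
    return pairs
```

With 500 heads and four dependents each, this yields the 2,000 pairs per relation described in the text.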

Annotation of SP Pairs
We employ the Amazon Mechanical Turk platform (MTurk) for our annotations.

Survey Design
Following the SimLex-999 annotation guidelines (Hill et al., 2015), we invite at least 11 annotators to score each SP pair. We divide our 10,000 pairs into 100 surveys. Each survey contains 103 questions, three of which are checkpoint questions selected from the examples to control the labeling quality. Within a survey, all the questions are derived from the same selectional relation to improve the efficiency of survey completion.
Each survey consists of three parts. We begin by explaining the task to the annotators, including how to deal with special cases like multi-word expressions. Then, we present three examples to help the annotators better understand the task. Finally, we ask questions using the following templates (VERB, ADJ, and NOUN are placeholders and will be replaced with the corresponding heads and dependents in the actual surveys): • dobj: How suitable do you think it is if we use NOUN as the object of the verb VERB?
• nsubj: How suitable do you think it is if we use NOUN as the subject of the verb VERB?
• amod: How suitable do you think it is if we use ADJ to describe the noun NOUN?
• dobj amod: How suitable do you think it is if we use ADJ to describe the object of the verb VERB?
• nsubj amod: How suitable do you think it is if we use ADJ to describe the subject of the verb VERB?
For each question, the annotator is asked to select one of the following options: Perfectly match (5), Make sense (4), Normal (3), Seems weird (2), It's not applicable at all (1). We randomize the order of frequent and random pairs to prevent annotators from simply memorizing the question order.

Participants and Annotation
We require that our annotators are 'Master Workers', indicating reliable annotation records, and that they are either native English speakers or currently live and/or work in English-speaking locales. Based on these criteria, we identified 125 valid annotators. These annotators produced 130,575 ratings for a total cost of USD 1,182.80. We support multiple participation of annotators by ensuring that subsequent surveys are generated from their previously unanswered questions.
From our annotation statistics, we notice that different selectional relations take different amounts of time to annotate. As shown in Figure 1, the annotators spent the least time on the 'amod' relation, suggesting that the modifying relation is relatively easy to understand and judge. Another interesting finding is that the annotators spent more time on relations involving subjects than those involving objects, which is consistent with the observation of Jackendoff (1992) that verbs have clearer preferences for objects than subjects.

Post-processing
We excluded ratings from annotators who (1) provided incorrect answers to any of the checkpoint questions or (2) demonstrated suspicious annotation patterns (e.g., marking all pairs as 'normal').
After excluding ratings based on these criteria, we obtained 100,532 valid annotations with an overall acceptance rate of 77%. We calculate the plausibility for each SP pair by taking the average rating for the pair over all (at least 10) valid annotations, then linearly scaling this average from the 1-5 interval to the 0-10 interval. This approach is similar to the post-processing in Hill et al. (2015). We present a sample of SP pairs in Table 3. Some of the pairs are interesting. For example, for the 'dobj amod' relation, annotators agree that lifting a heavy object is a commonly used expression, while earning a rubber object is rare.
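The rescaling step is a simple affine map from the 1-5 rating scale onto 0-10. A minimal sketch:

```python
def plausibility(ratings):
    """Average the valid 1-5 ratings for one SP pair, then rescale the
    mean linearly from the [1, 5] interval onto the [0, 10] interval."""
    mean = sum(ratings) / len(ratings)
    return (mean - 1) * 10 / 4
```

For example, a pair rated 5 by every annotator receives plausibility 10, a pair rated 1 by everyone receives 0, and an all-3 pair sits at 5.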

Inter-Annotator Agreement
Following standard practices from the previous datasets WSIM-203 (Reisinger and Mooney, 2010) and SimLex-999 (Hill et al., 2015), we employ Inter-Annotator Agreement (IAA), which computes the average correlation of an annotator with the average of all the other annotators, to evaluate the overall annotation quality. As presented in Table 4, the overall IAA of SP-10K is ρ = 0.75, which is comparable to the existing datasets WSIM-203 (0.65) and SimLex-999 (0.78). Unsurprisingly, the IAA is not uniform across SP relations. As shown in Table 4, the complicated two-hop SP relations are more challenging and achieve relatively lower correlations than the simpler one-hop relations. We also notice that the agreement among annotators for SP relations involving the subjects of verbs is relatively low. These observations are consistent with our earlier discussion of annotation time, and further support the claim that verbs have stronger preferences for their objects than their subjects.

Evaluation of SP Acquisition Methods
To show the performance of existing SP acquisition methods and demonstrate the effect of different training corpora, we evaluate representative SP acquisition methods on SP-10K with the following training corpora: (1) Wiki: Wikipedia is the largest free knowledge dataset. For this experiment, we select the English version of Wikipedia and filter out pages containing fewer than 100 tokens or fewer than five hyperlinks. After filtering, our dataset contains over three million Wikipedia pages.
(2) Yelp: Yelp is a social media platform where users can write reviews for businesses, e.g., restaurants, hotels, etc. The latest release of the Yelp dataset contains over five million reviews.
(3) NYT: the New York Times corpus, a large collection of news articles.
We parsed these raw corpora using the Stanford dependency parser (Schuster and Manning, 2016). Detailed statistics are shown in Table 5.

Methods
We now introduce SP acquisition methods.
Posterior Probability (PP): Resnik (1997) proposes PP as a means of acquiring SP knowledge from raw corpora. Given a head h, a relation r, and a dependent d, PP uses the following probability to predict the plausibility:

PP(h, r, d) = C_r(h, d) / C_r(h),

where C_r(h) and C_r(h, d) denote how many times the head h and the head-dependent pair (h, d) appear in the relation r, respectively.

Distributional Similarity (DS): Erk et al. (2010) describe a method that uses corpus-driven DS metrics for the induction of SP. Given a head h, a relation r, and a dependent d, DS uses the following equation to predict the plausibility:

DS(h, r, d) = (1 / Z_{r,h}) Σ_{d' ∈ O_{r,h}} w(h, d') · s(d, d'),

where O_{r,h} is the set of dependents that have been attested with head h and relation r, w(h, d') is the weight function, s(d, d') is the similarity function, and Z_{r,h} = Σ_{d' ∈ O_{r,h}} w(h, d') is the normalization factor. We use the frequency of the pair (h, d') as the weight function and the cosine similarity of GloVe embeddings (Pennington et al., 2014) as the similarity function s(d, d'), given the relative popularity of these embeddings.
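The two scoring functions above can be sketched directly from their definitions. This is an illustrative reconstruction under the stated formulas, not the authors' implementation; `weight` and `sim` are passed in so any weighting or similarity function (e.g. pair frequency and GloVe cosine similarity) can be plugged in.

```python
from collections import Counter

class PosteriorProbability:
    """PP score: C_r(h, d) / C_r(h), estimated from observed triples."""
    def __init__(self, triples):
        # triples: iterable of (head, relation, dependent) observations
        triples = list(triples)
        self.pair_counts = Counter(triples)
        self.head_counts = Counter((h, r) for h, r, _ in triples)

    def score(self, head, rel, dep):
        total = self.head_counts[(head, rel)]
        return self.pair_counts[(head, rel, dep)] / total if total else 0.0

def distributional_similarity(head, rel, dep, observed, weight, sim):
    """DS score: (1 / Z_{r,h}) * sum over attested dependents d' of
    weight(head, d') * sim(dep, d'), with Z_{r,h} normalizing the
    weights.  `observed` maps (head, rel) to the attested dependents."""
    deps = observed.get((head, rel), [])
    z = sum(weight(head, d2) for d2 in deps)
    if z == 0:
        return 0.0
    return sum(weight(head, d2) * sim(dep, d2) for d2 in deps) / z
```

Note that PP can only score dependents seen with the head in the corpus, while DS generalizes to unseen dependents through the similarity function, which is the main motivation for the DS family of methods.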
Neural Network (NN): de Cruys (2014) proposes an NN-based method for the SP acquisition task. The main framework is a two-layer fully-connected NN. For each SP pair (h, d), the framework uses the concatenation of embeddings [v_h, v_d] as the input to the NN, where v_h and v_d are randomly initialized word embeddings for the words h and d, respectively. The ranking loss (Collobert and Weston, 2008) is used as the training objective, where positive examples consist of all the SP pairs in the corpus and negative examples are randomly generated. During training, both the model parameters and the embeddings are jointly updated. We follow the original paper's experimental setting.
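The scorer and ranking objective can be sketched as below. This is a simplified illustration of the general architecture (two-layer network over concatenated embeddings, hinge ranking loss), not de Cruys's exact configuration; the tanh activation, hidden size, and margin of 1.0 are assumptions.

```python
import numpy as np

def sp_score(params, v_h, v_d):
    """Two-layer feed-forward scorer over concatenated embeddings
    [v_h, v_d]: hidden = tanh(W1 x + b1), score = w2 . hidden."""
    W1, b1, w2 = params
    hidden = np.tanh(W1 @ np.concatenate([v_h, v_d]) + b1)
    return float(w2 @ hidden)

def ranking_loss(params, v_h, v_pos, v_neg, margin=1.0):
    """Hinge ranking loss: the observed dependent should outscore a
    randomly corrupted one by at least `margin`."""
    return max(0.0, margin - sp_score(params, v_h, v_pos) + sp_score(params, v_h, v_neg))
```

In training, this loss would be minimized over all observed pairs with freshly sampled negatives, updating both `params` and the embeddings themselves.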

Results and Analysis
We report the average Spearman ρ in Table 6 as our performance measure. We make the following observations.
(1) The choice of training corpus influences SP acquisition models. For the same method, the general corpora, i.e., Wiki and NYT, outperform the domain-specific corpus, i.e., Yelp. Yelp performs best on the 'dobj' relation and comparably on the 'dobj amod' relation, which indicates that the language use on Yelp may better reflect the plausibility of objects rather than of subjects.
(2) The NN model, despite its strong reported performance on the pseudo-disambiguation task, has limited effectiveness on our dataset, which shows that pseudo-disambiguation cannot effectively represent ground-truth SP. This further demonstrates the value of SP-10K as an evaluation set for SP acquisition.
(3) The overall performance of existing methods is quite lackluster, suggesting that these models insufficiently address the SP acquisition task.We hope that the release of our dataset will motivate efforts at deriving knowledge from SP and exploring the SP acquisition task.

SP and Commonsense Knowledge
In this section, we quantitatively analyze the relationship between SP and commonsense knowledge. Currently, the largest commonsense knowledge dataset is the Open Mind Common Sense (OMCS) portion of the ConceptNet 5 (Speer and Havasi, 2012) knowledge base. OMCS contains 600k crowdsourced commonsense triplets such as (food, UsedFor, eat) and (wind, CapableOf, blow to east). All of the relations in OMCS are human-defined. In comparison, SP only relies on naturally occurring dependency relations, which can be accurately identified using existing parsing tools (Schuster and Manning, 2016).
We aim to demonstrate how SP relates to commonsense knowledge. Building relationships between SP and human-defined relations has two advantages: (1) we may be able to directly acquire commonsense knowledge through SP acquisition techniques; (2) we may be able to solve commonsense reasoning tasks from the perspective of SP, as illustrated by the two Winograd examples in Section 1. These advantages motivate exploring the potential of using SP to represent commonsense knowledge.

SP Pairs and OMCS Triplets
We hypothesize that the plausibility of an SP pair relates to how closely the pair aligns with human commonsense knowledge.As such, the more plausible pairs in SP-10K should be more likely to be covered by the OMCS dataset.
Using plausibility as our criterion, we split the 10,000 SP pairs into five groups: Perfect (8-10), Good (6-8), Normal (4-6), Unusual (2-4), and Impossible (0-2). As OMCS triplets contain phrases and SP pairs only contain words, we use two methods to match SP pairs with OMCS triplets. (1) Exact Match: we identify triplets in OMCS where the two arguments are exactly the same as the two words in an SP pair. (2) Partial Match: we identify triplets in OMCS where the two arguments contain the two words in an SP pair. We count SP pairs that fulfill either of these matching methods as covered by OMCS. Note that exact matches are not double-counted as partial matches.
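The exact/partial matching logic can be sketched as follows. This is an illustrative reading of the matching rules, with whitespace tokenization of OMCS arguments as an assumption:

```python
def match_omcs(sp_pair, omcs_triplets):
    """Classify an SP (word1, word2) pair against OMCS
    (arg1, relation, arg2) triplets: 'exact' if the two arguments equal
    the pair's two words, 'partial' if the arguments merely contain
    them as tokens, and None otherwise.  Exact matches are never
    downgraded to partial ones."""
    w1, w2 = sp_pair
    result = None
    for arg1, _, arg2 in omcs_triplets:
        if {arg1, arg2} == {w1, w2}:
            return "exact"
        if (w1 in arg1.split() and w2 in arg2.split()) or \
           (w1 in arg2.split() and w2 in arg1.split()):
            result = "partial"
    return result
```

A pair counts as covered by OMCS if this returns either label.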
As shown in Table 7, almost 50% of SP pairs in the perfect group are covered by OMCS.In contrast, only about 6% of SP pairs from the impossible group are covered.More plausible selectional preference pairs are more likely to be covered by OMCS, which supports our hypothesis of more plausible SP pairs being more closely aligned with human commonsense knowledge.

SP and Human-defined Relations
To show the connection between SP relations and human-defined relations, we visualize all matching (SP pair, OMCS triplet) tuples in Figure 2. A darker color indicates a greater number of matched tuples, which in turn suggests a stronger connection between the two relations.

Case Study
We present a selection of covered pairs from the perfect and impossible groups in Table 8 (more examples are provided in the appendix). For the perfect group, we find that human-defined commonsense triplets often have neatly corresponding SP pairs. On the other hand, for the impossible group, SP pairs are covered by OMCS either because of incidental overlap with a non-keyword, e.g., 'child' in 'child wagon', or because of the low quality of some OMCS triplets. This further illustrates that OMCS still has room for improvement and that SP may provide an effective way to improve commonsense knowledge.

Importance of Multi-hop SP
As introduced in Section 1, one novel contribution of this paper is the two-hop selectional preference relations 'nsubj amod' and 'dobj amod'. To demonstrate their effectiveness, we select a subset of the Winograd Schema Challenge dataset (Levesque et al., 2011) that requires two-hop selectional preference knowledge to solve; in total, we select 72 of the overall 285 questions. A selected Winograd question is defined as follows: given one sentence s containing two candidates (n_1, n_2) and one pronoun p, which is described with one adjective a, we need to find which candidate the pronoun refers to. One example is as follows: • Jim yelled at Kevin because he was so upset.
We need to correctly determine that 'he' refers to 'Jim' rather than 'Kevin'. These tasks are quite challenging, as neither the Stanford CoreNLP coreference system nor the current state-of-the-art end-to-end coreference model (Lee et al., 2018) can solve them. To approach this problem from the perspective of selectional preference (SP), we first parse the sentence and obtain the dependency relations related to the two candidates. If a candidate appears as the subject or the object of a verb h, we then check the SP score of the verb-adjective pair (h, a) under the relation 'nsubj amod' or 'dobj amod', respectively.
After that, we compare the SP scores of the two candidates and select the one with the higher score as the prediction. If they have the same SP score, we make no prediction. Table 9 shows the results of the collected human-labeled data ('SP-10K') and of the best-performing model, Posterior Probability (PP), trained on the Wikipedia corpus. From the results, we can see that SP-10K can solve this problem with very high precision. However, as we only label 4,000 multi-hop pairs, the overall coverage is limited. On the other hand, the automatic SP acquisition method PP can cover more questions, but its precision drops due to the noise in the collected SP knowledge. The experimental results show that if we can automatically build a good multi-hop SP model, we could make progress towards solving the hard pronoun coreference task, which is viewed as a vital task in natural language understanding.
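The decision rule above can be sketched as follows. This is an illustrative sketch, not the paper's code: the relation key format (e.g. 'nsubj_amod') and the example scores in the usage below are hypothetical stand-ins for SP-10K plausibility values.

```python
def resolve_pronoun(verb, adjective, roles, sp_score):
    """Pick the candidate whose two-hop SP score is higher.

    `roles` maps each of the two candidates to its grammatical role
    ('nsubj' or 'dobj') with respect to `verb`;
    `sp_score(verb, relation, adjective)` returns the plausibility of
    the adjective describing that argument of the verb.
    Returns None (no prediction) when the scores tie."""
    scored = {cand: sp_score(verb, role + "_amod", adjective)
              for cand, role in roles.items()}
    (c1, s1), (c2, s2) = scored.items()
    if s1 == s2:
        return None
    return c1 if s1 > s2 else c2
```

On the fish/worm example from Section 1, a model assigning a high score to (eat, nsubj_amod, hungry) and to (eat, dobj_amod, tasty) resolves 'it' to the fish in (A) and to the worm in (B).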

Related Work
As an important language phenomenon, SP is considered related to semantic fit (McRae et al., 1998) and has proved helpful in a series of downstream tasks including machine translation (Tang et al., 2016), sense disambiguation (Resnik, 1997), coreference resolution (Hobbs, 1978; Inoue et al., 2016; Zhang and Song, 2018), and semantic role classification (Zapirain et al., 2013).
Several algorithms attempt to acquire SP automatically from raw corpora (Resnik, 1997; Rooth et al., 1999; Erk et al., 2010; Santus et al., 2017). However, Mechura (2008) reveals that creating a high-quality SP model is difficult due to the noisiness and ambiguity of raw corpora. Several approaches attempt to address this issue by applying state-of-the-art word embeddings and neural networks to the automatic acquisition of SP (Levy and Goldberg, 2014; de Cruys, 2014). Despite these efforts, the quality of learned SP models remains questionable due to the shortcomings of existing SP acquisition evaluation methods.
Currently, the most popular evaluation method for SP acquisition is pseudo-disambiguation (Ritter et al., 2010; de Cruys, 2014).
However, pseudo-disambiguation can be easily influenced by the aforementioned noisiness of evaluation corpora and cannot represent ground-truth SP. Experiments in this paper show that a model that performs well on the pseudo-disambiguation task may not correlate well with human-labeled ground truth. As for the ground truth, there are three human-labeled SP evaluation sets (McRae et al., 1998; Keller and Lapata, 2003; Padó et al., 2006). These evaluation sets score SP pairs based on their plausibility as determined by human evaluators. However, these datasets are small. Compared to current evaluation methods, SP-10K is a large-scale human-annotated evaluation set containing 10,000 SP pairs over five SP relations.

Conclusion
In this work, we present SP-10K, a large-scale human-labeled evaluation set for selectional preference. Compared with other evaluation methods, SP-10K has much larger coverage and can better represent ground-truth SP. Two novel two-hop SP relations, 'dobj amod' and 'nsubj amod', are also introduced. We evaluate three representative SP acquisition methods with our dataset. We then demonstrate the potential of using SP to represent commonsense knowledge, which can benefit the acquisition and application of commonsense knowledge. Finally, we demonstrate the importance of the two-hop relations with a subset of the Winograd Schema Challenge.

Figure 1 :
Figure 1: Average annotation time per 100 questions. 'm' indicates minutes and 's' indicates seconds.

Table 1 :
Statistics of Human-labeled SP Evaluation Sets. #R, #W, and #P indicate the number of SP relation types, words, and pairs, respectively.

Table 2 :
Examples of candidate pairs for annotation. For ease of understanding, the order of head and dependent may differ across relations.

Table 3 :
Sampled SP pairs from SP-10K and their plausibility ratings. 'object' and 'subject' are placeholders to help understand the two-hop SP relations.

Table 6 :
Performance of different corpora and methods on SP-10K. Average Spearman ρ scores are reported. Statistically significant improvements (p < 0.005) over DS are marked, and † marks statistically significant improvements (p < 0.005) over NN. For each SP relation, rows represent different acquisition methods and columns represent different corpora. The best-performing model for each relation is shown in bold.

Table 7 :
Matching statistics of SP pairs by plausibility.

Table 8 :
Examples of OMCS-covered SP pairs and their corresponding OMCS triplets.

Table 9 :
Results of different models on the subset of the Winograd Schema Challenge. 'NA' means that the model cannot give a prediction, A_p means the accuracy on predicted examples (excluding NA examples), and A_o means the overall accuracy.