A Proposal for Linguistic Similarity Datasets Based on Commonality Lists

Similarity is a core notion that is used in psychology and in two branches of linguistics: theoretical and computational. The similarity datasets that come from the two fields differ in design: psychological datasets are focused on a certain topic such as fruit names, while linguistic datasets contain words from various categories. The latter makes humans assign low similarity scores both to words that have nothing in common and to words that contrast in meaning, making similarity scores ambiguous. In this work we discuss a similarity collection procedure for a multi-category dataset that avoids score ambiguity, and suggest changes to the evaluation procedure to reflect the insights of the psychological literature for word, phrase and sentence similarity. We suggest asking humans to provide a list of commonalities and differences instead of numerical similarity scores, and employing the structure of human judgements beyond pairwise similarity for model evaluation. We believe that the proposed approach will give rise to datasets that test meaning representation models more thoroughly with respect to the human treatment of similarity.


Introduction
Similarity is the degree of resemblance between two objects or events (Hahn, 2014) and plays a crucial role in psychological theories of knowledge and behaviour, where it is used to explain such phenomena as classification and conceptualisation. Fruit is a category because it is a practical generalisation. Fruits are sweet and are served as dessert, so when one is presented with an unknown fruit, one can hypothesise that it is served toward the end of a dinner.
Generalisations are extremely powerful in describing a language as well. The verb runs requires its subject to be singular. Verb, subject and singular are categories that are used to describe English grammar. When one encounters an unknown word and is told that it is a verb, one will immediately have an idea about how to use it assuming that it is used similarly to other English verbs.
The semantic formalisation of similarity is based on two ideas. The occurrence pattern of a word defines its meaning (Firth, 1957), while the difference in occurrence between two words quantifies the difference in their meaning (Harris, 1970). From a computational perspective, this motivates and guides the development of similarity components that are embedded into natural language processing systems that deal with tasks such as word sense disambiguation (Schütze, 1998), information retrieval (Salton et al., 1975; Milajevs et al., 2015), machine translation (Dagan et al., 1993), dependency parsing (Hermann and Blunsom, 2013; Andreas and Klein, 2014), and dialogue act tagging (Kalchbrenner and Blunsom, 2013; Milajevs and Purver, 2014).
Because it is difficult to measure the performance of a single (similarity) component in a pipeline, datasets that focus on similarity are popular among computational linguists. Apart from being a pragmatic attempt to alleviate the problems of evaluating similarity components, these datasets serve as an empirical test of the hypotheses of Firth and Harris, bringing together our understanding of the human mind, language and technology.
Two datasets, namely MEN (Bruni et al., 2012) and SimLex-999 (Hill et al., 2015), are currently widely used. They are designed specifically for the evaluation of meaning representations and surpass datasets stemming from psychology (Tversky and Hutchinson, 1986), information retrieval (Finkelstein et al., 2002) and computational linguistics (Rubenstein and Goodenough, 1965) in size and, in the case of SimLex-999, in attention to the evaluated relation, as it distinguishes similarity from relatedness. The datasets provide similarity (relatedness) scores between word pairs.
In contrast to linguistic datasets, which contain randomly paired words from a broad selection, datasets that come from psychology contain entries that belong to a single category such as verbs of judging (Fillenbaum and Rapoport, 1974) or animal terms (Henley, 1969). The reason for category-oriented similarity studies is that "stimuli can only be compared in so far as they have already been categorised as identical, alike, or equivalent at some higher level of abstraction" (Turner et al., 1987). Moreover, because of the extension effect (Medin et al., 1993), the similarity of two entries in a context is lower than the similarity between the same entries when the context is extended. "For example, black and white received a similarity rating of 2.2 when presented by themselves; this rating increased to 4.0 when black was simultaneously compared with white and red (red only increased 4.2 to 4.9)" (Medin et al., 1993). In the first case black and white are more dissimilar because they are located on the extremes of the greyscale, but in the presence of red they become more similar because they are both monochromes.
Both MEN and SimLex-999 provide pairs that do not share any similarity to control for false positives, but they do not control for the comparison scale. This makes similarity judgements ambiguous, as it is not clear what low similarity values mean: incomparable notions, contrast in meaning or even a difference in comparison context. SimLex-999 assigns low similarity scores to incomparable pairs (0.48, trick and size) and to antonyms (0.55, smart and dumb), but smart and dumb have much more in common than trick and size! The present contribution investigates how a similarity dataset with multiple categories should be built and considers what sentence similarity means in this context.

Dataset Construction
Human similarity judgements To build a similarity dataset that contains non-overlapping categories, one needs to avoid comparing incomparable pairs. However, that itself requires a priori knowledge of item similarity or category membership, making the problem circular.
To get out of this vicious circle, one might be tempted to refer to an already existing taxonomy such as WordNet (Miller, 1995). But in the case of similarity, as Turney (2012) points out, categories that emerge from similarity judgements are different from taxonomies. For example, traffic and water might be considered similar because of a functional similarity exploited in hydrodynamic models of traffic, but their lowest common ancestor in WordNet is entity.
Since there is no way of deciding upfront whether there is a similarity relation between two words, the data collection procedure needs to test for both the existence of a relation and its strength. Numerical values, as has been shown in the introduction, do not fit this role due to their ambiguity. One way to avoid the issue is not to ask humans for numerical similarity judgements, but instead to ask them to list commonalities and differences between the objects. As one might expect, similarity scores correlate with the number of listed commonalities (Markman and Gentner, 1991; Markman and Gentner, 1996; Medin et al., 1993). For incomparable pairs, the commonality list should be empty, but the differences will enumerate properties that belong to one entity but not to the other (Markman and Gentner, 1991; Medin et al., 1993).
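As a minimal sketch of how one such judgement could be recorded and read, consider the following Python fragment. The `Judgement` record, its field names and the listed features are hypothetical and invented for illustration; they do not come from any collected data. An empty commonality list marks an incomparable pair, and the number of commonalities serves as a rough proxy for the strength of the relation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Judgement:
    """One participant's answer for a word pair (hypothetical record layout)."""
    first: str
    second: str
    commonalities: List[str] = field(default_factory=list)
    differences: List[str] = field(default_factory=list)

    def comparable(self) -> bool:
        # A similarity relation exists only if at least one commonality was listed.
        return len(self.commonalities) > 0

    def strength(self) -> int:
        # Assumption: the commonality count stands in for similarity strength.
        return len(self.commonalities)


# Invented answers, used only to illustrate the two kinds of low-score pairs.
antonyms = Judgement(
    "smart", "dumb",
    commonalities=["describe intelligence", "apply to people"],
    differences=["opposite ends of the scale"],
)
unrelated = Judgement(
    "trick", "size",
    differences=["a trick is an action", "size is a measurement"],
)
```

Under this representation the antonym pair and the unrelated pair are no longer conflated: the former is comparable with a non-zero strength, the latter is not comparable at all.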
The verbally produced feature norms of McRae et al. (2005), collected for empirically derived conceptual representations, are a good example of what data should be collected and how. But in contrast to McRae et al. (2005), where explicit comparison of concepts was avoided, participants should be asked to produce commonalities as part of a similarity comparison.
The entries in the dataset So far, we have proposed a similarity judgement collection method that is robust to incomparable pairings. It also naturally gives rise to categories, because the absence of a relation between two entries means the absence of a common category. It still needs to be decided which words to include in the dataset.
To get a list of words that constitute the dataset, one might think of categories such as sports, fruits, vegetables, judging verbs, countries, colours and so on. Note that at this point it is acceptable to think of categories, because the arbitrary category assignments will be reevaluated later. Once the list of categories is ready, each of them is populated with category instances, e.g. plum, banana and lemon are all fruits.
When the data is prepared, humans are asked to provide commonalities and differences between all pairs within every group. First, all expected similarities are judged, producing a dataset that can be seen as a merged version of category-specific datasets. At this point, a good similarity model should provide meaning representations that are easily split into clusters: fruit members and sport members have to be separable.
Inter-category comparisons should also be performed, but because it is impractical to collect all possible pairwise judgements among hundreds of words, a reasonable sample should be taken. The inter-category comparisons will lead to unexpected category pairings, such as food, which contains vegetables and fruits, so the sampling procedure might be directed by the discovery of comparable pairs: when banana and potato are said to be similar, further members of the fruit and vegetable categories should be more likely to be assessed.
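The directed sampling of cross-category pairs can be sketched as follows. This is only one possible reading of the idea: the `PairSampler` class, the multiplicative `boost` factor and the example categories are assumptions made for illustration, not a procedure specified in this proposal. Each pair of categories carries a sampling weight, which grows whenever a participant reports a comparable pair drawn from those two categories.

```python
import random
from itertools import combinations


class PairSampler:
    """Samples cross-category word pairs, favouring category pairs for which
    comparable items have already been discovered (hypothetical helper)."""

    def __init__(self, categories, boost=2.0, seed=0):
        self.categories = categories  # category name -> list of words
        self.boost = boost            # assumed multiplicative reward
        self.rng = random.Random(seed)
        # Category pairs are stored in sorted order; all start equally likely.
        self.category_pairs = list(combinations(sorted(categories), 2))
        self.weights = {pair: 1.0 for pair in self.category_pairs}

    def sample(self):
        """Draw one cross-category pair as ((category, word), (category, word))."""
        weights = [self.weights[pair] for pair in self.category_pairs]
        cat_a, cat_b = self.rng.choices(self.category_pairs, weights=weights, k=1)[0]
        word_a = self.rng.choice(self.categories[cat_a])
        word_b = self.rng.choice(self.categories[cat_b])
        return (cat_a, word_a), (cat_b, word_b)

    def report(self, cat_a, cat_b, comparable):
        """Record whether a judged pair from two categories had commonalities."""
        if comparable:
            key = tuple(sorted((cat_a, cat_b)))
            self.weights[key] *= self.boost


sampler = PairSampler({
    "fruits": ["plum", "banana", "lemon"],
    "vegetables": ["potato", "carrot"],
    "sports": ["tennis", "rowing"],
})
# A participant listed commonalities for banana and potato.
sampler.report("vegetables", "fruits", comparable=True)
next_pair = sampler.sample()
```

After the report, fruit and vegetable pairings are twice as likely to be drawn as any other pairing of categories, while the remaining pairings are still sampled occasionally so that new comparable pairs can be discovered.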
Given the dynamic nature of score collection, we suggest setting up a game with a purpose (see Venhuizen et al. (2013) for an example) where players are rewarded for contributing their commonality lists. Another option would be to crowdsource the human judgements (Keuleers and Balota, 2015).
Evaluation beyond proximity Human judgements validate the initial category assignment of items and provide new ones. If a category contains a superordinate, similarity judgements arrange category members around it (Tversky and Hutchinson, 1986). For example, similarity judgements given by humans arrange fruit names around the word fruit in such a way that it is their nearest neighbour, making fruit the focal point of the category of fruits.
As an additional evaluation method, the model should be able to retrieve focal points. Therefore, a precaution should be taken before human judgement collection. If possible, categories should contain a superordinate.
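One way to operationalise focal point retrieval is sketched below: for every category member, check whether its nearest neighbour among the other members and the superordinate is the superordinate itself. The function names, the use of cosine similarity and the two-dimensional toy vectors are assumptions made for illustration; a real evaluation would use the vectors of the model under test.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def focal_point_rate(vectors, members, superordinate):
    """Fraction of members whose nearest neighbour is the superordinate."""
    candidates = members + [superordinate]
    hits = 0
    for word in members:
        others = [c for c in candidates if c != word]
        nearest = max(others, key=lambda c: cosine(vectors[word], vectors[c]))
        hits += nearest == superordinate
    return hits / len(members)


# Toy vectors, invented so that "fruit" lies between its two members.
toy_vectors = {
    "fruit": [1.0, 1.0],
    "apple": [1.0, 0.5],
    "banana": [0.5, 1.0],
}
rate = focal_point_rate(toy_vectors, ["apple", "banana"], "fruit")
```

A model that arranges category members around their superordinate obtains a rate close to 1, whereas a model that places the superordinate at the periphery of the category obtains a rate close to 0.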
Similarity evaluation needs to focus on how well a model is able to recover human similarity intuitions expressed as groupings, possibly around their focal points. We propose to treat it as a soft multi-class clustering problem (White et al., 2015), where two entities belong to the same class if there is a similarity judgement for them (e.g. apple and banana are similar because they are fruits) and the strength is proportional to the number of such judgements, so we could express that apple is more a fruit than it is a company.
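The soft class assignment can be illustrated with a small sketch. It assumes that every similarity judgement is reduced to a triple of two words and a label naming the commonality behind the judgement; this reduction, the function name and the normalisation by the total number of judgements per word are illustrative choices rather than a fixed part of the proposal.

```python
from collections import Counter, defaultdict


def soft_memberships(judgements):
    """Turn (word_a, word_b, label) triples into per-word class strengths.

    The strength of a class for a word is proportional to the number of
    judgements that link the word to another word under that label.
    """
    counts = defaultdict(Counter)
    for word_a, word_b, label in judgements:
        counts[word_a][label] += 1
        counts[word_b][label] += 1
    memberships = {}
    for word, labels in counts.items():
        total = sum(labels.values())
        memberships[word] = {label: n / total for label, n in labels.items()}
    return memberships


# Invented judgements, for illustration only.
example = [
    ("apple", "banana", "fruit"),
    ("apple", "plum", "fruit"),
    ("apple", "lemon", "fruit"),
    ("apple", "google", "company"),
]
memberships = soft_memberships(example)
```

With these invented judgements, apple receives a higher strength for fruit than for company, while banana belongs to the fruit class only. Model predictions could then be compared against such soft assignments rather than against a single correlation score.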
In contrast to the current evaluation based on correlation, models also need to be tested on the geometric arrangement of subordinates around the focal points, as proximity-based evaluation alone does not capture this (Tversky and Hutchinson, 1986).

Sentence Similarity
The question of sentence similarity is more complex because sentences are in many ways different entities from words. Or are they? Recent work in linguistics often points toward a continuum that exists between words and sentences (Jackendoff, 2012). Jackendoff and Pinker (2005), for example, point out that there is good evidence that "human memory must store linguistic expressions of all sizes." These linguistic expressions of variable size are often called constructions. Several computational approaches to constructions have been proposed (Gaspers et al., 2011; Chang et al., 2012), but to the best of the authors' knowledge they do not yet feature prominently in natural language processing.
To be able to measure the similarity of phrases and sentences in the proposed framework, we need to be able to identify what could serve as commonalities between them. So what are they? First of all, words, sentences and other constructions draw attention to states of affairs around us. Also, sentences are similar to others with respect to the functions they perform (Winograd, 1983, p. 288).
Prototype effects As Tomasello (2009) points out, speakers of English can make sense of phrases like X floosed Y the Z and X was floosed by Y. This is due to their similarity to sentences such as John gave Mary the book and Mary was kissed by John respectively. Thus, X floosed Y the Z is clearly a transfer of possession or dative (Bresnan et al., 2007).
The degree to which sentences are similar corresponds, at least to a certain extent, to the function of a given sentence (especially the ideational function (Winograd, 1983, p. 288)). Tomasello (1998) points out that sentence-level constructions show prototype effects similar to those discussed above for lexical systems (e.g. colours). Consider the following sentences:
• John gave Mary the book. is an example of an Agent Causes Transfer construction. These are usually built around words such as give, pass, hand, toss, bring, etc.
• John promised Mary the book. is an example of a Conditional Transfer construction. These are usually built around words such as promise, guarantee, owe, etc.
As soon as one has such a prototype network, one can decide sentence similarity, as one can say with respect to which prototypes sentences and utterances are similar. In this case, a common sentence prototype serves the same role as a commonality between words.
Similarity in context However, prototype categories work on the semantic-grammatical level, and might be handled by similarity in context: a noun phrase can be similar to a noun, as in female lion and lioness, and to another noun phrase, as in yellow car and cheap taxi. The same similarity principle can be applied to phrases as to words. In this case, similarity is measured in context, but it is still a comparison of the phrases' head words, whose meaning is modified by the arguments they appear with (Kintsch, 2001; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010; Dinu and Lapata, 2010; Baroni and Zamparelli, 2010; Thater et al., 2011; Séaghdha and Korhonen, 2011). With verbs, this idea can be applied to compare transitive verbs with intransitive ones. For example, to cycle is similar to to ride a bicycle.
Sentential similarity might be treated as the similarity of the heads in their contexts, that is, the similarity between sees and notices in John sees Mary and John notices a woman. This approach abstracts away from grammatical differences between the sentences and concentrates on semantics. It fits the proposed model because the respect in which the heads, which are lexical entities, are similar has to be found (Corbett and Fraser, 1993).

Attention attraction
But still, what about pragmatics? As Steels (2008) points out, sentences and words direct attention and do not always directly point or refer to entities and actions in the world. For example, he points to the fact that if a person asks another person to pass the wine they are actually asking for the bottle. The speaker just attracts attention to an object of perception in a given situation.

Grammaticalisation and lexicalisation
There are several ways in which a sentence can both be grammaticalised and lexicalised. For example, No and I've seen John eating them are similar sentences because they lexicalise the same answer to the question Do we have cookies? More generally, this gives rise to dialogue act tags: for another way of utterance categorisation, refer to the work of Kalchbrenner and Blunsom (2013) and Milajevs and Purver (2014).
Thus, the questions that sentences answer are valid respects for explaining similarity, as are entailment, paraphrase (White et al., 2015) or spatial categories (Ritter et al., 2015). This also motivates the approach of treating sentences on their own and encoding the meaning of a sentence into a vector in such a way that similar sentences are clustered together (Coecke et al., 2010; Baroni et al., 2014; Socher et al., 2012; Wieting et al., 2015; Hill et al., 2016).
Discourse fit If one conceptualises sentence similarity with respect to a discourse, then one might ask how different sentences fit into such a discourse. Griffiths et al. (2015) tried to construct two versions of the same dialogue using a bottom-up method. They deconstructed a certain dialogue in a given domain (a receptionist scenario) into greetings, directions and farewells. They used a small custom-made corpus for this purpose and created the two dialogues by having people rate the individual utterances by friendliness. The resulting two dialogues were surprisingly uneven. The dialogue was supposed to give instructions to a certain location within a building. The "friendly version" was very elaborate and consisted of several sentences:
(1) The questionnaire is located in room Q2-102. That is on the second floor. If you turn to your right and walk down the hallway. At the end of the floor you will find the stairs. Just walk up the stairs to the top floor and go through the fire door. The room is then straight ahead.
The sentence which served the same purpose in the "neutral version" was a fairly simple sentence:
(2) The questionnaire is located in Q2-102.
Often the same function of a given sentence in a dialogue can be performed by as little as one word or several phrases or a different sentence or even a complete story.
Language sub-systems and strategies Steels (2010) introduces the idea of language subsystems and language strategies. A language subsystem is the means of expressing certain related or similar meanings. Examples of such subsystems include:
• Lexical systems which express colours.
• Morphological devices to encode tenses.
• Usage of word order to express relations between agent and patient.
The latter is an illustration of a language strategy. In English, agent-patient relations are mainly encoded by syntax, whereas German would use intonation and a combination of articles and case to convey the same information. Russian, in contrast, will use morphological devices for the same purpose. Hence, for some purposes the entities which are similar may not belong to clearly delineated categories such as "word" or "sentence" but may be chunks of language which belong to the same sub-system.
Above we identified seven criteria by which sentence similarity can be compared. The instructions for the sentence similarity judgement tasks may incorporate the criteria as hints for human participants during data collection.

Conclusion
In this contribution we discussed the notion of similarity from an interdisciplinary perspective. We contrasted properties of the similarity relation described in the field of psychology with the characteristics of similarity datasets used in computational linguistics. This led to recommendations on how to improve the latter by removing low-score ambiguity in a multi-category similarity dataset.
In the future, a multi-category similarity dataset should be built that allows evaluation of vector space models of meaning by measuring not only the proximity between points, but also their arrangement with respect to clusters. The same ideas can be used to build phrase- and sentence-level datasets. However, we leave the exact selection of sentence similarity criteria for future work in this area.
From a broader perspective, this work highlights psychological phenomena that, once incorporated into models of meaning, are expected to improve their performance.