Polish evaluation dataset for compositional distributional semantics models

The paper presents a procedure of building an evaluation dataset for the validation of compositional distributional semantics models estimated for languages other than English. The procedure generally builds on steps designed to assemble the SICK corpus, which contains pairs of English sentences annotated for semantic relatedness and entailment, because we aim at building a comparable dataset. However, the implementation of particular building steps differs significantly from the original SICK design assumptions, which is caused by both the lack of necessary external resources for the investigated language and the need for language-specific transformation rules. The designed procedure is verified on Polish, a fusional language with a relatively free word order, and contributes to building a Polish evaluation dataset. The resource consists of 10K sentence pairs which are human-annotated for semantic relatedness and entailment. The dataset may be used for the evaluation of compositional distributional semantics models of Polish.

1 Introduction and related work

Distributional semantics
The basic idea of distributional semantics, i.e. determining the meaning of a word based on its co-occurrence with other words, is derived from the empiricists - Harris (1954) and Firth (1957). John R. Firth drew attention to the context-dependent nature of meaning especially with his famous maxim "You shall know a word by the company it keeps" (Firth, 1957, p. 11).
1 The dataset is obtainable at: http://zil.ipipan.waw.pl/Scwad/CDSCorpus
Nowadays, distributional semantics models are estimated with various methods, e.g. word embedding techniques (Bengio et al., 2003, 2006; Mikolov et al., 2013). To determine the meaning of a word, e.g. bath, one can use the context of other words that surround it. If we assume that the meaning of this word expressed by its lexical context is associated with a distributional vector, the distance between distributional vectors of two semantically similar words, e.g. bath and shower, should be smaller than between vectors representing semantically distinct words, e.g. bath and tree.
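The distance comparison described above can be illustrated with a minimal sketch. The three-dimensional vectors are hypothetical toy values chosen for illustration, not vectors estimated from a corpus, and cosine distance is only one of several possible distance measures.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v); smaller values mean more similar directions."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption: invented for this illustration).
vectors = {
    "bath": [0.9, 0.8, 0.1],
    "shower": [0.8, 0.9, 0.2],
    "tree": [0.1, 0.2, 0.9],
}

d_bath_shower = cosine_distance(vectors["bath"], vectors["shower"])
d_bath_tree = cosine_distance(vectors["bath"], vectors["tree"])

# For these toy vectors, bath is closer to shower than to tree.
print(d_bath_shower < d_bath_tree)
```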

Compositional distributional semantics
Based on empirical observations that distributional vectors encode certain aspects of word meaning, it is expected that similar aspects of the meaning of phrases and sentences can also be represented with vectors obtained via composition of distributional word vectors. The idea of semantic composition is not new. It is well known as the principle of compositionality: 2 "The meaning of a compound expression is a function of the meaning of its parts and of the way they are syntactically combined." (Janssen, 2012, p. 19).
Modelling the meaning of textual units larger than words using compositional and distributional information is the main subject of compositional distributional semantics (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012, to name a few studies). The fundamental principles of compositional distributional semantics, henceforth referred to as CDS, are mainly disseminated through papers written on the topic. Apart from the papers, it was the SemEval-2014 Shared Task 1 that essentially contributed to the expansion of CDS and increased interest in this domain. The goal of the task was to evaluate CDS models of English in terms of semantic relatedness and entailment on proper sentences from the SICK corpus.
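As a rough illustration of what a composition function may look like, the sketch below implements two of the simplest options discussed in the cited literature (Mitchell and Lapata, 2010): element-wise addition and element-wise multiplication of word vectors. The word vectors are hypothetical toy values, and the models evaluated in the shared task are considerably more elaborate.

```python
def compose_additive(u, v):
    """Element-wise sum of two word vectors."""
    return [x + y for x, y in zip(u, v)]

def compose_multiplicative(u, v):
    """Element-wise product of two word vectors."""
    return [x * y for x, y in zip(u, v)]

# Hypothetical toy vectors for the Polish words 'kot' (cat) and 'siedzi' (sits).
kot = [1.0, 2.0, 0.0]
siedzi = [0.5, 1.0, 3.0]

phrase_add = compose_additive(kot, siedzi)
phrase_mul = compose_multiplicative(kot, siedzi)
print(phrase_add, phrase_mul)
```

Both functions ignore word order, which is one reason why syntax-aware composition methods are of interest for a language with a relatively free word order.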

The SICK corpus
The SICK corpus consists of 10K pairs of English sentences containing multiple lexical, syntactic, and semantic phenomena. It builds on two external data sources: the 8K ImageFlickr dataset (Rashtchian et al., 2010) and the SemEval-2012 Semantic Textual Similarity dataset (Agirre et al., 2012). Each sentence pair is human-annotated for relatedness in meaning and entailment.
The relatedness score corresponds to the degree of semantic relatedness between two sentences and is calculated as the average of ten human ratings collected for this sentence pair on the 5-point Likert scale. This score indicates the extent to which the meanings of two sentences are related.
The entailment relation between two sentences, in turn, is labelled with entailment, contradiction, or neutral. According to the SICK guidelines, the label assigned by the majority of human annotators is selected as the valid entailment label.
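The two aggregation rules just described (mean rating and majority label) can be sketched as follows. The ratings and labels are hypothetical, and the handling of ties is an assumption of this sketch: `Counter.most_common` simply returns the label encountered first among equally frequent ones.

```python
from collections import Counter
from statistics import mean

def relatedness_score(ratings):
    """Average of the individual human ratings for one sentence pair."""
    return mean(ratings)

def gold_entailment_label(labels):
    """Label chosen by the majority of annotators (ties are not resolved specially)."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

# Hypothetical annotations for a single sentence pair.
ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]          # ten ratings on a 5-point scale
labels = ["entailment", "entailment", "neutral"]  # hypothetical label votes

score = relatedness_score(ratings)
gold = gold_entailment_label(labels)
print(score, gold)
```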

Motivation and organisation of the paper
Studying approaches to various natural language processing (henceforth NLP) problems, we have observed that the availability of language resources (e.g. training or testing data) stimulates the development of NLP tools and the estimation of NLP models. English is undoubtedly the most prominent in this regard and English resources are the most numerous. Therefore, NLP methods are mostly designed for English and tested on English data, even if there is no guarantee that they are universal. In order to verify whether an NLP algorithm is adequate, it is not enough to evaluate it solely for English. It is also valuable to have high-quality resources for languages typologically different from English. Hence, we aim at building datasets for the evaluation of CDS models in languages other than English, which are often under-resourced. We strongly believe that the availability of test data will encourage the development of CDS models in these languages and allow the universality of CDS methods to be tested better.
We start with a high-quality dataset for Polish, which differs from English in at least two dimensions. First, it is a rather under-resourced language in contrast to the resource-rich English. Second, it is a fusional language with a relatively free word order in contrast to the isolating English with a relatively fixed word order. If a heuristic is tested on e.g. Polish, the evaluation results can be approximately generalised to other Slavic languages. We hope the Slavic NLP community will be interested in designing and evaluating methods of semantic modelling for Slavic languages.
The procedure of building an evaluation dataset for validating compositional distributional semantics models of Polish generally builds on steps designed to assemble the SICK corpus (described in Section 1.3) because we aim at building an evaluation dataset which is comparable to the SICK corpus. However, the implementation of particular building steps differs significantly from the original SICK design assumptions, which is caused by both the lack of necessary external resources for Polish (see Section 2.1) and the need for Polish-specific transformation rules (see Section 2.2). Furthermore, the rules of arranging sentences into pairs (see Section 2.3) are defined anew taking into account the characteristics of the data and bidirectional entailment annotations, since an entailment relation between two sentences need not be symmetric. Even if our assumptions of annotating sentence pairs coincide with the SICK principles to a certain extent (see Section 3.1), the annotation process differs from the SICK procedure, in particular by introducing an element of human verification of correctness of automatically transformed sentences (see Section 3.2) and some additional post-corrections (see Section 3.3). Finally, a summary of the dataset is provided in Section 4.1 and the dataset evaluation is given in Section 4.2.

2 Procedure of collecting data

2.1 Selection and description of images
The first step of building the SICK corpus consisted in the random selection of English sentence pairs from existing datasets (Rashtchian et al., 2010; Agirre et al., 2012). Since we are not aware of analogous resources being available for Polish, we have to select images first and then describe the selected images.
Images are selected from the 8K ImageFlickr dataset (Rashtchian et al., 2010). At first we wanted to take only those images whose descriptions were selected for the SICK corpus. However, a cursory check shows that these images are quite homogeneous, with a predominant number of depictions of dogs. Therefore, we independently extract 1K images and split them into 46 thematic groups (e.g. children, musical instruments, motorbikes, football, dogs). The numbers of images within individual thematic groups vary from 6 images in the volleyball and telephoning groups to 94 images in the various people group. The second largest groups are children and dogs with 50 images each. The chosen images are given to two authors who, independently of each other, formulate their descriptions based on a short instruction. The authors are instructed to write one single sentence (with a sentence predicate) describing the action in a displayed image. They should not describe an imaginable context or an interpretation of what may lie behind the scene in the picture. If some details in the picture are not obvious, they should not be described either. Furthermore, the authors should avoid multiword expressions, such as idioms, metaphors, and named entities, because those are not compositional linguistic phenomena. Finally, descriptions should contain Polish diacritics and proper punctuation.

Transformation of descriptions
The second step of building the SICK corpus consisted in pre-processing extracted sentences, i.e. normalisation and expansion (Bentivogli et al., 2014, p. 3-4). Since the authors of Polish descriptions are asked to follow the guidelines (presented in Section 2.1), the normalisation step is not essential for our data. The expansion step, in turn, is implemented and the sentences provided by the authors are lexically and syntactically transformed in order to obtain derivative sentences with similar, contrastive, or neutral meanings. The following transformations are implemented:
1. dropping conjunction concerns sentences with coordinated predicates sharing a subject, e.g. Rowerzysta odpoczywa i obserwuje morze. (Eng. 'A cyclist is resting and watching the sea.'). The finite form of one of the coordinated predicates is transformed into:
• an active adjectival participle, e.g. Odpoczywający rowerzysta obserwuje
The first five transformations are designed to produce sentences with a similar meaning, the sixth transformation outputs sentences with a contradictory meaning, and the seventh transformation should generate sentences with a neutral (or unrelated) meaning. All transformations are performed on the dependency structures of input sentences (Wróblewska, 2014). Some of the transformations are very productive (e.g. mixing dependents). Others, in turn, are sparsely represented in the output (e.g. dropping conjunction). The number of transformed sentences randomly selected to build the dataset is in the second column of

Data ensemble
The final step of building the SICK corpus consisted in arranging normalised and expanded sentences into pairs. Since our data diverges from SICK data, the process of arranging Polish sentences into pairs also differs from pairing in the SICK corpus. The general idea behind the pair-ensembling procedure was to introduce sentence pairs with different levels of relatedness into the dataset. Apart from pairs connecting two sentences originally written by humans (as described in Section 2.1), there are also pairs in which an original sentence is connected with a transformed sentence. For each of the 1K images, the following 10 pairs are constructed (for A being the set of all sentences originally written by the first author, B being the set of all sentences originally written by the second author, a ∈ A and b ∈ B being the original descriptions of the picture):
1. (a, b)
2. (a, a1), where a1 ∈ t(a), and t(a) is the set of all transformations of the sentence a
10. (a, a5), where a5 ∈ t(a), a5 ≠ a1 for 50% of the images, (b, b5) (analogously) for the other 50%. 5
For each sentence pair (a, b) created according to this procedure, its reverse (b, a) is also included in our corpus. As a result, the working set consists of 20K sentence pairs.
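The size of the working set follows from the figures above: 1K images, 10 pairs per image, and one reverse for every pair. A minimal sketch of the doubling step is given below; the string identifiers are hypothetical placeholders standing in for actual sentences, and the function name is introduced here only for illustration.

```python
def add_reverse_pairs(pairs):
    """Return the original pairs followed by their reverses (b, a)."""
    return list(pairs) + [(b, a) for a, b in pairs]

N_IMAGES = 1000        # 1K images, as stated in the text
PAIRS_PER_IMAGE = 10   # 10 pairs constructed per image

# Placeholder identifiers instead of real sentences (assumption of this sketch).
pairs = [
    (f"img{i}-pair{j}-first", f"img{i}-pair{j}-second")
    for i in range(N_IMAGES)
    for j in range(PAIRS_PER_IMAGE)
]

working_set = add_reverse_pairs(pairs)
print(len(working_set))
```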

3 Corpus annotation

3.1 Annotation assumptions
The degree of semantic relatedness between two sentences is calculated as the average of all human ratings on the Likert scale with the range from 0 to 5. Since we do not want to excessively influence the annotations, the guidelines given to annotators are mainly example-based: 6
• 5 (very related): Kot siedzi na płocie. (Eng. 'A cat is sitting on the fence.') vs. Na płocie jest duży kot. (Eng. 'There is a large cat on the fence.'),
Apart from these examples, there is a note in the annotation guidelines indicating that the degree of semantic relatedness is not equivalent to the degree of semantic similarity. Semantic similarity is only a special case of semantic relatedness; semantic relatedness is thus the more general term.
Polish entailment labels correspond directly to the SICK labels (i.e. entailment, contradiction, neutral). The entailment label assigned by the majority of human judges is selected as the gold label. The entailment labels are defined as follows:
• a wynika z b (b entails a) - if a situation or an event described by sentence b occurs, it is recognised that a situation or an event described by a occurs as well, i.e. a and b refer to the same event or the same situation,
• a jest zaprzeczeniem b (a is the negation of b) - if a situation or an event described by b occurs, it is recognised that a situation or an event described by a cannot occur at the same time,
• a jest neutralne wobec b (a is neutral to b) - the truth of a situation described by a cannot be determined on the basis of b.
3 The thematic group of a sentence a corresponds to the thematic group of an image being the source of a (as described in Section 2.1).
4 The pairs (a, a4) of the same authors' descriptions of two images from different thematic groups are expected to be unrelated. The same applies to (b, b4).
5 A repetition of point 2 with a restriction that a different pair is created (pairs of very related sentences are expected). We alternate between authors A and B to obtain equal author proportions in the final ensemble of pairs.

Annotation procedure
Similar to the SICK corpus, each Polish sentence pair is human-annotated for semantic relatedness and entailment by 3 human judges experienced in Polish linguistics. 7 Since for each annotated pair (a, b), its reverse (b, a) is also subject to annotation, the entailment relation is in practice determined 'in both directions' for 10K sentence pairs. For the task of relatedness annotation, the order of sentences within pairs seems to be irrelevant; we can thus assume that we obtain 6 relatedness scores for each of the 10K unique pairs. Since the transformation process is fully automatic and to a certain extent based on imperfect dependency parsing, we cannot ignore errors in the transformed sentences. In order to avoid annotating erroneous sentences, the annotation process is divided into two stages:
1. a sentence pair is sent to a judge with the leader role, who is expected to edit and to correct the transformed sentence from this pair before annotation, if necessary,
2. the verified and possibly enhanced sentence pair is sent to the other two judges, who can only annotate it.
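The way the annotations of (a, b) and (b, a) combine into one record per unique pair can be sketched as follows. This is a minimal illustration under stated assumptions: the annotation tuples are hypothetical, the function names are introduced here, and ties in the majority vote are not resolved in any special way.

```python
from collections import Counter
from statistics import mean

def majority(labels):
    """Most frequent label among the judges (no special tie handling)."""
    return Counter(labels).most_common(1)[0][0]

def merge_directions(ann_ab, ann_ba):
    """Combine three (score, label) annotations of (a, b) with three of (b, a).

    Returns the mean of all six relatedness scores and a pair of
    entailment labels, one per direction.
    """
    scores = [s for s, _ in ann_ab] + [s for s, _ in ann_ba]
    label_pair = (
        majority([lab for _, lab in ann_ab]),
        majority([lab for _, lab in ann_ba]),
    )
    return mean(scores), label_pair

# Hypothetical annotations by three judges for each direction.
ann_ab = [(4, "entailment"), (5, "entailment"), (4, "neutral")]
ann_ba = [(4, "neutral"), (4, "neutral"), (3, "neutral")]

relatedness, label_pair = merge_directions(ann_ab, ann_ba)
print(relatedness, label_pair)
```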
The leader judges should correct incomprehensible and ungrammatical sentences with a minimal number of necessary changes. Unusual sentences which could be accepted by Polish speakers should not be modified. Moreover, the modified sentence may not be identical with the other sentence in the pair. The classification and statistics of distinct corrections made by the leader judges are provided in Table 2. A strict classification of error types is quite hard to provide because some sentences contain more than one error. We thus order the error types from the most serious errors (i.e. 'sense' errors) to the redundant corrections (i.e. 'other' type). If a sentence contains several errors, it is assigned to the most serious of the applicable error types.
In the case of sentences with 'sense' errors, the need for correction is uncontroversial and arises from an internal logical contradiction. 8 The sentences with 'semantic' changes are syntactically correct, but deemed unacceptable by the leader annotators from the semantic or pragmatic point of view. 9 The 'grammatical' errors mostly concern missing agreement. 10 The majority of 'word order' corrections are unnecessary, but we found some examples which can be classified as actual word or phrase order errors. 11 The correction of punctuation consists in adding or deleting a comma. 12 The sentences in the 'other' group, in turn, could as well have been left unchanged because they are proper Polish sentences, but were apparently considered odd by the leader annotators.
8 An example of 'sense' error: the sentence Chłopak w zielonej bluzie i czapce zjeżdża na rolkach na leżąco. (Eng. 'A boy in a green sweatshirt and a cap roller-skates downhill in a lying position.') is corrected into Chłopak w zielonej bluzie i czapce zjeżdża na rolkach. (Eng. 'A boy in a green sweatshirt and a cap roller-skates downhill.').
9 An example of 'semantic' correction: the sentence Dziewczyna trzyma w pysku patyk. (Eng. 'A girl holds a stick in her muzzle.') is corrected into Dziewczyna trzyma w ustach patyk. (Eng. 'A girl holds a stick in her mouth.').
11 An example of word order error: the sentence Samochód, który jest uszkodzony, koloru białego stoi na lawecie dużego auta. (lit. 'A car that is damaged, of the white color stands on the trailer of a large car.', Eng. 'A white car that is damaged is standing on the trailer of a large car.') is corrected into Samochód koloru białego, który jest uszkodzony, stoi na lawecie dużego auta.
12 An example of punctuation correction: the wrong comma in the sentence Nad brzegiem wody, stoją dwaj mężczyźni z wędkami. (lit. 'On the water's edge, two men are standing with rods.'; Eng. 'Two men with rods are standing on the water's edge.') should be deleted, i.e. Nad brzegiem wody stoją dwaj mężczyźni z wędkami.

Impromptu post-corrections
During the annotation process it turned out that sentences accepted by some human annotators were unacceptable to other annotators. We thus decided to collect annotators' comments and suggestions for improving sentences. After validation of these suggestions by an experienced linguist, it turned out that most of these proposals concern punctuation errors (e.g. missing comma) and typos in 312 distinct sentences. These errors are fixed directly in the corpus because they should not impact the annotations of sentence pairs. The other suggestions concern more significant changes in 29 distinct sentences (mostly minor grammatical or semantic problems overlooked by the leader annotators). The pairs with modified sentences are resent to the annotators so that they can verify and update their annotations.

4 Corpus summary and evaluation

4.1 Corpus statistics
Tables 3 and 4 summarise the annotations of the resulting corpus of 10K sentence pairs. Table 3 aggregates the occurrences of the 6 possible relatedness scores, calculated as the mean of all 6 individual annotations, rounded to an integer.

relatedness   # of pairs
0             1978
1             1428
2             1082
3             2159
4             2387
5             966

Table 3: Final relatedness scores rounded to integers (total: 10K pairs).

Table 4 shows the number of the particular entailment labels in the corpus. Since each sentence pair is annotated for entailment in both directions, the final entailment label is actually a pair of two labels:
While the actual corpus labels are ordered in the sense that there is a difference between e.g. entailment+neutral and neutral+entailment (the entailment occurs in different directions), we treat all labels as unordered for the purpose of this summary (e.g. entailment+neutral covers neutral+entailment as well, representing the same type of relation between two sentences).
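Treating the label pairs as unordered amounts to mapping each ordered pair to a canonical form before counting. In the sketch below the canonical form is obtained by alphabetical sorting, which is an assumption of this illustration (any fixed ordering would do), and the label pairs are hypothetical.

```python
from collections import Counter

def unordered(label_pair):
    """Canonical name of a label pair, independent of direction."""
    return "+".join(sorted(label_pair))

# Hypothetical ordered label pairs: (label for (a, b), label for (b, a)).
ordered_pairs = [
    ("entailment", "neutral"),
    ("neutral", "entailment"),
    ("neutral", "neutral"),
    ("contradiction", "contradiction"),
]

summary = Counter(unordered(p) for p in ordered_pairs)
print(summary)
```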

Inter-annotator agreement
The standard measure of inter-annotator agreement in various natural language labelling tasks is Cohen's kappa (Cohen, 1960). However, this coefficient is designed to measure agreement between two annotators only. Since there are three annotators of each ordered pair of sentences, we decided to apply Fleiss' kappa 13 (Fleiss, 1971), designed for measuring agreement between multiple raters who give categorical ratings to a fixed number of items. An additional advantage of this measure is that different items can be rated by different human judges, which does not impact measurement. The normalised Fleiss' measure of inter-annotator agreement is:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

where the quantity $\bar{P} - \bar{P}_e$ measures the degree of agreement actually attained in excess of chance, while "[t]he quantity $1 - \bar{P}_e$ measures the degree of agreement attainable over and above what would be predicted by chance" (Fleiss, 1971, p. 379).
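A minimal sketch of how Fleiss' kappa may be computed is given below. It assumes that every item is rated by the same number of raters and that the input is a table of counts (one row per item, one column per category); the function name and the toy counts are introduced here for illustration and do not come from the corpus.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.

    Assumes the same number of raters for every item.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Proportion of all assignments that went to each category.
    p = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items
    # Agreement expected by chance.
    p_e = sum(x * x for x in p)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical counts: 4 items, 3 raters, 2 categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```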
We recognise Fleiss' kappa as particularly useful for measuring inter-annotator agreement with respect to entailment labelling in our evaluation dataset. First, there are more than two raters. Second, entailment labels are categorical. Measured with Fleiss' kappa, there is an inter-annotator agreement of κ = 0.734 for entailment labels in the Polish evaluation dataset, which is quite satisfactory for a semantic labelling task.
13 As Fleiss' kappa is actually the generalisation of Scott's π (Scott, 1955), it is sometimes referred to as Fleiss' multi-π, cf. Artstein and Poesio (2008).
With respect to semantic relatedness, the distinction in meaning of two sentences made by human judges is often very subtle. This is also reflected in the inter-annotator agreement scores measured with Fleiss' kappa. Inter-annotator agreement measured for six semantic relatedness groups corresponding to points on the Likert scale is quite low: κ = 0.337. If we measure inter-annotator agreement for three classes corresponding to the three relatedness groups from the annotation guidelines (see Section 3.1), i.e. <0>, <1, 2, 3, 4>, and <5>, the Fleiss' score is considerably higher: κ = 0.543. Fleiss' kappa treats the scores as unordered categories, so a disagreement between adjacent scores counts as much as a disagreement between distant ones. Hence, we conclude that Fleiss' kappa is not a reliable measure of inter-annotator agreement in relation to relatedness scores. Therefore, we decided to use Krippendorff's α instead.
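The regrouping of the six Likert points into the three classes <0>, <1, 2, 3, 4>, and <5> can be sketched as a simple transformation of per-item rating counts. The count rows below are hypothetical; the first one shows how two raters choosing different but adjacent scores disagree in the six-way setting and agree after collapsing.

```python
def collapse_counts(row):
    """Map six counts (scores 0..5) to three counts for <0>, <1,2,3,4>, <5>."""
    return [row[0], sum(row[1:5]), row[5]]

# Hypothetical rating counts for three items, three raters each.
six_way = [
    [0, 1, 2, 0, 0, 0],  # raters chose 1, 2, 2: partial disagreement
    [0, 0, 0, 0, 1, 2],  # raters chose 4, 5, 5
    [3, 0, 0, 0, 0, 0],  # all raters chose 0
]

three_way = [collapse_counts(row) for row in six_way]
print(three_way)
```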
Krippendorff's α (Krippendorff, 1980, 2013) is a coefficient appropriate for measuring the inter-annotator agreement of a dataset which is annotated by multiple judges and characterised by different magnitudes of disagreement and missing values. Krippendorff proposes distance metrics suitable for various scales: binary, nominal, interval, ordinal, and ratio. In ordinal measurement 14 the attributes can be rank-ordered, but distances between them do not have any meaning. Measured with Krippendorff's ordinal α, there is an inter-annotator agreement of α = 0.780 for relatedness scores in the Polish evaluation dataset, which is quite satisfactory as well. Hence, we conclude that our dataset is a reliable resource for the purpose of evaluating compositional distributional semantics models of Polish.
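A minimal sketch of Krippendorff's α with the ordinal distance metric is given below. It follows the usual coincidence-matrix formulation, in which the squared ordinal distance between categories c and k is (Σ n_g for g from c to k, minus (n_c + n_k)/2)². The function name and the toy ratings are assumptions introduced for illustration; the sketch does not reproduce the exact tooling used for the corpus.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_ordinal(units, categories):
    """Krippendorff's alpha with the ordinal distance metric.

    units: one list of ratings per item (missing ratings are simply left out).
    categories: all possible values, listed in rank order.
    Raises ZeroDivisionError if all ratings fall into a single category.
    """
    # Coincidence matrix: each ordered pair of ratings within an item
    # contributes 1 / (m - 1), where m is the number of ratings of the item.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # an item with a single rating cannot be paired
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    n_c = {c: sum(coincidences[(c, k)] for k in categories) for c in categories}
    n = sum(n_c.values())
    rank = {c: i for i, c in enumerate(categories)}

    def delta2(c, k):
        """Squared ordinal distance between categories c and k."""
        lo, hi = sorted((rank[c], rank[k]))
        between = sum(n_c[categories[g]] for g in range(lo, hi + 1))
        return (between - (n_c[c] + n_c[k]) / 2.0) ** 2

    observed = sum(o * delta2(c, k) for (c, k), o in coincidences.items())
    expected = sum(
        n_c[c] * n_c[k] * delta2(c, k) for c in categories for k in categories
    )
    return 1.0 - (n - 1) * observed / expected

# Hypothetical relatedness ratings (0..5) for four sentence pairs.
toy_units = [[5, 5, 4], [0, 1, 0], [3, 2, 3], [4, 4, 5]]
alpha = krippendorff_alpha_ordinal(toy_units, [0, 1, 2, 3, 4, 5])
print(round(alpha, 3))
```

Unlike Fleiss' kappa, this coefficient penalises a disagreement between scores 0 and 5 more heavily than one between scores 4 and 5, which is the property needed for graded relatedness judgements.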

Conclusions
The goal of this paper is to present the procedure of building a Polish evaluation dataset for the validation of compositional distributional semantics models. As we aim at building an evaluation dataset which is comparable to the SICK corpus, the general assumptions of our procedure correspond to the design principles of the SICK corpus. However, the procedure of building the SICK corpus cannot be adapted without modifications. First, the Polish seed-sentences have to be written based on the images which are selected from the 8K ImageFlickr dataset and split into thematic groups, since usable datasets are not publicly available. Second, since the process of transforming sentences seems to be language-specific, the linguistic transformation rules appropriate for Polish have to be defined from scratch. Third, the process of arranging Polish sentences into pairs is defined anew taking into account the characteristics of the data and bidirectional entailment annotations. The discrepancies relative to the SICK procedure also concern the annotation process itself. Since an entailment relation between two sentences need not be symmetric, each sentence pair is annotated for entailment in both directions. Furthermore, we introduce an element of human verification of correctness of automatically transformed sentences and some additional post-corrections.
The presented procedure of building a dataset was tested on Polish. However, it is very likely that the annotation framework will work for other Slavic languages (e.g. Czech with an excellent dependency parser).
The presented procedure results in building the Polish test corpus of relatively high quality, confirmed by the inter-annotator agreement coefficients of κ = 0.734 (measured with Fleiss' kappa) for entailment labels and of α = 0.780 (measured with Krippendorff's ordinal alpha) for relatedness scores.