Classification of telicity using cross-linguistic annotation projection

This paper addresses the automatic recognition of telicity, an aspectual notion. A telic event includes a natural endpoint ("she walked home"), while an atelic event does not ("she walked around"). Recognizing this difference is a prerequisite for temporal natural language understanding. In English, this classification task is difficult, as telicity is a covert linguistic category. In contrast, in Slavic languages, aspect is part of a verb's meaning and even available in machine-readable dictionaries. Our contributions are as follows. We successfully leverage additional silver-standard training data in the form of annotations projected from parallel English-Czech data, as well as context information, significantly improving automatic telicity classification for English compared to previous work. We also create a new data set of English texts manually annotated with telicity.


Introduction
This paper addresses the computational modeling of telicity, a linguistic feature which represents whether the event type evoked by a sentence's verb constellation (i.e., the verb and its arguments and modifiers) has a natural endpoint or not (Comrie, 1976; Smith, 1997); see (1a) and (1b).
(1) (a) Mary ate an apple. (telic)
    (b) I gazed at the sunset. (atelic)

Automatic recognition of telicity is a necessary step for natural language understanding tasks that require reasoning about time, e.g., natural language generation, summarization, question answering, information extraction or machine translation (Moens and Steedman, 1988; Siegel and McKeown, 2000). For example, there is an entailment relation between English Progressive and Perfect constructions (as shown in (2)), but only for atelic verb constellations.
(2) (a) He was swimming in the lake. (atelic) ⊨ He has swum in the lake.
    (b) He was swimming across the lake. (telic) ⊭ He has swum across the lake.
We model telicity at the word-sense level, corresponding to the fundamental aspectual class of Siegel and McKeown (2000), i.e., we take into account the verb and its arguments and modifiers, but no additional aspectual markers (such as the Progressive). In (2) we classify whether the event types "swim in the lake" and "swim across the lake" have natural endpoints. This is defined on a linguistic level rather than by world knowledge requiring inference. "Swimming in the lake" has no natural endpoint, as also shown by the linguistic test presented in (2). In contrast, "swimming across the lake" will necessarily be finished once the other side is reached.
In English, the aspectual notion of telicity is a covert category, i.e., a semantic distinction that is not expressed overtly by lexical or morphological means. As illustrated by (2) and (3), the same verb type (lemma) can introduce telic and atelic events to the discourse depending on the context in which it occurs.
(3) (a) John drank coffee. (atelic)
    (b) John drank a cup of coffee. (telic)

In Slavic languages, aspect is a component of verb meaning. Most verb types are either perfective or imperfective (and are marked as such in dictionaries). For example, the two occurrences of "drink" in (3) are translated into Czech using the imperfective verb "pil" and the perfective verb "vypil," respectively (Filip, 1994):

(4) (a) Pil kávu. (imperfective)
        'He was drinking (some) coffee.'
    (b) Vypil kávu. (perfective)
        'He drank (up) the coffee.'
Our contributions are as follows: (1) using the English-Czech part of InterCorp (Čermák and Rosen, 2012) and a valency lexicon for Czech verbs (Žabokrtský and Lopatková, 2007), we create a large silver standard with automatically derived annotations and validate our approach by comparing the labels given by humans versus the projected labels; (2) we provide a freely available data set of English texts taken from MASC (Ide et al., 2010) manually annotated for telicity; (3) we show that using contextual features and the silver standard as additional training data improves computational modeling of telicity for English in terms of F1 compared to previous work.

Related work
Siegel and McKeown (2000, henceforth SMK2000) present the first machine-learning-based approach to identifying completedness, i.e., telicity: determining whether an event reaches a culmination or completion point at which a new state is introduced. Their approach describes each verb occurrence exclusively using features reflecting corpus-based statistics of the corresponding verb type. For each verb type, they collect the co-occurrence frequencies with 14 linguistic markers (e.g., present tense, perfect, combination with temporal adverbs) in an automatically parsed background corpus. They call these features linguistic indicators and train a variety of machine learning models based on 300 clauses, of which roughly two thirds are culminated, i.e., telic. Their test set also contains about 300 clauses, corresponding to 204 distinct non-stative verbs. Their data sets are not available, but as this work is the most closely related to ours, we reimplement their approach and compare to it in Section 5.

Samardžić and Merlo (2016) create a model for the real-world duration of events (as short or long) for English verbs as annotated in TimeBank (Pustejovsky et al., 2003). The model is informed by temporal boundedness information collected from parallel English-Serbian data. Their only features are how often the respective verb type was aligned to Serbian verbs carrying certain affixes that indicate perfectiveness or imperfectiveness. Their usage of "verb type" differs from ours as they do not lemmatize, i.e., they always predict that "falling" is a long event, while "fall" is short. Our approach shares the idea of projecting aspectual information from Slavic languages to English, but in contrast to classifying verb types, we classify whether an event type introduced by the verb constellation of a clause is telic or atelic, making use of a machine-readable dictionary for Czech instead of relying on affix information.
Loáiciga and Grisot (2016) create an automatic classifier for boundedness, defined as whether the endpoint of an event has occurred or not, and show that this is useful for picking the correct tense in French translations of the English Simple Past. Their classifier employs a similar but smaller feature set compared to ours. Other related work on predicting aspect includes systems aiming at identifying lexical aspect (Siegel and McKeown, 2000; Friedrich and Palmer, 2014) or habituals (Mathew and Katz, 2009; Friedrich and Pinkal, 2015).
Cross-linguistic annotation projection approaches mostly make use of existing manually created annotations in the source language; similar to our approach, Diab and Resnik (2002) and Marasović et al. (2016) leverage properties of the source language to automatically induce annotations on the target side.

Data sets and annotation projection
We conduct our experiments based on two data sets: (a) English texts from MASC manually annotated for telicity, on which we train and test our computational models, and (b) a silver standard automatically extracted via annotation projection from the Czech-English part of the parallel corpus InterCorp, which we use as additional training data in order to improve our models.

Gold standard: MASC (EN)
We create a new data set consisting of 10 English texts taken from MASC (Ide et al., 2010), annotated for telicity. Texts include two essays, a journal article, two blog texts, two history texts from travel guides, and three texts from the fiction genre. Annotation was performed using the web-based SWAN system (Gühring et al., 2016). Annotators were given a short written manual with instructions. We model telicity for dynamic (eventive) verb occurrences because stative verbs (e.g., "like") do not have built-in endpoints by definition. Annotators choose one of the labels telic and atelic, or they skip clauses that they consider to be stative.

In a first round, each verb occurrence was labeled by three annotators (the second author of this paper plus two paid student assistants). They unanimously agreed on telicity labels for 1166 verb occurrences; these are directly used for the gold standard. Cases in which only two annotators agreed on a telicity label (the third annotator may have either disagreed or skipped the clause) are labeled by a fourth independent annotator (the first author), who did not have access to the labels of the first round. This second annotation round resulted in 697 further cases in which three annotators gave the same telicity label. Statistics for our final gold standard, which consists of all instances for which at least three out of the four annotators agreed, are shown in Table 1; "ambiguous" verb types are those for which the gold standard contains both telic and atelic instances. 510 of the 567 verb types also occur in the InterCorp silver standard, which provides training instances for 69 out of the 70 ambiguous verb types.

Finally, there are 446 cases for which no three annotators supplied the same label. Disagreement and skipping was mainly observed for verbs indicating attributions ("critics claim" or "the film uses"), which can be perceived either as statives or as instances of historic present.
Other difficult cases include degree verbs ("increase"), aspectual verbs ("begin"), perception verbs ("hear"), iteratives ("flash") and the verb "do." For these cases, decisions on how to treat them may have to be made depending on the concrete application; for now, they are excluded from our gold standard. Another source of error is that, despite the training, annotators sometimes conflate their world knowledge (i.e., that some events necessarily come to an end eventually, such as the "swimming in the lake" in (2)) with the annotation task of determining telicity at a linguistic level.
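The two-round adjudication scheme described above can be summarized in a few lines of code (a minimal sketch; the function name and label representation are ours, not part of the released data set):

```python
def adjudicate(first_round, fourth=None):
    """Gold label for one verb occurrence under the two-round scheme.

    first_round: labels from the three first-round annotators, each
    "telic", "atelic", or None (clause skipped as stative).
    fourth: label from the independent fourth annotator, consulted when
    only two of the first three agreed.
    Returns a label if at least three annotators agree, else None
    (the instance is excluded from the gold standard).
    """
    votes = [label for label in first_round if label is not None]
    for label in ("telic", "atelic"):
        n = votes.count(label)
        if n == 3:                      # unanimous in round one
            return label
        if n == 2 and fourth == label:  # two agree + fourth confirms
            return label
    return None
```
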

Silver standard: InterCorp (EN-CZ)
We create a silver standard of approximately 457,000 labeled English verb occurrences (i.e., clauses) extracted from the InterCorp parallel corpus project (Čermák and Rosen, 2012). We leverage the sentence alignments as well as the part-of-speech and lemma information provided by InterCorp. We use the data from 151 sentence-aligned books (novels) of the Czech-English part of the corpus and further align the verbs of all 1:1-aligned sentence pairs to each other using the verbs' lemmas, achieving high precision by making sure that the translation of the verbs is licensed by the free online dictionary Glosbe (https://glosbe.com). We then look up the aspect of the Czech verb in Vallex 2.8.3 (Žabokrtský and Lopatková, 2007), a valency lexicon for Czech verbs, and project the label telic to English verb occurrences corresponding to a perfective Czech verb and the label atelic to instances translated using imperfective verbs.
Our annotation projection approach leverages the fact that most perfective Czech verbs will be translated into English using verb constellations that induce a telic event structure, as they describe one-time finished actions. Imperfective verbs, in contrast, are used for actions that are presented as unfinished, repeated or extending in time (Vintr, 2001). They are often, but not always, translated using atelic verb constellations. A notable exception is the English Progressive: "John was reading a book" signals an ongoing event in the past, which is telic at the word-sense level but would require translation using the imperfective Czech verb "četl." The initial corpus contained 4% of sentences in the Progressive, out of which 89% were translated using imperfectives (for comparison, in the manually annotated validation sample described below, only 66% of Progressives received the label atelic). Due to this mismatch, we remove all English Progressive sentences from our silver standard. Statistics for the final automatically created silver standard are shown in Table 1.
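The projection heuristic, including the Progressive filter, can be sketched as follows (all names and the toy lexicon excerpt are illustrative; the actual pipeline operates on InterCorp alignments, Glosbe-verified verb pairs and the full Vallex lexicon):

```python
# Toy excerpt standing in for the Vallex aspect lookup.
VALLEX_ASPECT = {"vypít": "perfective", "pít": "imperfective"}

def project_label(en_clause, cz_lemma, aspect_lexicon=VALLEX_ASPECT):
    """Project a telicity label onto an aligned English clause."""
    if en_clause.get("progressive"):
        # Progressives are excluded: 89% align with imperfectives
        # regardless of word-sense-level telicity.
        return None
    aspect = aspect_lexicon.get(cz_lemma)
    if aspect == "perfective":
        return "telic"
    if aspect == "imperfective":
        return "atelic"
    return None  # verb not covered by the lexicon

label = project_label({"text": "John drank a cup of coffee",
                       "progressive": False}, "vypít")
```
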
For validation, we sample 2402 instances from the silver standard created above and have our three annotators from the first annotation round mark them in the same way as the MASC data. Sampling picked one instance per verb type but was otherwise random. Since skipping is allowed, a majority agreement among the three annotators is reached in 2126 cases (the annotators completely agreed on 1577 cases: 1114 telic, 203 atelic, 260 skipped; in addition, 85 cases were labeled atelic twice with one skip, and 219 were labeled telic twice with one skip). In this sample, 77.8% of the instances received the label telic from the human annotators, while 61.5% received the label telic from the projection method. The accuracy of our projection method can be estimated at about 78%; F1 for the telic class is 0.84, F1 for atelic is 0.65. Errors made by the projection include, for instance, habituals, which use the imperfective in Czech but are not necessarily atelic at the event type level, as in "John cycles to work every day."
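The reported validation scores follow the standard per-class definitions; a self-contained sketch of the computation (the toy data below is ours, not the actual validation sample):

```python
def per_class_f1(gold, pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy illustration:
gold = ["telic", "telic", "atelic", "atelic"]
pred = ["telic", "atelic", "atelic", "atelic"]
f1_atelic = per_class_f1(gold, pred, "atelic")
```
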

Computational modeling
In this section, we describe the computational models for telicity classification, which we test on the MASC data and which we improve by adding the InterCorp silver standard data.
Features. We model each instance by means of a variety of syntactic-semantic features, using the sitent toolkit (https://github.com/annefried/sitent). Preprocessing is done using Stanford CoreNLP (Chen and Manning, 2014) based on dkpro (Eckart de Castilho and Gurevych, 2014). For the verb's lemma, the features include the WordNet (Fellbaum, 1998) sense and supersense and linguistic indicators (Siegel and McKeown, 2000) extracted from GigaWord (Graff et al., 2003). Using only the latter as features corresponds to the system by SMK2000 as described in Section 2. The feature set also describes the verb's subject and objects: among others, their number, person and countability (taken from CELEX, http://celex.mpi.nl), their most frequent WordNet sense and the respective supersenses, and dependency relations between the argument and its governor(s). In addition, tense, voice and whether the clause is in the Perfect or Progressive aspect are reflected, as well as the presence of clausal (e.g., temporal) modifiers. For replicability, we make the configuration files for the feature set available.
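To illustrate the linguistic-indicator features, the following sketch computes, for each verb type, its relative co-occurrence frequency with a handful of markers (the marker inventory and corpus representation are ours; SMK2000 use 14 markers over an automatically parsed background corpus):

```python
from collections import Counter, defaultdict

# Toy "parsed corpus": each clause records its verb lemma and the
# markers observed in it.
corpus = [
    {"lemma": "swim", "markers": {"progressive"}},
    {"lemma": "swim", "markers": {"past", "temporal_adverb"}},
    {"lemma": "eat",  "markers": {"perfect"}},
    {"lemma": "eat",  "markers": {"perfect", "past"}},
]

MARKERS = ["past", "perfect", "progressive", "temporal_adverb"]

def linguistic_indicators(corpus, markers=MARKERS):
    """Per verb type: relative frequency of each linguistic marker."""
    totals = Counter(clause["lemma"] for clause in corpus)
    counts = defaultdict(Counter)
    for clause in corpus:
        for marker in clause["markers"]:
            counts[clause["lemma"]][marker] += 1
    return {
        lemma: [counts[lemma][m] / totals[lemma] for m in markers]
        for lemma in totals
    }

features = linguistic_indicators(corpus)
```

Every occurrence of a verb type receives the same indicator vector, which is exactly why SMK2000's features cannot distinguish the telic and atelic uses of an ambiguous verb.
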
Classifier. We train L1-regularized multi-class logistic regression models using LIBLINEAR (Fan et al., 2008) with parameter settings ε=0.01 and bias=1. For each instance described by feature vector x, the probability of each possible label y (here telic or atelic) is computed according to

  p(y | x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) ),

where the f_i are the feature functions, the λ_i are the weights learned for each feature function, and Z(x) is a normalization constant (Klinger and Tomanek, 2007). The feature functions f_i indicate whether a particular feature is present, e.g., whether the tense of the verb is "past."
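This setup can be approximated with scikit-learn, whose "liblinear" solver wraps the same LIBLINEAR library (a sketch with invented toy features, not the authors' exact configuration; tol plays the role of ε):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy instances with a few invented syntactic-semantic features; the
# real feature set is far richer (WordNet senses, linguistic
# indicators, argument properties, ...).
X = [
    {"lemma": "drink", "tense": "past", "object": "a cup of coffee"},
    {"lemma": "drink", "tense": "past", "object": "coffee"},
    {"lemma": "swim", "tense": "past", "modifier": "across the lake"},
    {"lemma": "gaze", "tense": "past", "modifier": "at the sunset"},
]
y = ["telic", "atelic", "telic", "atelic"]

# L1-regularized logistic regression; solver="liblinear" uses the
# LIBLINEAR library, tol corresponds to eps=0.01, and the intercept
# plays the role of the bias term.
model = make_pipeline(
    DictVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear", tol=0.01),
)
model.fit(X, y)
probs = model.predict_proba(X[:1])  # per-label probabilities p(y|x)
```
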

Experiments
Experimental settings. We evaluate our models via 10-fold cross-validation (CV) on the MASC data set. We split the data into folds by document, ensuring that no training data from the same document as a test instance is available, in order to avoid an unfair bias. We report results in terms of accuracy, F1 per class and macro-average F1 (the harmonic mean of macro-average precision and recall). We test significance of differences in F1 (for each class) using approximate randomization (Yeh, 2000; Padó, 2006) with p < 0.1, and significance of differences in accuracy using McNemar's test (McNemar, 1947) with p < 0.01. Table 2 shows our results; significantly different scores are marked with the same symbol where relevant (per column).
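Approximate randomization can be implemented in a few lines: the two systems' per-instance outcomes are randomly swapped, and the p-value estimates how often a difference at least as large as the observed one arises by chance (a sketch; function names are ours):

```python
import random

def approx_randomization(outcomes_a, outcomes_b, metric,
                         trials=10000, seed=0):
    """Two-sided approximate randomization test (Yeh, 2000).

    outcomes_a/outcomes_b: per-instance outcomes of the two systems
    (e.g., 1 for a correct prediction, 0 for a wrong one).
    metric: function mapping a list of outcomes to a score.
    """
    rng = random.Random(seed)
    observed = abs(metric(outcomes_a) - metric(outcomes_b))
    at_least = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(outcomes_a, outcomes_b):
            if rng.random() < 0.5:  # swap the systems on this instance
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)  # add-one smoothed p-value

def accuracy(outcomes):
    return sum(outcomes) / len(outcomes)
```
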

Results.
A simple baseline of labeling each instance with the overall majority class (telic) has a very high accuracy, but the output of this baseline is uninformative and results in a low F1. Rows titled "verb type" use the verb's lemma as their single feature and thus correspond to the informed baseline of using the training set majority class for each verb type. Rows labeled "+IC" indicate that the full set of instances with projected labels extracted from InterCorp has been added as additional training data in each fold; in rows titled "+ICs," the telic instances in InterCorp have been upsampled to match the 80:20 distribution in MASC. Our model using the full set of features significantly outperforms the verb type baseline as well as SMK2000 (marked †, ‡ and * in Table 2). Using the additional training data from InterCorp results in a large improvement for the difficult (because infrequent) atelic class, leading to the best overall results in terms of F1. The best results regarding accuracy and F1 are reached using the sampled version of the silver standard; the differences compared to the respective best scores in each column (in bold) are not significant.
Ablation experiments on the MASC data show that the features describing the clause's main verb are most important: when ablating the part-of-speech tag together with tense and aspect (Progressive or Perfect), performance deteriorates by 1.8% in accuracy and 5% in F1, hinting at a correlation between telicity and the choice of tense-aspect form. Whether this is due to an actual correlation of how telic and atelic verbs are used in context or merely due to annotation errors remains to be investigated in future work.
In sum, our experiments show that using annotations projected onto English text from parallel Czech text as cheap additional training data is a step toward creating better models for the task of classifying the telicity of verb occurrences.

Conclusion
Our model using a diverse set of features representing both verb-type relevant information and the context in which a verb occurs strongly outperformed previous work on predicting telicity (Siegel and McKeown, 2000). We have shown that silver standard data induced from parallel Czech-English data is useful for creating computational models for recognizing telicity in English. Our new manually annotated MASC data set is freely available; the projected annotations for InterCorp are published in a stand-off format due to license restrictions.

Future work
Aspectual distinctions made by one language rarely correspond completely to a linguistic phenomenon observed in another language. As we have discussed in Section 3.2, telicity in English and perfectiveness in Czech are closely related, and as shown by our experiments, the projected labels carry useful information for the telicity classification task. One idea for future work is thus to leverage additional projected annotations from similar phenomena in further languages, possibly improving overall performance by combining complementary information. Using more than two languages may also enable us to induce clusters corresponding to the different usages of imperfective verbs in Czech.

The presence of endpoints has consequences for the temporal interpretation of a discourse (Smith, 1997; Smith and Erbaugh, 2005), as endpoints introduce new states and therefore signal an advancement of time. In English, boundedness, i.e., whether an endpoint of an event has actually occurred, is primarily signaled by the choice of tense and Progressive or Perfect aspect. In tenseless languages such as Mandarin Chinese, boundedness is a covert category and closely related to telicity. We plan to leverage similar ideas as presented in this paper to create temporal discourse parsing models for such languages.
When translating, telic and atelic constructions also require different lexical choices and the appropriate selection of aspectual markers. Hence, telicity recognition is also relevant for machine translation research and could be a useful component of computer-aided language learning systems, helping learners to select appropriate aspectual forms.