Annotating omission in statement pairs

We focus on the identification of omission in statement pairs. We compare three annotation schemes, namely two crowdsourcing schemes and manual expert annotation. We show that the simpler of the two crowdsourcing approaches yields better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators' behavior can be explained by straightforward linguistic features. The classifier benefits from a model that uses lexical information beyond length and overlap measures. However, for our task, we argue that expert annotation, and not crowdsourcing, offers the best compromise between annotation cost and quality.


Introduction
In a user survey, the news aggregator Storyzy (http://storyzy.com) found that the two main obstacles to user satisfaction with the site's content were the redundancy of news items and missing information. Indeed, in the journalistic genre characteristic of online news, editors make frequent use of citations as prominent information, yet these citations are not always quoted in full. The decision to leave information out is often motivated by the political leaning of the news platform.
Existing approaches to the detection of political bias rely on bag-of-words models (Zhitomirsky-Geffet et al., 2016) that examine the words present in the writings. Our goal is to go beyond such approaches, which focus on what is said, by instead focusing on what is omitted. This method requires a pair of statements: an original one, and a shortened version with some deleted words or spans. The task is then to determine whether the information left out of the second statement conveys substantial additional information. If so, the pair presents an omission; cf. Table 1.
Omission detection in statement pairs constitutes a new task, which differs from the recognition of textual entailment (cf. Dagan et al., 2006) because in our case we are certain that the longer text entails the shorter one. What we want to estimate is whether the information not present in the shorter statement is relevant. To tackle this question, we use a supervised classification framework, which requires a dataset of manually annotated statement pairs. We conducted an annotation task on a sample of the corpus used by the news platform (Section 3). In this corpus, long 'reference' statements extracted from news articles are paired with short 'target' counterparts selected by string and date matching.
We then examined which features help identify cases of omission (Section 4). In addition to straightforward measures of word overlap (the Dice coefficient), we determined that a good deal of lexical information contributes to whether a pair is labeled as an omission. This work is, to the best of our knowledge, the first empirical study on omission identification in statement pairs.

Related work
To the best of our knowledge, no work has been published on omission detection as such. However, our work is related to a variety of questions that pertain to both linguistics and NLP.
Segment deletion is one of the most immediate forms of paraphrase; cf. Vila et al. (2014) for a survey. Another phenomenon that involves segment deletion, although in a very different setting, is ellipsis. In the case of ellipsis, the deleted segment can be reconstructed from a discourse antecedent in the same document, be it observed or idealized (Asher et al., 2001; Merchant, 2016). In the case of omission, a reference and a target version of a statement are involved: the segment deleted from one version has its antecedent in the other version, in another document, as a result of editorial choices.
Our task is similar to the problem of omission detection in translations, but the bilingual setting allows for word-alignment-based approaches (Melamed, 1996; Russell, 1999), which we cannot use in our setup. Omission detection is also related to hedge detection, which can be achieved using specific lexical triggers such as vagueness markers (Szarvas et al., 2012; Vincze, 2013).

Annotation Task
The goal of the annotation task is to provide each reference-target pair with a label: Omission, if the target statement leaves out substantial information, or Same if there is no information loss.
Corpus We obtained our examples from a corpus of English web newswire. The corpus is made up of aligned reference-target statement pairs; cf. Table 1 for examples. These statements were aligned automatically by means of word overlap metrics, a series of heuristics (such as comparing the alleged speaker and date of the statement given the article content), and a series of text normalization steps. We selected 500 pairs for annotation. Instead of selecting 500 random pairs, we selected a contiguous section from a random starting point, in order to obtain a more natural proportion of reference-to-target statements, given that reference statements can be associated with more than one target. (The documentation of the full corpus distribution will provide more details on the extraction process.)

Annotation setup Our first manual annotation strategy relies on the Amazon Mechanical Turk (AMT) crowdsourcing platform. We refer to AMT annotators as turkers. For each statement pair, we presented the turkers with a display like the one in Figure 1.
We used two different annotation schemes, namely OM_p, where the option to mark an omission reads "Text B leaves out some substantial information", and OM_e, where it reads "Text B leaves out something substantial, such as time, place, cause, people involved or important event information". The OM_p scheme aims to capture a naive user intuition of the relevance of a difference between statements, akin to the intuition of the users mentioned in Section 1, whereas OM_e aims to capture our intuition that relevant omissions relate to missing key news elements describable in terms of the 5-W questions (Parton et al., 2009; Das et al., 2012). We ran the AMT task twice, once for each scheme. For each scheme, we assigned 5 turkers per instance, and we required that the annotators be Categorization Masters according to the AMT scoring. We paid $0.05 per instance.
Moreover, in order to choose between OM_p and OM_e, two experts (two of the authors of this article) annotated the same 100 examples from the corpus, yielding the OE annotation set.

Annotation results The first column in Table 2 shows the agreement of the annotation tasks in terms of Krippendorff's α coefficient. A score of, e.g., 0.52 is not very high, but is well within what can be expected for crowdsourced semantic annotations. Note, however, that the chance correction that the calculation of α applies to a skewed binary distribution is very aggressive (Passonneau and Carpenter, 2014). The conservativeness of the chance-corrected coefficient can be assessed by comparing the raw agreement between experts (0.86) with their α of 0.67. OM_e lowers agreement slightly and hurts the agreement on Same, while agreement on Omission remains largely constant. Moreover, disagreement is not evenly distributed across annotated instances: some instances show perfect agreement, while others show maximal disagreement.
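For concreteness, the following is a minimal sketch of how the α values above can be computed for nominal labels, using Krippendorff's coincidence-matrix formulation; the input format and the toy values are our own illustration, not the paper's actual data.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """units: one list of labels per annotated instance (length >= 2).
    Nominal-data alpha via the coincidence-matrix formulation."""
    o = Counter()  # coincidence counts over ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i, a in enumerate(labels):
            for j, b in enumerate(labels):
                if i != j:
                    o[(a, b)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal label totals
    for (a, _), v in o.items():
        n_c[a] += v
    n = sum(n_c.values())
    d_o = sum(v for (a, b), v in o.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Toy check with five labels per instance, as in the AMT setup:
units = [["Omission"] * 4 + ["Same"],
         ["Same"] * 5,
         ["Omission"] * 3 + ["Same"] * 2]
print(krippendorff_alpha_nominal(units))
```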
We also measured the median annotation time per instance for all three methods: OM_e is almost twice as slow as OM_p (42s vs. 22s), while the expert annotation time in OE is 16s. The large time difference between OM_p and OM_e indicates that changing the annotation guidelines indeed has an effect on annotation behavior, and that the variation in agreement is not purely a result of the annotation noise to be expected in crowdsourcing. The fourth and fifth columns in Table 2 show the label distribution after adjudication. While the distribution of Omission and Same labels is very similar after applying simple majority voting, we observe that the distribution of agreement does change: in OM_p, approx. 80% of the Same-labeled instances are assigned with high agreement (at least four out of five votes), whereas only a third of the Same instances in OM_e have such high agreement. Both experts have a similar perception of omission, albeit with a different threshold: in the 14 cases where they disagree, one of the annotators shows a systematic preference for the Omission label.
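The adjudication step above reduces to a majority vote over the five turker labels per instance, plus a flag for high-agreement cases; a minimal sketch (label names and data layout are our assumptions):

```python
from collections import Counter

def adjudicate(votes):
    """votes: the 5 labels one instance received ('Omission'/'Same').
    Returns the majority label and whether agreement is high (>= 4/5)."""
    (label, count), = Counter(votes).most_common(1)
    return label, count >= 4

# Share of Same-labeled instances decided with high agreement:
instances = [["Same"] * 5, ["Same"] * 4 + ["Omission"], ["Omission"] * 3 + ["Same"] * 2]
adjudicated = [adjudicate(v) for v in instances]
same_flags = [high for label, high in adjudicated if label == "Same"]
print(sum(same_flags) / len(same_flags))  # ~0.8 for OM_p per the text
```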
We also use MACE to evaluate the stability of the annotations. Using an unsupervised expectation-maximization model, MACE assigns confidence scores to annotators, which are used to estimate the resulting annotations (Hovy et al., 2013). While we do not use the label assignments from MACE for the classification experiments in Section 4, we use them to measure how much the proportion of omissions changes with respect to simple majority voting. The more complex OM_e scheme shows, parallel to its lower agreement, a much higher fluctuation (both in relative and absolute terms) than OM_p, which also indicates that the former scheme provides annotations that are more subject to individual variation. While this difference is arguably a result of genuine linguistic reflection, it also indicates that the data obtained by this method is less reliable as such.
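MACE itself is an existing tool (Hovy et al., 2013) whose exact model we do not reproduce here. Purely to illustrate the idea of confidence-weighted adjudication, the sketch below runs a simplified EM with a single competence parameter per annotator (MACE additionally models a per-annotator spamming distribution); it shows how the inferred labels can drift from the majority vote when some annotators are judged less reliable.

```python
import numpy as np

def em_annotator_model(votes, n_labels=2, iters=50):
    """votes: int array (n_items, n_annotators), entries in {0..n_labels-1},
    or -1 for a missing annotation. Returns per-item posterior label
    distributions and a per-annotator competence estimate."""
    n_items, n_ann = votes.shape
    # Initialize posteriors from smoothed vote proportions.
    post = np.ones((n_items, n_labels))
    for i in range(n_items):
        for a in range(n_ann):
            if votes[i, a] >= 0:
                post[i, votes[i, a]] += 1.0
    post /= post.sum(axis=1, keepdims=True)
    comp = np.full(n_ann, 0.8)
    for _ in range(iters):
        # M-step: competence = expected agreement with the posterior label.
        num, den = np.zeros(n_ann), np.zeros(n_ann)
        for i in range(n_items):
            for a in range(n_ann):
                v = votes[i, a]
                if v >= 0:
                    num[a] += post[i, v]
                    den[a] += 1.0
        comp = np.clip(num / np.maximum(den, 1.0), 1e-3, 1 - 1e-3)
        # E-step: recompute label posteriors given the competences.
        logpost = np.zeros((n_items, n_labels))
        for i in range(n_items):
            for a in range(n_ann):
                v = votes[i, a]
                if v < 0:
                    continue
                for k in range(n_labels):
                    p = comp[a] if v == k else (1 - comp[a]) / (n_labels - 1)
                    logpost[i, k] += np.log(p)
        post = np.exp(logpost - logpost.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post, comp

# Toy run with three items and five turkers (1 = Omission, 0 = Same):
votes = np.array([[1, 1, 1, 0, 1],
                  [0, 0, 1, 0, 0],
                  [1, 0, 1, 0, 1]])
post, comp = em_annotator_model(votes)
```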
To sum up, while the label distribution is similar across schemes, the Same class drops in overall agreement under OM_e, but the Omission class does not.
In spite of the variation suggested by their α coefficients, the two AMT-annotated datasets are very similar: they are 85% identical after label assignment by majority voting, and the cosine similarity between their example-wise proportions of Omission labels is 0.92. The difference between these two figures is a consequence of the uncertainty in low-agreement examples. The similarity with OE is 0.89 for OM_p and 0.86 for OM_e; OM_p is thus more similar to the expert judgment. This might be related to the fact that the OM_e instructions prime turkers to favor named entities, leading them to pay less attention to other types of substantial information such as modality markers. We come back to the more general role of lexical clues in Section 4.
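The example-wise comparison above is a plain cosine between two proportion vectors; a minimal sketch, where each array holds, per pair, the fraction of the five turkers choosing Omission under one scheme (toy values, not the real annotations):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

p_omp = np.array([1.0, 0.8, 0.2, 0.0, 0.6])  # Omission-vote shares, OM_p
p_ome = np.array([1.0, 0.6, 0.4, 0.0, 0.8])  # Omission-vote shares, OM_e
print(cosine(p_omp, p_ome))
```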
Given that it is more internally consistent and matches OE better, we use the OM_p dataset for the rest of the work described in this article.

Classification experiments
Once the manually annotated corpus is built, we can assess the learnability of the Omission-Same decision problem, which constitutes a binary classification task. We aim to measure whether the annotators' behavior can be explained by simple proxy linguistic properties, such as word overlap or statement length, and/or by lexical properties.
Features: For a reference statement r, a target statement t, and the set M of the words that appear only in r, we generate the following feature sets:
1. Dice (F_a): Dice coefficient between r and t.
2. Length (F_b): length measures of r, t, and M.
3. BoW: bag-of-words representation of M.
4. DWR (F_d): distributional word representations (GloVe) of the words in M.
5. NER: named entity labels for the words in M, predicted by the 4-class Stanford Named Entity Recognizer (Finkel et al., 2005).
We use all exhaustive combinations of these feature sets to train a discriminative classifier, namely a logistic regression classifier, in order to obtain the best feature combination; Table 3 shows the classification results. We consider a feature combination the best when it outperforms the others in both accuracy and F1 for the Omission label. We compare all systems against the most frequent label (MFL) baseline. We evaluate each feature set twice, namely using five-fold cross-validation (CV-5 OM_p) and in a split scenario where we test on the 100 examples of OE after training on the remaining 400 examples from OM_p (Test OE). The three best systems (i.e., not significantly different from each other when tested on OM_p) are shown in the lower section of the table. We test for significance using a two-tailed Student's t-test and p < 0.05.
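As an illustration of this setup, the sketch below builds the F_a, F_b, and F_d features for a pair and fits a logistic regression. The tokenization, the GloVe lookup table, the embedding dimensionality, and the toy data are our assumptions, and the BoW and NER feature sets are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dice(r, t):
    """F_a: Dice coefficient between the token sets of r and t."""
    r, t = set(r), set(t)
    return 2 * len(r & t) / (len(r) + len(t))

def pair_features(ref_tokens, tgt_tokens, glove, dim=50):
    """ref/tgt are token lists; glove maps word -> vector of size dim."""
    m = [w for w in ref_tokens if w not in set(tgt_tokens)]  # the set M
    f_a = [dice(ref_tokens, tgt_tokens)]
    f_b = [len(ref_tokens), len(tgt_tokens), len(m)]         # length measures
    vecs = [glove[w] for w in m if w in glove]
    f_d = np.mean(vecs, axis=0) if vecs else np.zeros(dim)   # averaged DWR
    return np.concatenate([f_a, f_b, f_d])

# Toy end-to-end run; the real setup would use the 500 annotated pairs,
# pretrained GloVe vectors, and 5-fold cross-validation
# (e.g. sklearn.model_selection.cross_val_score with scoring="f1").
rng = np.random.default_rng(0)
ref = "the minister said the plan would fail next year".split()
tgt = "the minister said the plan would fail".split()
glove = {w: rng.normal(size=50) for w in ref}
X = np.vstack([pair_features(ref, tgt, glove),   # drops "next year"
               pair_features(ref, ref, glove)])  # identical statements
y = np.array([1, 0])  # toy labels: 1 = Omission, 0 = Same
clf = LogisticRegression(max_iter=1000).fit(X, y)
```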
As expected, the overlap (F_a) and length metrics (F_b) make the most competitive standalone features. However, we want to measure how much of the labeling of omission is determined by which words are left out, and not just by how many.
The system trained on BoW outperforms the system trained on DWR. However, BoW features contain a proxy for statement length: if n words differ between the reference and the target, then n features fire, which approximates the size of M. A distributional semantic model such as GloVe, by contrast, is made up of non-sparse, real-valued vectors and contains no such proxy for word density. If we examine the contribution of F_d as a feature model, we see that, while it falls short of its BoW counterpart, it beats the baseline by a margin of 5-10 points. In other words, regardless of the size of M, there is lexical information that explains the choice of considering a pair an omission.

Conclusion
We have presented an application-oriented effort to detect omissions between statement pairs. We have assessed two different AMT annotation schemes and compared them with expert annotations. The extended crowdsourcing scheme is defined closer to the expert intuition, but has lower agreement, so we use the plain scheme instead. Moreover, if we take the time needed for annotation into account, our conclusion is that crowdsourcing is in fact detrimental for this annotation task with respect to expert annotation. Finally, we show that simple linguistic clues allow a classifier to reach satisfying classification results (0.86-0.88 F1), better than those obtained by relying solely on the straightforward length and word overlap features. Further work includes analyzing whether the changes in the omission examples also involve changes of uncertainty class (Szarvas et al., 2012) or bias type (Recasens et al., 2013), as well as expanding the notion of omission to the detection of loss of detail in paraphrases. Moreover, we want to explore how to identify the most omission-prone news types, in a style similar to the characterization of unreliable users in Wei et al. (2013).