A Semantic Annotation Scheme for Quantification

This paper briefly describes the proposal called ‘QuantML’, which was accepted by the International Organisation for Standardisation (ISO) last February as a starting point for developing a standard for the interoperable annotation of quantification phenomena in natural language, as part of the ISO 24617 Semantic Annotation Framework. The proposal, firmly rooted in the theory of generalised quantifiers, neo-Davidsonian semantics, and DRT, covers a wide range of quantification phenomena. The QuantML scheme consists of (1) an abstract syntax which defines ‘annotation structures’ as triples and other set-theoretic constructs; (2) a compositional semantics of annotation structures; (3) an XML representation of annotation structures.


Introduction
Quantification is widespread in spoken and written language; it can be found in nearly every sentence, since it occurs whenever a predicate is applied to one or more argument sets (rather than single arguments). This commonly happens in two of the linguistically most prominent units: clauses and noun phrases. In clauses it happens when a verb is combined with sets of arguments, as in "Last year the American car manufacturers produced more than 12 million vehicles". In noun phrases it happens when a head noun is subject to modification, as in "I'm carrying some heavy books".
A widely held view is that the quantifiers of natural language are not determiners like "some" and "all", in spite of their superficial similarity with the quantifiers of formal logic, but rather noun phrases, like "more than 12 million vehicles" and "some heavy books". Other types of quantifiers can also be found, such as adverbs for temporal or spatial quantification ("always", "nowhere", "sometimes"), but these are of minor importance compared to noun phrases. Hobbs and Shieber (1987) have argued that the sentence "Some representatives of every department in most companies saw a few samples of every product", containing five noun phrases, has 42 readings, corresponding to equally many linguistically valid alternative scopings of the five quantifiers (out of the 120 mathematically possible permutations). Bunt and Muskens (1999) show that, when other ambiguity types are taken into account, such as those of a quantification's distributivity, the number of possible readings of an ordinary written sentence may run into the thousands.
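The figure of 120 is simply the number of linear orderings of five quantifiers (5! = 120). A minimal Python sketch, with quantifier labels chosen here purely for illustration, enumerates these candidate scope orders. It does not implement the structural filter of Hobbs and Shieber (1987) that reduces them to the 42 valid readings.

```python
from itertools import permutations
from math import factorial

# Illustrative labels for the five noun phrases of the example sentence.
QUANTIFIERS = (
    "some representatives",
    "every department",
    "most companies",
    "a few samples",
    "every product",
)


def candidate_scope_orders(quantifiers):
    """Return every linear scope order (widest scope first) of the quantifiers."""
    return list(permutations(quantifiers))


orders = candidate_scope_orders(QUANTIFIERS)
# The number of candidate orders equals n! for n quantifiers.
assert len(orders) == factorial(len(QUANTIFIERS))
```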
Quantification is the main source of structural ambiguity in natural language; therefore applications of natural language processing for which semantic information is important, such as information extraction and question answering, need effective ways of interpreting quantification expressions. This calls for a flexible and interoperable way of indicating aspects of quantification.
This paper presents an annotation scheme, called 'QuantML', where a range of aspects of natural language quantification are captured by a relatively small number of features, distributed over the components of annotation structures in a way that allows a compositional semantic interpretation. The QuantML scheme is based on a number of preliminary studies including Bunt (2017), Bunt et al. (2018), and Bunt (2018), and has recently been accepted as the basis for developing Part 12 of the ISO Semantic Annotation Framework (ISO 24617).
ISO standard 24617-1 for the annotation of time and events, commonly known as 'ISO-TimeML', has certain provisions for dealing with time-related quantification. For example, the temporal quantifier "daily" is represented as follows, where "P1D" stands for "period of one day":
(1) <TIMEX3 xml:id="t5" target="#token0" type="SET" value="P1D" quant="EVERY"/>
In ISO-TimeML @quant is one of the attributes of temporal entities, used to indicate that the entity is involved in a quantification. The limitations of this approach for annotating temporal quantification have been discussed by .
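To make the role of the @quant attribute concrete, the following Python sketch reads the TIMEX3 element of example (1) with the standard library's ElementTree parser. The helper name read_timex is hypothetical; only the element and its attributes come from the example above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

TIMEX = '<TIMEX3 xml:id="t5" target="#token0" type="SET" value="P1D" quant="EVERY"/>'


def read_timex(xml_text):
    """Parse one TIMEX3 element into a plain dictionary (hypothetical helper)."""
    elem = ET.fromstring(xml_text)
    return {
        "id": elem.get(f"{{{XML_NS}}}id"),
        "target": elem.get("target"),
        "type": elem.get("type"),
        "value": elem.get("value"),   # "P1D": period of one day
        "quant": elem.get("quant"),   # quantification over such periods
    }


timex = read_timex(TIMEX)
```

The quantification is carried by a single attribute on the temporal entity, so aspects such as scope or distributivity have no place to be recorded in this representation.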
The Parallel Meaning Bank (PMB, Abzianidze et al., 2017), building upon the Groningen Meaning Bank (GMB, Bos et al., 2017), is a corpus of semantically annotated sentences and texts in English, German, Dutch and Italian in raw and tokenised format with formal meaning representations. The PMB aims to provide fine-grained meaning representations in DRT for the most likely interpretation of a sentence, with a minimal use of underspecification. The GMB and the PMB are very useful resources for semantic study, but this work is somewhat different from the usual kind of annotation work, where semantic features are associated with small stretches of text or speech.

Ambiguity and lack of specificity
The multiplicity of possible readings of quantifications forms a challenge for language understanding systems, but hardly for humans, who are mostly not aware of the ambiguities. Human annotators who are not trained linguists or logicians likewise tend not to see all the possible readings of quantifications. Automatic annotation processes run into the same problems as language understanding systems, having a lack of general world knowledge and situation-specific context information. Both for manual and for automatic annotation it is therefore of practical importance to not be forced to make more specific choices than the available information and skills justify. On the other hand, it should of course be possible for a skilled annotator to make precise annotations if sufficiently detailed information is available. A useful annotation scheme should thus allow specifications with varying degrees of granularity.
The ambiguity challenge that quantification poses for automatic language understanding has led to the development of underspecification techniques in computational semantics, in particular for underspecified representation of quantifier scope (e.g. Alshawi, 1992; Bos, 1995; Reyle, 1993; Willis & Manandhar, 2001). In the same vein, QuantML allows the annotation of quantifier scope to be underspecified by making the specification of scope relations between a pair of quantifiers optional.
Scope is not the only source of quantifier ambiguity; ambiguities in the distributivity and individuation of quantifiers make the ambiguity problem even more dramatic than has generally been assumed in the literature, due to issues concerning precision and individuation which are considered next.

Precision and distributivity
Ambiguity in natural language quantification is mostly considered in terms of the number of logically precise interpretations. But natural language expressions are sometimes not meant to be interpreted with logical precision. This is in particular the case for quantifier distributivity. Consider the following example: (3) The men carried all the boxes upstairs.
In the event(s) described by this sentence it is not necessarily the case that all the carrying was done either collectively or individually; the sentence could for instance describe a set of events in which the men collectively carried the heaviest boxes, and individually the lighter ones. This means that the distributivity of the quantification over the set of men is neither collective nor individual (and the same is true for the quantification over the boxes); the term 'unspecific' has been introduced for this distributivity (Bunt, 1985). This interpretation can be represented in second-order predicate logic as shown in (4), where following Kamp & Reyle (1993) the notation X* is used to designate the set consisting of the members and subsets of X, and moreover the subscript notation P₀ to designate the characteristic predicate of the reference domain of a quantifier, which is a contextually determined part of the quantifier's source domain (as determined by an NP head), characterized by the predicate P.
This representation says that for every box in a given reference domain there is a carry-event in which a contextually distinguished man or group of men carried it upstairs or carried a set of boxes upstairs that contains it. A quantification with unspecific distributivity has both individual and collective participation as special cases, so 'unspecific' could be used to avoid having to choose a more specific distributivity. In the case of (3), 'unspecific' is the correct distributivity to assign. In cases where it is difficult to decide on the distributivity, 'unspecific' could be useful as a coarse-grained default value in annotations.
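The three distributivity values can be illustrated with a small model-checking sketch in Python. It assumes a toy encoding that is not part of QuantML: each event is reduced to the set of participants acting in it, 'collective' is taken to mean that the whole set acts in every event, and 'individual' that every event has a single participant. The function name and the example data are hypothetical.

```python
def classify_distributivity(participants, event_groups):
    """Classify how a participant set takes part in a set of events.

    event_groups holds, for each event, the set of participants acting in it.
    """
    participants = frozenset(participants)
    groups = [frozenset(g) for g in event_groups]
    if not groups or any(not g or not g <= participants for g in groups):
        raise ValueError("each event needs a non-empty group drawn from the participant set")
    if frozenset().union(*groups) != participants:
        raise ValueError("every participant must take part in some event")
    if all(g == participants for g in groups):
        return "collective"   # the whole set acts together in every event
    if all(len(g) == 1 for g in groups):
        return "individual"   # every event has exactly one participant
    return "unspecific"       # a mix of members and subsets, as in example (3)


men = {"man1", "man2"}
together = classify_distributivity(men, [{"man1", "man2"}])
separately = classify_distributivity(men, [{"man1"}, {"man2"}])
mixed = classify_distributivity(men, [{"man1", "man2"}, {"man1"}, {"man2"}])
```

In this encoding every collective or individual configuration also satisfies the conditions for 'unspecific', which is why that value can serve as a coarse-grained default.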

Individuation
The 'individuation' of a quantification is another source of ambiguity, as illustrated in (5): (5) a. I see no chicken in the garden.
b. I see no chicken in the stew.
The count/mass distinction is often characterized semantically in terms of 'individuation': "To learn 'apple' ... we must learn how much counts as an apple. (...) Such terms possess built-in modes (...) of dividing their reference (...) Consider 'shoe', 'pair of shoes', and 'footwear': (...) two of them divide their reference differently, and the third not at all." (Quine, 1960). In other words, count nouns have a domain of reference made up of individuals, while that of a mass noun is made up of entities (often called 'quantities') with mereological part-whole relations.
Quantifiers expressed by an NP with a count head noun may be ambiguous in a different way, as illustrated by (6a). This sentence could for example describe a series of events where last Monday Mario had a pizza, on Wednesday he had another pizza plus a few slices, and on Friday he had the slices remaining from Wednesday. Pizzas are a domain where it is common to consider parts of individuals, like in many domains related to food and drink. For other domains this may be less common, but in principle every physical object has parts, and many abstract objects as well. When interpreting an NP that describes domain involvement or domain size in terms of a non-integer number of individuals, this is clearly necessary. The interpretation of sentence (6a) as describing a set of events in which Mario has eaten some pieces of pizza, adding up to a total of three pizzas, can be represented by (6b), where P+ designates the property of being a part of an individual that has the property P, and Σ designates the joining together of parts of an individual.
(6) a. Mario had three pizzas last week.
b. ∃Y (∀y (y ∈ Y → (pizza+(y) ∧ ∃e (eat(e) ∧ agent(e, Mario) ∧ theme(e, y)))) ∧ |ΣY|_pizza = 3)
In model-theoretic semantics it is commonly assumed that individuals are atomic concepts, but for examples like the above we must assume an ontology where individuals have parts (in the mereological sense). This part-whole relation has the same logical properties as the corresponding relation for mass nouns. A quantification where parts of individuals should be taken into account will be said to have the individuation "count/parts".
Individuation and distributivity are distinct aspects of quantification; elements from a domain with count/parts individuation can for example participate collectively in a quantification, as in the report from a weight-lifting contest that "Tarzani lifted 27.5 pizzas" (see (15) below), if Tarzani lifted a pile of 25 whole pizzas topped by a short stack of five pizza halves.
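The arithmetic behind count/parts individuation can be sketched in Python with exact fractions. The encoding is an assumption made for illustration: each eaten or lifted part is represented as a fraction of one whole pizza, and the slice sizes (3/8 and 5/8) are invented, since the text only says "a few slices" and "the slices remaining".

```python
from fractions import Fraction


def total_in_individuals(parts):
    """Join parts of individuals and measure the result in whole individuals."""
    return sum(parts, Fraction(0))


# Mario's week: a pizza on Monday, a pizza plus a few slices on Wednesday,
# and the remaining slices on Friday (slice sizes are assumed).
marios_week = [Fraction(1), Fraction(1), Fraction(3, 8), Fraction(5, 8)]

# Tarzani's lift: 25 whole pizzas topped by five pizza halves.
tarzanis_pile = [Fraction(1)] * 25 + [Fraction(1, 2)] * 5

mario_total = total_in_individuals(marios_week)
tarzani_total = total_in_individuals(tarzanis_pile)
```

Under these assumptions the first total is three pizzas and the second is 27.5 pizzas. In the second case the parts participate collectively in a single lifting event, which shows that individuation and distributivity vary independently.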

Theoretical background
The theory of generalized quantifiers (GQT) has been successful in describing and understanding many aspects of natural language quantification. Together with neo-Davidsonian event semantics and Discourse Representation Theory (DRT), GQT forms the theoretical basis of the approach to quantification annotation taken in this paper.
GQT exploits the fact that quantification in natural language differs from that in formal logic in that logical quantifiers like "for all x" and "there exists an x" range over all the individual objects in a given universe of discourse, whereas quantifying expressions in natural language like "all the students", "an essay", "some coffee", indicate a restricted domain that the quantification applies to. GQT therefore views noun phrases as the quantifiers of natural language (rather than determiners) (Barwise and Cooper, 1981; see e.g. also van Benthem and ter Meulen, 1985 and Szabolcsi, 2010). This view generalizes to natural language quantifiers like "two books", "less than three weeks", "thirty tons of peanut butter". Semantically, generalized quantifiers are viewed as expressing properties of sets of individuals; for example, the quantifier "more than three essays" is interpreted as the property of being a set that contains more than three essays.
Davidson (1967) proposed to treat events as individual objects, facilitating the semantic interpretation of adverbs, like "quickly", "passionately", and adverbial quantifying expressions such as "everywhere", "never" and "at least three times". Following Parsons (1990), this event-based semantics can be expressed in semantic representations by means of one-place predicates applied to existentially quantified event variables, and two-place predicates to indicate the semantic roles of the participants in an event. This 'neo-Davidsonian' approach has been adopted in the ISO annotation standards 24617-1 (Time and events), 24617-4 (Semantic roles), 24617-7 (Spatial information), and 24617-8 (Discourse relations). Champollion (2015) has shown that GQT and neo-Davidsonian semantics can be combined successfully.
Still, natural language quantification is a semantically extremely complex set of phenomena, and especially the interpretation of plural noun phrases presents certain theoretical challenges for GQT (see e.g. Schwertel, 2005), some of which have been successfully approached in DRT (Kamp and Reyle, 1993), which has other limitations. Luckily, providing a semantics for quantification annotations is less challenging than providing a semantics for natural language expressions involving quantifications.
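Both ideas can be illustrated with a short Python sketch. It assumes a toy model in which individuals are strings and facts are tuples; the function names and the example data are hypothetical. The first function treats "more than n Ns" as a property of sets, and the second checks a neo-Davidsonian formula of the form ∃e (pred(e) ∧ role₁(e, a) ∧ role₂(e, b)) against a set of facts.

```python
def more_than(n, noun):
    """Generalized quantifier 'more than n <noun>s' as a property of sets."""
    def holds(candidates):
        return len({x for x in candidates if noun(x)}) > n
    return holds


def is_essay(x):
    # Toy noun denotation: anything whose name starts with "essay".
    return x.startswith("essay")


more_than_three_essays = more_than(3, is_essay)

# Neo-Davidsonian facts: one-place event predicates and two-place role predicates.
facts = {("write", "e1"), ("agent", "e1", "mary"), ("theme", "e1", "essay1")}


def exists_event(facts, pred, **roles):
    """True if some event satisfies pred and has the given role fillers."""
    events = {f[1] for f in facts if len(f) == 2 and f[0] == pred}
    return any(
        all((role, e, filler) in facts for role, filler in roles.items())
        for e in events
    )
```

For instance, more_than_three_essays holds of a set containing four essays and a note, and exists_event(facts, "write", agent="mary") holds in the toy model above.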
Several of the ISO semantic annotation standards use DRT's Discourse Representation Structures (DRSs) for defining a semantics of annotation structures. QuantML follows suit, combining ideas from GQT, neo-Davidsonian semantics, and DRSs in the semantics of its annotation structures.

Annotation scheme architecture
The annotation scheme outlined in this paper has been designed according to the ISO principles of semantic annotation (ISO standard 24617-6). This means that the scheme has a three-part definition consisting of (1) an abstract syntax that specifies the possible annotation structures at a conceptual level as set-theoretical constructs, such as pairs and triples of concepts; (2) a semantics that specifies the meaning of the annotation structures defined by the abstract syntax; (3) a concrete syntax, that specifies a representation format for annotation structures using XML expressions. Defining the semantics at the level of the abstract syntax puts the focus of an annotation standard at the conceptual level, rather than that of representation formats. Alternative representation formats may be defined with guaranteed interoperability. Annotators (human or automatic) deal with concrete representations only, but they can rely on the existence of an underlying abstract syntax and semantics.
Example (7) shows the QuantML annotation structure (in a slightly simplified form), XML representation, and semantics for the collective reading of a simple sentence. Besides the usual box notation of DRT, a string notation will also be used, which is shown in (7e). Capital variables are used to designate non-empty sets.
(7) a. Two thousand students protested
       Markables: m1 = Two thousand students; m2 = students; m3 = protested
    b. QuantML annotation structure:
       m3, protest, {m1, student, λz.|z| = 2000, indef}, agent, collective, {}
    c. Annotation representation:
       <entity xml:id="x1" target="#m1" pred="student" involvement="2000"/>
       <entity xml:id="e1" target="#m3" pred="protest"/>
       <participantLink event="#e1" participant="#x1" semRole="agent" distr="collective"/>
    d. Semantics:
It may be noted that the annotation semantics in (7d,e) is structurally the same as the DRS that Kamp and Reyle propose for the collective reading of the sentence "Three lawyers hired a new secretary" (Kamp and Reyle 1993, p. 327). For the individual reading of the sentence (7a), where the students act in individual protest-events (e.g. writing personal letters of protest), the annotation structure and its XML representation would both have 'individual' instead of 'collective' (and narrow event scope, by default), and the DRS interpretation would be as in (8):
Note also that the discourse referent X in these DRSs stands for the set of entities that participate in the protest events, which corresponds to the set of entities (or the property) that in a classical linguistic analysis is denoted by the VP. The DRS thus has a condition of the form [x ∈ X → ...θ(e,x)] for the individual reading and a condition with θ(e,X) for the collective reading. The two conditions |X| = 2000, [x ∈ X → student(x)] together reflect the GQT interpretation of the subject NP.
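The 'flat' XML representation in (7c) refers to entities by identifier rather than by nesting. The following Python sketch resolves those references with the standard library's ElementTree parser. The <annotation> wrapper element and the function name are assumptions made here so that the three elements parse as one document; the element and attribute names are those of (7c).

```python
import xml.etree.ElementTree as ET

# Attribute key under which ElementTree stores xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <annotation> root is an assumed wrapper, not part of example (7c).
DOC = """<annotation>
  <entity xml:id="x1" target="#m1" pred="student" involvement="2000"/>
  <entity xml:id="e1" target="#m3" pred="protest"/>
  <participantLink event="#e1" participant="#x1" semRole="agent" distr="collective"/>
</annotation>"""


def resolve_links(xml_text):
    """Return (event pred, role, participant pred, distributivity) per link."""
    root = ET.fromstring(xml_text)
    entities = {e.get(XML_ID): e for e in root.iter("entity")}
    resolved = []
    for link in root.iter("participantLink"):
        # References have the form "#id"; strip the leading hash to look them up.
        event = entities[link.get("event").lstrip("#")]
        participant = entities[link.get("participant").lstrip("#")]
        resolved.append(
            (event.get("pred"), link.get("semRole"),
             participant.get("pred"), link.get("distr"))
        )
    return resolved


links = resolve_links(DOC)
```

Changing distr="collective" to distr="individual" in the link element is all that distinguishes the two readings at the level of the XML representation.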

Abstract syntax
The structures defined by the abstract syntax are n-tuples of elements that are either basic concepts, taken from a store called the 'conceptual inventory', or, recursively, such n-tuples. Two types of structure are distinguished: entity structures and link structures. An entity structure contains semantic information about a segment of primary data and is formally a pair ⟨m, s⟩ consisting of a markable, which refers to a segment of primary data, and certain semantic information. A link structure contains information about the way two or more segments of primary data are semantically related.
QuantML has three kinds of entity structures: (1) for events; (2) for participants; (3) for restrictions on sets of participants. A quantified set of participants is characterized by its domain, its individuation, its reference domain involvement, and its definiteness, and possibly by its size. The entity structure ⟨m, s⟩ for a set of participants thus contains a triple s = ⟨⟨D, v⟩, q, d⟩ with D = characteristic domain predicate, v = individuation, q = reference domain involvement, and d = definiteness, with possibly an additional size specification. The domain component is more complex when the restrictor of an NP contains one or more head noun modifiers and/or multiple, conjoined heads (see Bunt 2018 for details). Entity structures for sets of events are simpler than those for participants; they contain just a predicate that characterizes a domain of events, and if applicable the cardinality of a set of repeated events or the frequency of recurring events.
Two kinds of link structure are defined: participation structures, which link participants to events, and scope link structures. Participation structures specify (1) a set of events; (2) a set of participants; (3) a semantic role; (4) the distributivity of the participation; (5) the relative scope of the event quantification. Scope link structures specify the relative scope of two participant entity structures.
Annotation structures for quantification are associated mostly with clauses and their constituent NPs and verbs. The annotation structure for a clause is a quadruple consisting of an event structure, a set of participant structures, a set of participation link structures, and a (possibly empty) set of scope link structures. In a complete clause annotation structure all participant entity structures are linked to the verb's event entity structure, and all the relative scopes of all participant entity structures are specified.
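The structures described above can be mirrored by a few Python dataclasses. This is a sketch under stated assumptions, not the normative abstract syntax: all class and field names are hypothetical, the participant structure is flattened, link structures are restricted to one participant each, and the values "count" and "wider" are invented placeholders. The is_complete method encodes the completeness condition for clause annotation structures.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Tuple


@dataclass(frozen=True)
class EventStructure:
    markable: str
    predicate: str


@dataclass(frozen=True)
class ParticipantStructure:
    markable: str
    domain: str            # characteristic domain predicate D
    individuation: str     # v
    involvement: str       # q
    definiteness: str      # d
    size: Optional[str] = None


@dataclass(frozen=True)
class ParticipationLink:
    event: EventStructure
    participant: ParticipantStructure
    role: str
    distributivity: str
    event_scope: str = "narrow"


@dataclass(frozen=True)
class ScopeLink:
    first: ParticipantStructure
    second: ParticipantStructure
    relation: str          # relative scope of first with respect to second


@dataclass(frozen=True)
class ClauseAnnotation:
    event: EventStructure
    participants: Tuple[ParticipantStructure, ...]
    links: Tuple[ParticipationLink, ...]
    scopes: Tuple[ScopeLink, ...] = ()

    def is_complete(self) -> bool:
        """All participants linked to the event, all pairwise scopes given."""
        linked = all(
            any(l.event == self.event and l.participant == p for l in self.links)
            for p in self.participants
        )
        given = {frozenset((s.first.markable, s.second.markable)) for s in self.scopes}
        needed = {
            frozenset((a.markable, b.markable))
            for a, b in combinations(self.participants, 2)
        }
        return linked and needed <= given


protest = EventStructure("m3", "protest")
students = ParticipantStructure("m1", "student", "count", "2000", "indef")
agent_link = ParticipationLink(protest, students, "agent", "collective")
clause = ClauseAnnotation(protest, (students,), (agent_link,))
```

Because the scope links are an optional component, a clause with several NPs can be annotated without them; such an annotation is well-formed but not complete, which is how scope underspecification is accommodated.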

Semantics
The QuantML semantics specifies a recursive interpretation function I_Q that translates annotation structures into DRSs in a compositional way: the interpretation of an annotation structure is obtained by combining the interpretations of its component entity structures and participation link structures, in a way that is determined by scope link structures (if any). A full specification of the QuantML semantics would go beyond the scope of this paper; the reader is referred to Bunt (2018, Appendix C). Here we outline the overall approach and present some interesting parts of the definition of I_Q.
The QuantML interpretation function translates every participant entity structure, event entity structure, and participation link structure into a DRS and combines these. Consider the example in (7). The entity structures for "Two thousand students" and "protested" are translated into the DRSs shown in (10). For the participant entity structure this is achieved by applying an instance of clause (9a) in the I_Q definition, which interprets entity structures with source domain D, individuation v, involvement q, and definiteness indef. The interpretation of the domain involvement specification q is defined in (9b-c), and that of the domain specification in (9d-e).
The DRS in (10a) says that there exists a set with the property of containing two thousand students, reflecting the GQT approach to NP interpretation. The DRS in (10b) together with (12) illustrates the adoption of neo-Davidsonian event semantics.
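The way such component DRSs combine can be sketched in Python with a deliberately simplified encoding: a DRS is a set of discourse referents plus a set of conditions written as strings, and merging takes the union of both. This is an assumption for illustration only; it ignores nested DRSs and scope, and the event and link DRSs below are guesses at a plausible flat form rather than the paper's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRS:
    referents: frozenset
    conditions: frozenset

    def merge(self, other):
        """Flat DRS merge: union of referents and union of conditions."""
        return DRS(self.referents | other.referents,
                   self.conditions | other.conditions)


# Participant DRS: a set X containing two thousand students.
participant = DRS(frozenset({"X"}),
                  frozenset({"|X| = 2000", "x ∈ X → student(x)"}))

# Event DRS (assumed form): an event e that is a protest.
event = DRS(frozenset({"e"}), frozenset({"protest(e)"}))

# Link DRS (assumed form): X takes part collectively as the agent of e.
link = DRS(frozenset(), frozenset({"agent(e, X)"}))

collective_reading = participant.merge(event).merge(link)
```

The merged result keeps the two conditions |X| = 2000 and x ∈ X → student(x) that reflect the GQT interpretation of the subject NP, alongside the event predicate and the role condition that reflect the neo-Davidsonian treatment.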
The participation link structure has in this example the form ⟨ε_E, {ε_P1}, R, d, σ⟩, where ε_E and ε_P1 are the event and participant entity structures that are linked in the Agent role (R = Agent), with d = collective, and σ (event scope) = narrow. The semantic interpretation of such a structure is defined as follows, where '∪' designates the familiar merge operation for DRSs:
Triples like ⟨R, d, σ⟩ are interpreted as shown in (12):
Applying rule (11) to the right-hand sides of (10) and (12c), with the values for R, d and σ substituted, gives the desired result shown in (7d,e):
The annotation structures defined by the QuantML abstract syntax can be deeply nested, since participation link structures contain the entity structures that they link; see the argument of the I_Q function in (11). (Their XML representations, by contrast, are 'flat', which is more convenient for their practical use.) A participant entity structure inside a link structure can itself have a complex structure, for instance due to the head noun of an NP being modified by a quantifying relative clause. In a well-formed annotation structure for a clause that contains only a single NP, like (7), such a link structure contains all the semantic information. The only scoping in such cases is between the NP quantifier and the verb viewed as an event quantifier (which is useful for examples like "All passengers died in the crash" and "Mary wants to buy an inexpensive coat", cf. Szabolcsi, 2010). For clauses with multiple NPs the additional information about their relative scopes is taken into account in the I_Q function by applying 'scoped merge' operations to their interpretations, and where appropriate inversion operations in order to obtain the interpretations of 'inversely linked' quantified head noun modification by a PP or a relative clause (Barker, 2014). The reader is referred to Bunt (2018) for details.
This section illustrates the use of QuantML with a few examples. The first example concerns two quantifications with unspecific distributivity and an NP head with adjectival modification.
(13) a. Two young men carried all the boxes upstairs.

Conclusions and Further Work
The QuantML annotation scheme was recently proposed to the International Organisation for Standardisation for developing into Part 12 of the ISO Semantic Annotation Framework, and was accepted as such in March 2019 (ISO, 2019a). The QuantML scheme is rooted in the theory of generalized quantifiers, neo-Davidsonian event semantics, and DRT, and is methodologically shaped after the ISO principles of semantic annotation (ISO standard 24617-6). Unlike these semantic theories, the proposed annotation scheme has a number of provisions for leaving aspects of quantification unspecified, intended on the one hand to reflect the vagueness and ambiguity that natural language quantifications may have, and on the other hand to allow annotators to make annotations with varying degrees of granularity.
The current proposal still has several loose ends, e.g. related to modality and polarity and to intensional contexts. It is also fair to say that, where GQT and DRT do not provide adequate solutions for all the complexities of quantification in natural language, currently no annotation scheme can be expected to do much better.
The next important step, after or alongside further elaboration of the proposed annotation scheme, is to apply it in annotation projects and see to what extent it may need to be adapted in order to be optimally useful for language technology applications and for empirically-based semantic investigations.