Most “babies” are “little” and most “problems” are “huge”: Compositional Entailment in Adjective-Nouns

We examine adjective-noun (AN) composition in the task of recognizing textual entailment (RTE). We analyze behavior of ANs in large corpora and show that, despite conventional wisdom, adjectives do not always restrict the denotation of the nouns they modify. We use natural logic to characterize the variety of entailment relations that can result from AN composition. Predicting these relations depends on context and on common-sense knowledge, making AN composition especially challenging for current RTE systems. We demonstrate the inability of current state-of-the-art systems to handle AN composition in a simplified RTE task which involves the insertion of only a single word.


Overview
The ability to perform inference over utterances is a necessary component of natural language understanding (NLU). Determining whether one sentence reasonably implies another is a complex task, often requiring a combination of logical deduction and simple common-sense. NLU tasks are made more complicated by the fact that language is compositional: understanding the meaning of a sentence requires understanding not only the meanings of the individual words, but also understanding how those meanings combine.
Adjectival modification is one of the most basic types of composition in natural language. Most existing work in NLU makes a simplifying assumption that adjectives tend to be restrictive, i.e. adding an adjective modifier limits the set of things to which the noun phrase can refer. For example, the set of little dogs is a subset of the set of dogs, and we cannot in general say that dog entails little dog. This assumption has been exploited by high-performing RTE systems (MacCartney and Manning, 2008; Stern and Dagan, 2012), as well as used as the basis for learning new entailment rules (Baroni et al., 2012; Young et al., 2014).
However, this simplified view of adjectival modification often breaks down in practice. Consider the question of whether laugh entails bitter laugh in the following sentences:
1. Again his laugh echoed in the gorge.
In (1), we have no reason to believe the man's laugh is bitter. In (2), however, it seems clear from context that we are dealing with an unpleasant person for whom laugh entails bitter laugh. Automatic NLU should be capable of similar reasoning, taking both context and common sense into account when making inferences.
This work aims to deepen our understanding of AN composition in relation to automated NLU. The contributions of this paper are as follows:
• We conduct an empirical analysis of ANs and their entailment properties.
• We define a task for directly evaluating a system's ability to predict compositional entailment of ANs in context.
• We benchmark several state-of-the-art RTE systems on this task.

Recognizing Textual Entailment
The task of recognizing textual entailment (RTE) (Dagan et al., 2006) is commonly used to evaluate the state of the art in automatic NLU. The RTE task is: given two utterances, a premise (p) and a hypothesis (h), would a human reading p typically infer that h is most likely true? Systems are expected to produce either a binary (YES/NO) or trinary (ENTAILMENT/CONTRADICTION/UNKNOWN) output. The type of knowledge tested in the RTE task has shifted in recent years. While older datasets mostly captured logical reasoning (Cooper et al., 1996) and lexical knowledge (Giampiccolo et al., 2007) (see Examples (1) and (2) in Table 1), the recent datasets have become increasingly reliant on common-sense knowledge of scenes and events (Marelli et al., 2014). In Example (4) in Table 1, for which the gold label is ENTAILMENT, it is perfectly reasonable to assume the dogs are playing. However, it is not necessarily true that running entails playing: maybe the dogs are being chased by a bear and are running for their lives! Example (4) is just one of many RTE problems which rely on intuition rather than strict logical inference.
Table 1: [...] (3) have yet to be explicitly included in RTE tasks, commonsense inferences like those in (4) (from the SICK dataset) have become a common part of NLU tasks like RTE, question answering, and image labeling.
Transformation-based RTE. There has been an enormous range of approaches to automatic RTE, from those based on theorem proving (Bjerva et al., 2014) to those based on vector space models of semantics (Bowman et al., 2015a). Transformation-based RTE systems attempt to solve the RTE problem by identifying a sequence of atomic edits (MacCartney, 2009) which can be applied, one by one, in order to transform p into h. Each edit can be associated with some entailment relation. Then, the entailment relation that holds between p and h overall is a function of the entailment relations associated with each atomic edit. This approach is appealing in that it breaks potentially complex p/h pairs into a series of bite-sized pieces. Transformation-based RTE is widely used, not only in rule-based approaches (MacCartney and Manning, 2008; Young et al., 2014), but also in statistical RTE systems (Stern and Dagan, 2012).
MacCartney (2009) defines an atomic edit applied to a linguistic expression as the deletion DEL, insertion INS, or substitution SUB of a subexpression. If x is a linguistic expression and e is an atomic edit, then e(x) is the result of applying the edit e to the expression x. For example:
x = a_1 girl_2 in_3 a_4 red_5 dress_6
e = DEL(red, 5)
e(x) = a_1 girl_2 in_3 a_4 dress_5
We say that the entailment relation that holds between x and e(x) is generated by the edit e. In the above example, we would say that e generates a forward entailment (⊏) since a girl in a red dress entails a girl in a dress.
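The atomic edits described above can be illustrated with a minimal sketch. The names AtomicEdit and apply_edit are hypothetical and are not taken from MacCartney (2009) or from any RTE system; the sketch assumes whitespace-tokenized expressions and 1-indexed positions, as in DEL(red, 5).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AtomicEdit:
    """Hypothetical container for one atomic edit (illustration only)."""
    op: str                      # "DEL", "INS", or "SUB"
    position: int                # 1-indexed token position, as in DEL(red, 5)
    token: Optional[str] = None  # token deleted, inserted, or substituted in


def apply_edit(tokens: List[str], edit: AtomicEdit) -> List[str]:
    """Return e(x): a new token list with the edit applied to x."""
    i = edit.position - 1
    if edit.op == "DEL":
        # If a token is named, check that it really sits at that position.
        if edit.token is not None and tokens[i] != edit.token:
            raise ValueError("token at position does not match the edit")
        return tokens[:i] + tokens[i + 1:]
    if edit.op == "INS":
        if edit.token is None:
            raise ValueError("INS needs a token")
        return tokens[:i] + [edit.token] + tokens[i:]
    if edit.op == "SUB":
        if edit.token is None:
            raise ValueError("SUB needs a token")
        return tokens[:i] + [edit.token] + tokens[i + 1:]
    raise ValueError(f"unknown edit operation: {edit.op}")


x = "a girl in a red dress".split()
e = AtomicEdit("DEL", 5, "red")
print(" ".join(apply_edit(x, e)))
```

Applying INS(red, 5) to the result recovers the original expression, which is the direction of edit studied in the rest of the paper.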

Natural Logic Entailment Relations
Natural logic (MacCartney, 2009) is a formalism that describes entailment relationships between natural language strings, rather than operating over mathematical formulae. Natural logic enables both light-weight representation and robust inference, and is an increasingly popular choice for NLU tasks (Bowman et al., 2015b; Pavlick et al., 2015). There are seven "basic entailment relations" described by natural logic, five of which we explore here. These five relations, as they might hold between an AN and the head N, are summarized in Figure 1. The forward entailment relation is the restrictive case, in which the AN (brown dog) is a subset of (and thus entails) the N (dog) but the N does not entail the AN (dog does not entail brown dog). The symmetric reverse entailment can also occur, in which the N is a subset of the set denoted by the AN. An example of this is the AN possible solution: i.e. all actual solutions are possible solutions, but there is an abundance of possible solutions that are not and will never be actual solutions.
In the equivalence relation, AN and N denote the same set (e.g. the entire universe is the same as the universe), whereas in the alternation relation, AN and N denote disjoint sets (e.g. a former senator is not a senator). In the independence relation, the AN has no determinable entailment relationship to the N (e.g. an alleged criminal may or may not be a criminal).

Simplified RTE Task
The focus of this work is to determine the entailment relation that exists between an AN and its head N in a given context. To do this, we define a simplified entailment task identical to the normal RTE task, with the constraint that p and h differ only by one atomic edit e as defined in Section 2. We look only at insertion INS(A) and deletion DEL(A), where A must be a single adjective.
We use a 3-way entailment classification where the possible labels are ENTAILMENT, CONTRADICTION, and UNKNOWN. This allows us to recover the basic entailment relation from Section 3: by determining the labels associated with the INS operation and the DEL operation, we can uniquely identify each of the five relations.
Figure 1: Different entailment relations that can exist between an adjective-noun and the head noun. The best-known case is that of forward entailment, in which the AN denotes a subset of the N (e.g. brown dog). However, many other relationships may exist, as modeled by natural logic.
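As a rough illustration of how a pair of 3-way labels identifies a relation, the sketch below maps (INS label, DEL label) to one of the five relations. The table is an assumption reconstructed from the set-theoretic descriptions in Section 3 (for example, forward entailment means AN entails N, so DEL yields ENTAILMENT while INS yields UNKNOWN); it is not copied from the paper, and the function name basic_relation is hypothetical.

```python
ENT, CON, UNK = "ENTAILMENT", "CONTRADICTION", "UNKNOWN"

# Keys are (label generated by INS(A), label generated by DEL(A)).
# Assumed mapping, derived from the definitions of the five relations.
RELATION_BY_LABELS = {
    (ENT, ENT): "equivalence",         # N entails AN and AN entails N
    (UNK, ENT): "forward entailment",  # AN entails N only (brown dog / dog)
    (ENT, UNK): "reverse entailment",  # N entails AN only (possible solution)
    (CON, CON): "alternation",         # AN and N are disjoint (former senator)
    (UNK, UNK): "independence",        # no determinable relation
}


def basic_relation(ins_label: str, del_label: str) -> str:
    """Return the relation for a label pair, or "undefined" if none fits."""
    return RELATION_BY_LABELS.get((ins_label, del_label), "undefined")


print(basic_relation(UNK, ENT))
print(basic_relation(CON, ENT))
```

Label pairs outside the table, such as INS generating a contradiction while DEL generates an entailment, do not correspond to any of the five relations and are returned as "undefined" in this sketch.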

Limitations
Modeling denotations of ANs and N. We note that this task design does not directly ask about the relationship between the sets denoted by the AN and by the N (as shown in Figure 1). Rather than asking "Is this instance of AN an instance of N?" we ask "Is this statement that is true of AN also true of N?" While these are not the same question, they are often conflated in NLP, for example, in information extraction, when we use statements about ANs as justification for extracting facts about the head N (Angeli et al., 2015). We focus on the latter question and accept that this prevents us from drawing conclusions about the actual set theoretic relation between the denotation of AN and the denotation of N. However, we are able to draw conclusions about the practical entailment relation between statements about the AN and statements about the N.
Monotonicity. In this simplified RTE task, we assume that the entailment relation that holds overall between p and h is attributable wholly to the atomic edit (i.e. the inserted or deleted adjective). This is an over-simplification. In practice, several factors can cause the entailment relation that holds between the sentences overall to differ from the relation that holds between the AN and the N. For example, quantifiers and other downward-monotone operators can block or reverse entailments (brown dog → dog, but no brown dog ↛ no dog). While we make some effort to avoid selecting such sentences for our analysis (Section 5.3), fully identifying and handling such cases is beyond the scope of this paper. We acknowledge that monotone operators and other complicating factors (e.g. multiword expressions) might be present in our data, but we believe, based on manual inspection, that they are not frequent enough to substantially affect our analyses.

Experimental Design
To build an intuition about the behavior of ANs in practice, we collect human judgments of the entailments generated by inserting and deleting adjectives from sentences drawn from large corpora. In this section, we motivate our design decisions, before carrying out our full analysis in Section 6.

Human judgments of entailment
People often draw conclusions based on "assumptions that seem plausible, rather than assumptions that are known to be true" (Kadmon, 2001). We therefore collect annotations on a 5-point scale, ranging from 1 (definite contradiction) to 5 (definite entailment), with 2 and 4 capturing likely (but not certain) contradiction/entailment respectively. We recruit annotators on Amazon Mechanical Turk. We tell each annotator to assume that the premise "is true, or describes a real scenario" and then, using their best judgement, to indicate how likely it is, on a scale of 1 to 5, that the hypothesis "is also true, or describes the same scenario." They are given short descriptions and several examples of sentence pairs that constitute each score along the 1 to 5 scale. They are also given the option to say that "the sentence does not make sense," to account for poorly constructed p/h pairs, or errors in our parsing. We use the mean score of the three annotators as the true score for each sentence pair.
Inter-annotator agreement. To ensure that our judgements are reproducible, we re-annotate a random 10% of our pairs, using the same annotation setup but a different set of annotators. We compute the intra-class correlation (ICC) between the scores received on the first round of annotation, and those received in the second pass. ICC is related to Pearson correlation, and is used to measure consistency among annotations when the group of annotators measuring each observation is not fixed, as opposed to metrics like Fleiss's κ which assume a fixed set of annotators. On our data, the ICC is 0.77 (95% CI 0.73-0.81) indicating very high agreement. These twice-annotated pairs will become our test set in Section 7.
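For readers unfamiliar with ICC, the following is a minimal sketch of one common variant, the one-way random-effects ICC(1,1), which suits the case where raters are not fixed across items. Which ICC variant was actually used here is not stated, so this choice is an assumption, and the function name icc_one_way is hypothetical. Each row holds the scores one p/h pair received in the two annotation rounds.

```python
from typing import Sequence


def icc_one_way(ratings: Sequence[Sequence[float]]) -> float:
    """One-way random-effects ICC(1,1) for n items with k ratings each."""
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items with the same k >= 2 ratings")

    row_means = [sum(row) / k for row in ratings]
    grand_mean = sum(row_means) / n

    # Mean square between items and mean square within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ratings have no variance")
    return (ms_between - ms_within) / denominator


# Toy data (invented for illustration): round-1 and round-2 scores per pair.
toy = [[1.0, 1.3], [3.0, 2.7], [4.7, 5.0], [2.0, 2.3]]
print(round(icc_one_way(toy), 3))
```

Values near 1 indicate that the two rounds rank and score the pairs consistently; values near 0 or below indicate little or no consistency.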

Data
Selecting contexts. We first investigate whether, in naturally occurring data, there is a difference between contexts in which the author uses the AN and contexts in which the author uses only the (unmodified) N. In other words, in order to study the effect of an A (e.g. financial) on the denotation of an N (e.g. system), is it better to look at contexts like (a) below, in which the author originally used the AN financial system, or to use contexts like (b), in which the author used only the N system?
(a) The TED spread is an indication of investor confidence in the U.S. financial system.
(b) Wellers hopes the system will be fully operational by 2015.
We will refer to contexts like (a) as natural contexts, and those like (b) as artificial. We take a sample of 500 ANs from the Annotated Gigaword corpus (Napoles et al., 2012), and choose three natural and three artificial contexts for each. We generate p/h pairs by deleting/inserting the A for the natural/artificial contexts, respectively, and collect human judgements on the effect of the INS(A) operation for both cases. Figure 2 displays the results of this pilot study. In sentences which contain the AN naturally, there is a clear bias toward judgements of "entailment." That is, in contexts when an AN appears, it is often the case that this A is superfluous: the information carried by the A is sufficiently entailed by the context that removing it does not remove information. Sentences (a) and (b) above provide intuition: in the case of sentence (a), trigger phrases like investor confidence make it clear that the system we are discussing is the financial system, whether or not the adjective financial actually appears. No such triggers exist in sentence (b).
Selecting ANs. We next investigate whether the frequency with which an AN is used affects its tendency to entail/be entailed by the head N. Again, we run a small pilot study. We choose 500 ANs stratified across different levels of frequency of occurrence in order to determine if sampling the most frequent ANs introduces bias into our annotation. We see no significant relationship between the frequency with which an AN appears and the entailment judgements we received.

Final design decisions
As a result of the above pilot experiments, we proceed with our study as follows. First, we use only artificial contexts, as we believe this will result in a greater variety of entailment relations and will avoid systematically biasing our judgements toward entailments. Second, we use the most frequent AN pairs, as these will better represent the types of ANs that NLU systems are likely to encounter in practice.
We look at four different corpora capturing four different genres: Annotated Gigaword (Napoles et al., 2012) (News), image captions (Young et al., 2014) (Image Captions), the Internet Argument Corpus (Walker et al., 2012) (Forums), and the prose fiction subset of the GutenTag dataset (Brooke et al., 2015) (Literature). From each corpus, we select the 100 nouns which occur with the largest number of unique adjectives. Then, for each noun, we take the 10 adjectives with which the noun occurs most often. For each AN, we choose 3 contexts in which the N appears unmodified, and generate p/h pairs by inserting the A into each.
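The selection procedure above can be sketched as follows, assuming the corpus has already been reduced to a list of (adjective, noun) occurrences. The function name select_ans and the alphabetical tie-breaking are assumptions made for illustration; the paper does not say how ties were resolved or how ANs were extracted.

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Tuple


def select_ans(
    an_occurrences: Iterable[Tuple[str, str]],
    num_nouns: int = 100,
    adjs_per_noun: int = 10,
) -> List[Tuple[str, str]]:
    """Pick nouns with the most unique adjectives, then each noun's
    most frequent adjectives. Ties are broken alphabetically (assumed)."""
    adj_counts_by_noun = defaultdict(Counter)
    for adj, noun in an_occurrences:
        adj_counts_by_noun[noun][adj] += 1

    # Rank nouns by the number of distinct adjectives that modify them.
    top_nouns = sorted(
        adj_counts_by_noun,
        key=lambda noun: (-len(adj_counts_by_noun[noun]), noun),
    )[:num_nouns]

    selected = []
    for noun in top_nouns:
        # Rank this noun's adjectives by co-occurrence frequency.
        ranked = sorted(
            adj_counts_by_noun[noun].items(),
            key=lambda item: (-item[1], item[0]),
        )
        selected.extend((adj, noun) for adj, _ in ranked[:adjs_per_noun])
    return selected


# Toy occurrences (invented for illustration).
occurrences = (
    [("little", "baby")] * 3 + [("new", "baby")] + [("huge", "problem")] * 2
)
print(select_ans(occurrences, num_nouns=1, adjs_per_noun=1))
```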
We collect 3 judgements for each p/h pair. Since this task is subjective, and we want to focus our analysis on clean instances on which human agreement is high, we remove pairs for which one or more of the annotators chose the "does not make sense" option and pairs for which we do not have at least 2 out of 3 agreement (i.e. at least two workers must have chosen the same score on the 5-point scale). In the end, we have a total of 5,560 annotated p/h pairs, coming roughly evenly from our 4 genres. Figure 3 shows how the entailment relations are distributed in each genre. In Image Captions, the vast majority of ANs are in a forward entailment (restrictive) relation with their head N. In the other genres, however, a substantial fraction (36% for Forums) are in equivalence relations: i.e. the AN denotes the same set as is denoted by the N alone.
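The filtering rule at the start of this subsection, combined with the mean-score aggregation from Section 5.1, might look like the sketch below. The function name aggregate_scores and the use of None to represent the "does not make sense" option are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean
from typing import Optional, Sequence

# Hypothetical encoding: a score of None means the annotator chose
# the "does not make sense" option instead of a 1-5 score.


def aggregate_scores(scores: Sequence[Optional[int]]) -> Optional[float]:
    """Return the mean score for a p/h pair, or None if the pair is dropped."""
    if len(scores) != 3:
        raise ValueError("expected exactly 3 judgements per pair")
    if any(score is None for score in scores):
        return None  # at least one annotator said the pair makes no sense
    most_common_count = Counter(scores).most_common(1)[0][1]
    if most_common_count < 2:
        return None  # no two annotators chose the same score
    return mean(scores)


print(aggregate_scores([5, 5, 4]))
print(aggregate_scores([1, 3, 5]))
print(aggregate_scores([4, None, 4]))
```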

Empirical Analysis
When does N entail AN? If it is possible to insert adjectives into a sentence without adding new information, when does this happen? When is adjectival modification not restrictive? Based on our qualitative analysis, two clear patterns stand out:
1) When the adjective is prototypical of the noun it modifies. In general, we see that adding adjectives which are seen as attributes of the "prototypical" instance of the noun tends to generate entailments. E.g. people are generally comfortable concluding that beach → sandy beach. The same adjective may be prototypical and thus entailed in the context of one noun, but generate a contradiction in the context of another. E.g. if someone has a baby, it is probably fine to say they have a little baby, but if someone has control, it would be a lie to say they have little control (Figure 4; see footnote 4).

Figure 4: Inserting adjectives that are seen as "prototypical" of the noun tends to generate entailments. E.g., beach generally entails sandy beach.
2) When the adjective invokes a sense of salience or importance. Nouns are assumed to be salient and relevant. E.g. answers are assumed (perhaps naively) to be correct, and problems are assumed (perhaps melodramatically) to be current and huge. Inserting adjectives like false or empty tends to generate contradictions (Figure 5).
Figure 5: Unless otherwise specified, nouns are considered to be salient and relevant. Answers are assumed to be correct, and problems to be current.
Footnote 4: These curves show the distribution over entailment scores associated with the INS(A) operation. Yellow curves show, for a single N, the distribution over all the As that modify it. Blue curves show, for a single A, the distribution over all the Ns it modifies.
What do the different natural logic relations look like in practice? Table 3 shows examples of ANs and contexts exhibiting each of the basic entailment relations. Some entailment inferences depend entirely on contextual information (Example 2a) while others arise from common-sense inference (Example 2b). Many of the most interesting examples fall into the independence relation. Recall from Section 3 that independence, in theory, covers ANs such as alleged criminal, in which the AN may or may not entail the N. In practice, the cases we observe falling into the independence relation tend to be those which are especially affected by world knowledge. In Example 3, local economy is considered to be independent of economy when used in the context of President Obama: i.e. the assumption that the president would be discussing the national economy is so strong that even when the president says the local economy is improving, people do not take this to mean that he has said the economy is improving.
Undefined entailment relations. Our annotation methodology (i.e. inferring entailment relations based on the entailments generated by INS and DEL edits) does not enforce that all of the ANs fit into one of the five entailment relations defined by natural logic. Specifically, we observe many instances (∼5% of p/h pairs) in which INS is determined to generate a contradiction, while DEL is said to generate an entailment. In terms of set theory, this is equivalent to the (non-sensical) setting in which "every AN is an instance of N, but no N is an instance of AN." On inspection, these again represent cases in which commonsense assumptions dominate the inference. In Example 6, when given the premise Bush travels to Michigan to discuss the economy, annotators are confident enough that economy does not entail Japanese economy (why on earth would Bush travel to Michigan to discuss the Japanese economy?) that they label the insertion of Japanese as generating a contradiction. However, when presented with the p/h in the opposite direction, annotators agree that the Japanese economy does indeed entail the economy. These examples highlight the flexibility with which humans perform natural language inference, and the need for automated systems to be equally flexible.
Table 3: Examples of ANs in context exhibiting each of the different entailment relations. Note that these are "artificial" contexts (Section 5.2), meaning the adjective was not originally a part of the sentence.
Takeaways. Our analysis in this section results in three key conclusions about AN composition.
1) Despite common assumptions, adjectives do not always restrict the denotation of a noun. Rather, adjectival modification can result in a range of entailment relations, including equivalence and contradiction.
2) There are patterns to when the insertion of an adjective is or is not entailment-preserving, but recognizing these patterns requires common sense and a notion of "prototypical" instances of nouns.
3) The entailment relation that holds between an AN and the head N is highly context dependent.
These observations describe sizable obstacles for automatic NLU systems. Common-sense reasoning is still a major challenge for computers, both in terms of how to learn world knowledge and in how to represent it. In addition, context-sensitivity means that entailment properties of ANs cannot be simply stored in a lexicon and looked up at run time. Such properties make AN composition an important problem on which to focus NLU research.

Benchmarking Current SOTA
We have highlighted why AN composition is an interesting and likely challenging phenomenon for automated NLU systems. We now turn our investigation to the performance of state-of-the-art RTE systems, in order to quantify how well AN composition is currently handled.
The Add-One Entailment Task. We define the "Add-One Entailment" task to be identical to the normal RTE task, except with the constraint that the premise p and the hypothesis h differ only by the atomic insertion of an adjective: h = e(p) where e = INS(A) and A is a single adjective. To provide a consistent interface with a range of RTE systems, we use a binary label set: NON-ENTAILMENT (which encompasses both CONTRADICTION and UNKNOWN) and ENTAILMENT. We want to test on only straightforward examples, so as not to punish systems for failing to classify examples which humans themselves find difficult to judge. In our test set, therefore, we label pairs with mean human scores ≤ 3 as NON-ENTAILMENT, pairs with scores ≥ 4 as ENTAILMENT, and throw away the pairs which fall into the ambiguous range in between. Our resulting train, dev, and test sets contain 4,481, 510, and 387 pairs, respectively. These splits cover disjoint sets of ANs, i.e. none of the ANs appearing in test were seen in train. Individual adjectives and/or nouns can appear in both train and test. The dataset consists of roughly 85% NON-ENTAILMENT and 15% ENTAILMENT. Inter-annotator agreement achieves 93% accuracy.
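The thresholding used to turn mean human scores into binary test labels can be written as a short sketch. The function names add_one_label and label_pairs are hypothetical; the thresholds are the ones stated above.

```python
from typing import Iterable, List, Optional, Tuple


def add_one_label(mean_score: float) -> Optional[str]:
    """Map a mean 1-5 human score to a binary label, or None if ambiguous."""
    if mean_score <= 3:
        return "NON-ENTAILMENT"
    if mean_score >= 4:
        return "ENTAILMENT"
    return None  # strictly between 3 and 4: discarded as ambiguous


def label_pairs(
    scored_pairs: Iterable[Tuple[str, str, float]]
) -> List[Tuple[str, str, str]]:
    """Keep only unambiguous (premise, hypothesis, mean score) triples."""
    labeled = []
    for premise, hypothesis, score in scored_pairs:
        label = add_one_label(score)
        if label is not None:
            labeled.append((premise, hypothesis, label))
    return labeled


# Toy pairs with invented scores, for illustration only.
toy = [
    ("She has a baby .", "She has a little baby .", 4.67),
    ("He has control .", "He has little control .", 1.33),
    ("They saw a dog .", "They saw a brown dog .", 3.33),
]
for row in label_pairs(toy):
    print(row)
```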

RTE Systems
We test a variety of state-of-the-art RTE systems, covering several popular approaches to RTE. These systems are described in more detail below.
Classifier-based. The Excitement Open RTE platform (Magnini et al., 2014) includes a suite of RTE systems, including baseline systems as well as feature-rich supervised systems which provide state-of-the-art performance on the RTE3 datasets (Giampiccolo et al., 2007). We test two systems from Excitement: the simple Maximum Entropy (MaxEnt) model which uses a suite of dense, similarity-based features (e.g. word overlap, cosine similarity), and the more sophisticated Maximum Entropy model (MaxEnt+LR) which uses the same similarity-based features but additionally incorporates features from external lexical resources such as WordNet (Miller, 1995) and VerbOcean (Chklovski and Pantel, 2004). We also train a standard unigram model (BOW).
Transformation-based. The Excitement platform also includes a transformation-based RTE system called BIUTEE (Stern and Dagan, 2012). The BIUTEE system derives a sequence of edits that can be used to transform the premise into the hypothesis. These edits are represented using feature vectors, and the system searches over edit sequences for the lowest cost "proof" of either entailment or non-entailment. The feature weights are set by logistic regression during training.
Deep learning. Bowman et al. (2015a) recently reported very promising results using deep learning architectures and large training data for the RTE task. We test the performance of those same implementations on our Add-One task. Specifically, we test the following models: a basic Sum-of-words model (Sum), which represents both p and h as the sum of their word embeddings, an RNN model, and an LSTM model. We also train a bag-of-vectors model (BOV), which is simply a logistic regression whose features are the concatenated averaged word embeddings of p and h.
For the LSTM, in addition to the normal training setting (i.e. training only on the 5K Add-One training pairs), we test a transfer-learning setting (Transfer). In transfer learning, the model trains first on a large general dataset before fine-tuning its parameters on the smaller set of target-domain training data. For our Transfer model, we train first on the 500K pair SNLI dataset (Bowman et al., 2015a) until convergence, and then fine-tune on the 5K Add-One pairs. This setup enabled Bowman et al. (2015a) to train a high-performance LSTM for the SICK dataset, which is of similar size to our Add-One dataset (∼5K training pairs).

Results
Out-of-the-box performance. To calibrate expectations, we first report the performance of each of the systems on the datasets for which they were originally designed. For the Excitement systems, this is the RTE3 dataset (Table 6a). For the deep learning systems, this is the SNLI dataset (Table 6b). For the deep learning systems, in addition to reporting performance when trained on the SNLI corpus (500K p/h pairs), we report the performance in a reduced training setting in which systems only have access to 5K p/h pairs. This is equivalent to the amount of data we have available for the Add-One task, and is intended to give a sense of the performance improvements we should expect from these systems given the size of the training data.
Figure 6: Performance of SOTA systems on the datasets for which they were originally developed.
Performance on Add-One RTE. Finally, we train each of the systems on the 5,000 Add-One p/h pairs in our dataset and test on our held-out set of 387 pairs. Figure 7 reports the results in terms of accuracy and precision/recall for the ENTAILMENT class. The baseline strategy of predicting the majority class for each adjective, based on the training data, reaches close to human performance (92% accuracy). Given the simplicity of the task (p and h differ by a single word), this baseline strategy should be achievable. However, none of the systems tested come close to this level of performance, suggesting that they fail to learn even the most-likely entailment generated by adjectives (e.g. that INS(brown) probably generates NON-ENTAILMENT and INS(possible) probably generates ENTAILMENT). The best performing system is the RNN, which achieves 87% accuracy, only two points above the baseline of always guessing NON-ENTAILMENT.
Figure 7: Performance of all systems on the Add-One RTE task. The strategy of predicting the majority class for each adjective, based on the training data, reaches near human performance. None of the systems tested come close to human levels, indicating that the systems fail even to memorize the most-likely class for each adjective in training.
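The per-adjective majority-class baseline discussed above is simple enough to sketch in a few lines. The function names and the fallback to the overall majority class for adjectives unseen in training are assumptions; the paper does not specify how unseen adjectives were handled.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def train_majority_baseline(
    train_pairs: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, str], str]:
    """Learn the most frequent label for each inserted adjective.

    train_pairs holds (inserted adjective, gold label) tuples.
    Returns the per-adjective table and an overall fallback label.
    """
    labels_by_adj = defaultdict(Counter)
    overall = Counter()
    for adjective, label in train_pairs:
        labels_by_adj[adjective][label] += 1
        overall[label] += 1
    if not overall:
        raise ValueError("no training pairs given")
    table = {
        adjective: counts.most_common(1)[0][0]
        for adjective, counts in labels_by_adj.items()
    }
    fallback = overall.most_common(1)[0][0]  # assumed handling of unseen As
    return table, fallback


def predict(table: Dict[str, str], fallback: str, adjective: str) -> str:
    """Predict the label generated by INS(adjective)."""
    return table.get(adjective, fallback)


# Toy training data (invented for illustration).
train = [
    ("brown", "NON-ENTAILMENT"),
    ("brown", "NON-ENTAILMENT"),
    ("brown", "ENTAILMENT"),
    ("huge", "NON-ENTAILMENT"),
    ("possible", "ENTAILMENT"),
]
table, fallback = train_majority_baseline(train)
print(predict(table, fallback, "brown"))
print(predict(table, fallback, "possible"))
```

Note that this baseline ignores both the noun and the context, which is what makes its strong performance relative to the tested systems notable.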

Related Work
Past work, both in linguistics and in NLP, has explored different classes of adjectives (e.g. privative, intensional) as they relate to entailment (Kamp and Partee, 1995; Partee, 2007; Boleda et al., 2013; Nayak et al., 2014). In general, prior studies have focused on modeling properties of the adjectives alone, ignoring the context-dependent nature of AN/N entailments: i.e. in prior work little is always restrictive, whether it is modifying baby or control. Pustejovsky (2013) offers a preliminary analysis of the contextual complexities surrounding adjective inference, which reinforces many of the observations we have made here. Hartung and Frank (2011) analyze adjectives in terms of the properties they modify but don't address them from an entailment perspective. Tien Nguyen et al. (2014) look at the adjectives in the restricted domain of computer vision.
Other past work has employed first-order logic and other formal representations of adjectives in order to provide compositional entailment predictions (Amoia and Gardent, 2006; Amoia and Gardent, 2007; McCrae et al., 2014). Although theoretically appealing, such rigid logics are unlikely to provide the flexibility needed to handle the type of common-sense inferences we have discussed here. Distributional representations provide much greater flexibility in terms of representation (Baroni and Zamparelli, 2010; Guevara, 2010; Boleda et al., 2013). However, work on distributional AN composition has so far remained out-of-context, and has mostly been evaluated in terms of overall "similarity" rather than directly addressing the entailment properties associated with composition.

Conclusion
We have investigated the problem of adjective-noun composition, specifically in relation to the task of RTE. AN composition is capable of producing a range of natural logic entailment relationships, at odds with commonly-used heuristics which treat all adjectives as restrictive. We have shown that predicting these entailment relations is dependent on context and on world knowledge, making it a difficult problem for current NLU technologies. When tested, state-of-the-art RTE systems fail to learn to differentiate entailment-preserving insertions of adjectives from non-entailing ones. This is an important distinction for carrying out human-like reasoning, and our results reveal important weaknesses in the representations and algorithms employed by current NLU systems. The Add-One Entailment task we have introduced will allow ongoing RTE research to better diagnose systems' abilities to capture these subtleties of ANs, which have practical effects on natural language inference.