CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

The recent success of natural language understanding (NLU) systems has been troubled by results highlighting the failure of these models to generalize in a systematic and robust way. In this work, we introduce a diagnostic benchmark suite, named CLUTRR, to clarify some key issues related to the robustness and systematicity of NLU systems. Motivated by the classic work on inductive logic programming, CLUTRR requires that an NLU system infer kinship relations between characters in short stories. Successful performance on this task requires both extracting relationships between entities and inferring the logical rules governing these relationships. CLUTRR allows us to precisely measure a model's ability for systematic generalization by evaluating on held-out combinations of logical rules, and allows us to evaluate a model's robustness by adding curated noise facts. Our empirical results highlight a substantial performance gap between state-of-the-art NLU models (e.g., BERT and MAC) and a graph neural network model that works directly with symbolic inputs, with the graph-based model exhibiting both stronger generalization and greater robustness.


Introduction
Natural language understanding (NLU) systems have been extremely successful at reading comprehension tasks, such as question answering (QA) and natural language inference (NLI). An array of existing datasets is available for these tasks. This includes datasets that test a system's ability to extract factual answers from text (Rajpurkar et al., 2016; Nguyen et al., 2016; Trischler et al., 2016; Mostafazadeh et al., 2016; Su et al., 2016), as well as datasets that emphasize commonsense inference, such as entailment between sentences (Bowman et al., 2015; Williams et al., 2018). However, there are growing concerns regarding the ability of NLU systems, and neural networks more generally, to generalize in a systematic and robust way (Bahdanau et al., 2019; Lake and Baroni, 2018; Johnson et al., 2017). For instance, recent work has highlighted the brittleness of NLU systems to adversarial examples (Jia and Liang, 2017), as well as the fact that NLU models tend to exploit statistical artifacts in datasets, rather than exhibiting true reasoning and generalization capabilities (Gururangan et al., 2018; Kaushik and Lipton, 2018). These findings have also dovetailed with the recent dominance of large pre-trained language models, such as BERT, on NLU benchmarks (Devlin et al., 2018; Peters et al., 2018), which suggests that the primary difficulty in these datasets is incorporating the statistics of the natural language, rather than reasoning.
An important challenge is thus to develop NLU benchmarks that can precisely test a model's capability for robust and systematic generalization. Ideally, we want language understanding systems that can not only answer questions and draw inferences from text, but that can also do so in a systematic, logical, and robust way. While such reasoning capabilities are certainly required for many existing NLU tasks, most datasets combine several challenges of language understanding into one, such as co-reference/entity resolution, incorporating world knowledge, and semantic parsing, which makes it difficult to isolate and diagnose a model's capabilities for systematic generalization and robustness.
Our work. Inspired by the classic AI challenge of inductive logic programming (Quinlan, 1990), as well as the recently developed CLEVR dataset for visual reasoning (Johnson et al., 2017), we propose a semi-synthetic benchmark designed to explicitly test an NLU model's ability for systematic and robust logical generalization.
Our benchmark suite, termed CLUTRR (Compositional Language Understanding and Text-based Relational Reasoning), contains a large set of semi-synthetic stories involving hypothetical families. Given a story, the goal is to infer the relationship between two family members, whose relationship is not explicitly mentioned (Figure 1). To solve this task, a learning agent must extract the relationships mentioned in the text, induce the logical rules governing the kinship relationships (e.g., the transitivity of the sibling relation), and use these rules to infer the relationship between a given pair of entities. Crucially, the CLUTRR benchmark allows us to test a learning agent's ability for systematic generalization by testing on stories that contain unseen combinations of logical rules. CLUTRR also allows us to precisely test for various forms of model robustness by adding different kinds of superfluous noise facts to the stories.
We compare the performance of several state-of-the-art NLU systems on this task, including Relation Networks (Santoro et al., 2017), Compositional Attention Networks (Hudson and Manning, 2018) and BERT (Devlin et al., 2018). We find that the generalization ability of these NLU systems is substantially below that of a Graph Attention Network (Veličković et al., 2018), which is given direct access to symbolic representations of the stories. Moreover, we find that the robustness of the NLU systems generally does not improve by training on noisy data, whereas the GAT model is able to effectively learn robust reasoning strategies by training on noisy examples. Both of these results highlight important open challenges for closing the gap between machine reasoning models that work with unstructured text and models that are given access to more structured input.

Related Work
We draw inspiration from the classic work on inductive logic programming (ILP), a long line of reading comprehension benchmarks in NLP, as well as work combining language and knowledge graphs.
Reading comprehension benchmarks. Many datasets have been proposed to test the reading comprehension ability of NLP systems. This includes the SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and MCTest (Richardson et al., 2013) benchmarks that focus on factual questions; the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) benchmarks for sentence understanding; and the bAbI tasks (Weston et al., 2015), to name a few. Our primary contribution to this line of work is the development of a carefully designed diagnostic benchmark to evaluate model robustness and systematic generalization in the context of NLU.
Question-answering with knowledge graphs. Our work is also related to the domain of question answering and reasoning in knowledge graphs (Das et al., 2018; Xiong et al., 2018; Hamilton et al., 2018; Xiong et al., 2017; Welbl et al., 2018; Kartsaklis et al., 2018), where either the model is provided with a knowledge graph to perform inference over or the model must infer a knowledge graph from the text itself. However, unlike previous benchmarks in this domain (which are generally transductive and focus on leveraging and extracting knowledge graphs as a source of background knowledge about a fixed set of entities), CLUTRR requires inductive logical reasoning, where every example requires reasoning over a new set of previously unseen entities.

Benchmark Design
In order to design an NLU benchmark that explicitly tests inductive reasoning and systematic generalization, we build upon the classic ILP task of inferring family (i.e., kinship) relations (Hinton et al., 1986; Muggleton, 1991; Lavrac and Dzeroski, 1994; Kok and Domingos, 2007; Rocktäschel and Riedel, 2017). For example, given the facts that "Alice is Bob's mother" and "Jim is Alice's father", one can infer with reasonable certainty that "Jim is Bob's grandfather." While this example may appear trivial, it is a challenging task to design models that can learn from data to induce the logical rules necessary to make such inferences, and it is even more challenging to design models that can systematically generalize by composing these induced rules. Inspired by this classic task of logical induction and reasoning, the CLUTRR benchmark requires an NLU system to infer and reason about kinship relations by reading short stories. Requiring that the models learn directly from natural language makes this task much more challenging than the purely symbolic ILP setting. However, we leverage insights from traditional ILP to generate these stories in a semi-synthetic manner, providing precise control over the complexity of the reasoning required to solve the task.
[Figure 2 panel labels: Step 1: generate a kinship graph. Step 2: sample a target fact. Step 3: use backward chaining to sample a set of facts. Step 4: convert sampled facts to a natural language story.]
In its entirety, the CLUTRR benchmark suite allows researchers to generate diverse semi-synthetic short stories to test different aspects of inductive reasoning capabilities. We publicly release the entire benchmark suite, including code to generate the semi-synthetic examples, the specific datasets that we introduce here, and the different baselines that we compare with.

Overview of data generation process
The core idea behind the CLUTRR benchmark suite is the following: Given a natural language story describing a set of kinship relations, the goal is to infer the relationship between two entities, whose relationship is not explicitly stated in the story. To generate these stories, we first design a knowledge base (KB) with rules specifying how kinship relations resolve, and we use the following steps to create semi-synthetic stories based on this knowledge base:
Step 1. Generate a random kinship graph that satisfies the rules in our KB.
Step 2. Sample a target fact (i.e., relation) to predict from the kinship graph.
Step 3. Apply backward chaining to sample a set of facts that can prove the target relation (and optionally sample a set of "distracting" or "irrelevant" noise facts).
Step 4. Convert the sampled facts into a natural language story through pre-specified text templates and crowd-sourced paraphrasing.
Figure 2 provides a high-level overview of this idea, and the following subsections describe the data generation process in detail, as well as the diagnostic flexibility afforded by CLUTRR.

Story generation
The short stories in CLUTRR are essentially narrativized renderings of a set of logical facts. In this section, we describe how we sample the logical facts that make up a story by generating random kinship graphs and using backward chaining to produce logical reasoning chains. The conversion from logical facts to natural language narratives is then described in Section 3.3.
Terminology and background. Following standard practice in formal semantics, we use the term atom to refer to a predicate symbol and a list of terms, such as [grandfatherOf, X, Y], where the predicate grandfatherOf denotes the relation between the two variables, X and Y. We restrict the predicates to have an arity of 2, i.e., binary predicates. A logical rule in this setting is of the form $H \vdash B$, where $B$ is the body of the rule, i.e., a conjunction of two atoms ($[\alpha_1, \alpha_2]$), and $H$ is the head, i.e., a single atom ($\alpha$) that can be viewed as the goal or query. For instance, given a knowledge base (KB) R that contains the single rule [grandfatherOf, is also true in a given world. A rule is called a grounded rule if all atoms in the rule are themselves grounded, i.e., all variables are replaced with constants or entities in a world. A fact is a grounded binary predicate. A clause is a conjunction of two or more atoms, $C = (H_C \vdash B_C)$ with $B_C = [\alpha_1, \ldots, \alpha_n]$, which can be built using a set of rules.
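To make the terminology concrete, the following minimal Python sketch represents atoms and a single rule and checks whether a grounded head is entailed by a small world of facts. The class and function names, the reading of `[predicate, a, b]` as "a is the predicate of b", and the rule itself (taken from the Alice/Bob/Jim example in the Benchmark Design section rather than from the CLUTRR knowledge base) are assumptions made for illustration only.

```python
from typing import NamedTuple, Tuple

VARIABLES = frozenset({"X", "Y", "Z"})  # assumed naming convention for variables


class Atom(NamedTuple):
    """A binary predicate; read as 'first is the <predicate> of second' (assumed)."""
    predicate: str
    first: str
    second: str


class Rule(NamedTuple):
    head: Atom                # the goal or query
    body: Tuple[Atom, Atom]   # a conjunction of exactly two atoms


def ground(atom, binding):
    """Replace variables with the entities given in the binding."""
    return Atom(atom.predicate,
                binding.get(atom.first, atom.first),
                binding.get(atom.second, atom.second))


def is_grounded(atom):
    return atom.first not in VARIABLES and atom.second not in VARIABLES


def head_holds(rule, binding, world):
    """The grounded head follows if every grounded body atom is a fact in the world."""
    body = [ground(atom, binding) for atom in rule.body]
    return all(is_grounded(atom) and atom in world for atom in body)


# Hypothetical rule for illustration: a mother's father is a grandfather.
RULE = Rule(head=Atom("grandfatherOf", "X", "Y"),
            body=(Atom("fatherOf", "X", "Z"), Atom("motherOf", "Z", "Y")))
WORLD = {Atom("fatherOf", "Jim", "Alice"), Atom("motherOf", "Alice", "Bob")}
BINDING = {"X": "Jim", "Z": "Alice", "Y": "Bob"}

if head_holds(RULE, BINDING, WORLD):
    print(ground(RULE.head, BINDING))  # the inferred fact about Jim and Bob
```

Storing atoms as immutable tuples keeps facts hashable, so a world can be a plain set and membership tests stay cheap.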
In the context of our data generation process, we distinguish between the knowledge base, R, which contains a finite number of predicates and rules specifying how kinship relations in a family resolve, and a particular kinship graph G, which contains a grounded set of atoms specifying the particular kinship relations that underlie a single story. In other words, R contains the logical rules that govern all the generated stories in CLUTRR, while G contains the grounded facts that underlie a specific story.
Graph generation. To generate the kinship graph G underlying a particular story, we first sample a set of gendered entities (see footnote 2) and kinship relations using a stochastic generation process. This generation process contains a number of tunable parameters, such as the maximum number of children at each node and the probability of an entity being married to another entity, and is designed to produce a valid, but possibly incomplete, "backbone graph". For instance, this backbone graph generation process will specify "parent"/"child" relations between entities but does not add "grandparent" relations. After this initial generation process, we recursively apply the logical rules in R to the backbone graph to produce a final graph G that contains the full set of kinship relations between all the entities.
Backward chaining. The resulting graph G provides the background knowledge for a specific story, as each edge in this graph can be treated as a grounded predicate (i.e., fact) between two entities. From this graph G, we sample the facts that make up the story, as well as the target fact that we seek to predict. First, we (uniformly) sample a target relation $H_C$, which is the fact that we want to predict from the story. Then, from this target relation $H_C$, we run a simple variation of the backward chaining (Gallaire and Minker, 1978) algorithm for k iterations starting from $H_C$, where at each iteration we uniformly sample a subgoal to resolve and then uniformly sample a KB rule that resolves this subgoal. Crucially, unlike traditional backward chaining, we do not stop the algorithm when a proof is obtained; instead, we run for a fixed number of iterations k in order to sample a set of k facts $B_C$ that imply the target relation $H_C$.
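The two procedures above (closing the backbone graph under the rules, then expanding a target fact backwards) can be sketched as follows. This is a simplified illustration, not the released generator: the four composition rules, the entity names, the triple convention `(a, relation, b)` meaning "a is the relation of b", and the choice to stop once the body holds k facts are all assumptions.

```python
import random

# Hypothetical composition rules: "a is r1 of z" and "z is r2 of b" imply "a is head of b".
RULES = {
    ("mother", "mother"): "grandmother",
    ("mother", "father"): "grandmother",
    ("grandmother", "father"): "great-grandmother",
    ("mother", "grandmother"): "great-grandmother",
}


def close_graph(backbone):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(backbone)
    changed = True
    while changed:
        changed = False
        for (a, r1, z) in sorted(facts):
            for (z2, r2, b) in sorted(facts):
                head = RULES.get((r1, r2))
                if z == z2 and head and (a, head, b) not in facts:
                    facts.add((a, head, b))
                    changed = True
    return facts


def decompositions(fact, graph):
    """All pairs of graph facts that imply `fact` through a single rule."""
    a, relation, b = fact
    found = []
    for (r1, r2), head in RULES.items():
        if head != relation:
            continue
        for (a2, rel, z) in sorted(graph):
            if a2 == a and rel == r1 and (z, r2, b) in graph:
                found.append([(a, r1, z), (z, r2, b)])
    return found


def sample_body(target, graph, k, rng):
    """Expand the target backwards until the body holds k facts (or nothing expands)."""
    body = [target]
    while len(body) < k:
        expandable = [i for i, f in enumerate(body) if decompositions(f, graph)]
        if not expandable:
            break
        i = rng.choice(expandable)                          # sample a subgoal
        pair = rng.choice(decompositions(body[i], graph))   # sample a resolving rule
        body[i:i + 1] = pair                                # replace subgoal by its body
    return body


BACKBONE = [("Ann", "mother", "Bea"), ("Bea", "mother", "Cal"), ("Cal", "father", "Dee")]
GRAPH = close_graph(BACKBONE)
TARGET = ("Ann", "great-grandmother", "Dee")
STORY_FACTS = sample_body(TARGET, GRAPH, k=3, rng=random.Random(0))
print(STORY_FACTS)
```

Replacing a subgoal in place keeps the sampled facts ordered along the path from the first target entity to the second, which is convenient when the facts are later rendered as a narrative.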

Adding natural language
So far, we have described the process of generating a conjunctive logical clause $C = (H_C \vdash B_C)$, where $H_C = [\alpha^*]$ is the target fact (i.e., relation) we seek to predict and $B_C = [\alpha_1, \ldots, \alpha_k]$ is the set of supporting facts that imply the target relation. We now describe how we convert this logical representation to natural language through crowd-sourcing.
Paraphrasing using Amazon Mechanical Turk. The basic idea behind our approach is that we show Amazon Mechanical Turk (AMT) crowd-workers the set of facts $B_C$ corresponding to a story and ask the workers to paraphrase these facts into a narrative. Since workers are given a set of facts $B_C$ to work from, they are able to combine and split multiple facts across separate sentences and construct diverse narratives (Figure 3). Appendix 1.6 contains further details on our AMT interface (based on the ParlAI framework (Miller et al., 2017)), data collection, and the quality controls we employed.
Reusability and composition. One challenge for data collection via AMT is that the number of possible stories generated by CLUTRR grows combinatorially as the number of supporting facts increases, i.e., as $k = |B_C|$ grows. This combinatorial explosion for large k, combined with the difficulty of maintaining the quality of the crowd-sourced paraphrasing for long stories, makes it infeasible to obtain a large number of paraphrased examples for k > 3. To circumvent this issue and increase the flexibility of our benchmark, we reuse and compose AMT paraphrases to generate longer stories. In particular, we collected paraphrases for stories containing k = 1, 2, 3 supporting facts and then replaced the entities from these collected stories with placeholders in order to re-use them to generate longer semi-synthetic stories.
Footnote 2: Kinship and gender roles are oversimplified in our data (compared to the real world) to maintain tractability.
An example of a story generated by stitching together two shorter paraphrases is provided below: [Frank]
Thus, instead of simply collecting paraphrases for a fixed number of stories, we instead obtain a diverse
To get a sense of the data quality and difficulty involved in CLUTRR, we asked human annotators to solve the task for random examples of length k = 2, 3, ..., 6. We found that time-constrained AMT annotators performed well (i.e., > 70% accuracy) for k ≤ 3 but struggled with examples involving longer stories, achieving 40-50% accuracy for k > 3. However, trained annotators with unlimited time were able to solve 100% of the examples (Appendix 1.7), highlighting the fact that this task requires attention and involved reasoning, even for humans.
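The placeholder-based reuse of paraphrases described in this section can be sketched as below. The two paraphrases, the bracketed-name convention, the `{e0}`-style placeholder syntax, and the helper names are invented for illustration; the sketch also assumes that each paraphrase describes a chain of entities in order and that the substituted names match the gendered wording of the template.

```python
def make_template(story, entities):
    """Replace each bracketed entity name with a positional placeholder {e0}, {e1}, ..."""
    for i, name in enumerate(entities):
        story = story.replace(f"[{name}]", "{e%d}" % i)
    return story


def stitch(pieces, chain):
    """Fill consecutive templates along a chain of entities.

    `pieces` is a list of (template, number_of_facts) pairs. Neighbouring
    templates share one entity, so a template covering n facts consumes
    n + 1 names and the next template starts at the last of them.
    """
    parts, start = [], 0
    for template, n_facts in pieces:
        names = chain[start:start + n_facts + 1]
        fields = {f"e{i}": f"[{name}]" for i, name in enumerate(names)}
        parts.append(template.format(**fields))
        start += n_facts
    return " ".join(parts)


# Hypothetical collected paraphrases for k = 1 and k = 2 supporting facts.
PIECES = [
    (make_template("[Ann] took her son [Bob] to the park.", ["Ann", "Bob"]), 1),
    (make_template("[Carl] visited his sister [Dana], and later called her father [Ed].",
                   ["Carl", "Dana", "Ed"]), 2),
]

# New names must respect the gender implied by each template's wording.
LONG_STORY = stitch(PIECES, ["Mia", "Tom", "Eva", "Sam"])
print(LONG_STORY)
```

Under these assumptions, two short paraphrases yield a story with three supporting facts without any additional crowd-sourcing.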

Query representation and inference
Representing the question. The AMT paraphrasing approach described above allows us to convert the set of supporting facts $B_C$ to a natural language story, which can be used to predict the target relation/query $H_C$. However, instead of converting the target query, $H_C = [\alpha^*]$, to a natural language question, we opt to represent the target query as a K-way classification task, where the two entities in the target relation are provided as input and the goal is to classify the relation that holds between these two entities. This representation avoids the pitfall of revealing information about the answer in the question (Kaushik and Lipton, 2018).
Representing entities. When generating stories, entity names are randomly drawn from a set of 300 common gendered English names. Thus, the entities differ from run to run. This ensures that the entity names are simply placeholders and are uncorrelated with the task.

Variants of CLUTRR
The modular nature of CLUTRR provides rich diagnostic capabilities for evaluating the robustness and generalization abilities of neural language understanding systems. We highlight some key diagnostic capabilities available via different variations of CLUTRR below. These diagnostic variations correspond to the concrete datasets that we generated in this work, and we describe the results on these datasets in Section 4.
Systematic generalization. Most prominently, CLUTRR allows us to explicitly evaluate a model's ability for systematic generalization. In particular, we rely on the following hold-out procedures to test systematic generalization:
• During training, we hold out a subset of the collected paraphrases, and we only use this held-out subset of paraphrases when generating the test set. Thus, to succeed on CLUTRR, an NLU system must exhibit linguistic generalization and be robust to linguistic variation at test time.
• We also hold out a subset of the logical clauses during training (for clauses of length k > 2). In other words, during training, the model sees all logical rules but does not see all combinations of these logical rules. Thus, in addition to linguistic generalization, success on this task also requires logical generalization.
• Lastly, as a more extreme form of both logical and linguistic generalization, we consider the setting where the models are trained on stories generated from clauses of length ≤ k and evaluated on stories generated from larger clauses of length > k. Thus, we explicitly test the ability of models to generalize to examples that require more steps of reasoning than any example they encountered during training.
Robust Reasoning. In addition to evaluating systematic generalization, the modular setup of CLUTRR also allows us to diagnose model robustness by adding noise facts to the generated narratives. Due to the controlled semi-synthetic nature of CLUTRR, we are able to provide a precise taxonomy of the kinds of noise facts that can be added (Figure 4). In order to structure this taxonomy, it is important to recall that any set of supporting facts $B_C$ generated by CLUTRR can be interpreted as a path, $p_C$, in the corresponding kinship graph G (Figure 2). Based on this interpretation, we view adding noise facts from the perspective of sampling three different types of noise paths, $p_n$, from the kinship graph G:
• Irrelevant facts: We add a path $p_n$, which has exactly one shared end-point with $p_C$. In this way, this is a distractor path, which contains facts that are connected to one of the entities in the target relation, $H_C$, but do not provide any information that could be used to help answer the query.
• Supporting facts: We add a path $p_n$, whose two end-points are on the path $p_C$. The facts on this path $p_n$ are noise because they are not needed to answer the query, but they are supporting facts because they can, in principle, be used to construct alternative (longer) reasoning paths that connect the two target entities.
• Disconnected facts: We add paths which neither originate nor end in any entity on $p_C$. These disconnected facts involve entities and relations that are completely unrelated to the target query.
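The noise taxonomy above depends only on how many end-points of a noise path lie on the proof path, which the following sketch makes explicit. The function names, the triple convention `(a, relation, b)` meaning "a is the relation of b", and the example families are assumptions; in particular, "shared end-point" is read here as "an end-point of the noise path that is an entity on the proof path".

```python
def entities_on(path):
    """All entities that appear in a path of (a, relation, b) facts."""
    return {entity for a, _, b in path for entity in (a, b)}


def noise_type(noise_path, proof_path):
    """Classify a noise path by how many of its two end-points lie on the proof path."""
    on_proof = entities_on(proof_path)
    end_points = (noise_path[0][0], noise_path[-1][2])
    shared = sum(end in on_proof for end in end_points)
    return ("disconnected", "irrelevant", "supporting")[shared]


# Hypothetical proof path: Ann is Bea's mother and Bea is Cal's mother.
PROOF = [("Ann", "mother", "Bea"), ("Bea", "mother", "Cal")]

# Both end-points (Ann, Bea) lie on the proof path: an alternative, longer route.
print(noise_type([("Ann", "sister", "Gia"), ("Gia", "aunt", "Bea")], PROOF))
# Exactly one end-point (Cal) lies on the proof path: a distractor.
print(noise_type([("Cal", "brother", "Dan")], PROOF))
# Neither end-point lies on the proof path.
print(noise_type([("Eve", "sister", "Fay")], PROOF))
```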

Experiments
We evaluate several neural language understanding systems on the proposed CLUTRR benchmark to surface the relative strengths and shortcomings of these models in the context of inductive reasoning and combinatorial generalization. We aim to answer the following key questions:
(Q1) How do state-of-the-art NLU models compare in terms of systematic generalization? Can these models generalize to stories with unseen combinations of logical rules?
(Q2) How does the performance of neural language understanding models compare to a graph neural network that has full access to the graph structure underlying the stories?
(Q3) How robust are these models to the addition of noise facts to a given story?

Baselines
Our primary baselines are neural language understanding models that take unstructured text as input. We consider bidirectional LSTMs (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) (with and without attention), as well as recently proposed models that aim to incorporate inductive biases towards relational reasoning: Relation Networks (RN) (Santoro et al., 2017) and Compositional Attention Networks (MAC) (Hudson and Manning, 2018). We also use the large pretrained language model, BERT (Devlin et al., 2018), as well as a modified version of BERT having a trainable LSTM encoder on top of the pretrained BERT embeddings. All of these models (except BERT) were re-implemented in PyTorch 1.0 (Paszke et al., 2017) and adapted to work with the CLUTRR benchmark.
Since the underlying relations in the stories generated by CLUTRR inherently form a graph, we also experiment with a Graph Attention Network (GAT) (Veličković et al., 2018). Rather than taking the textual stories as input, the GAT baseline receives a structured graph representation of the facts that underlie the story.
Entity and query representations. We use the various baseline models to encode the natural language story (or graph) into a fixed-dimensional embedding. With the exception of the BERT models, we do not use pre-trained word embeddings and learn the word embeddings from scratch using end-to-end backpropagation. An important note, however, is that we perform Cloze-style anonymization (Hermann et al., 2015) of the entities (i.e., names) in the stories, where each entity name is replaced by a @entity-k placeholder, which is randomly sampled from a small, fixed pool of placeholder tokens. The embeddings for these placeholders are randomly initialized and fixed during training. To make a prediction about a target query given a story, we concatenate the embedding of the story (generated by the baseline model) with the embeddings of the two target entities and we feed this concatenated embedding to a 2-layer feed-forward neural network with a softmax prediction layer.
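The Cloze-style anonymization step can be illustrated with a short sketch. It assumes that entity names are marked with square brackets in the story text and that the placeholder pool has 20 tokens; both choices, along with the function name, are hypothetical and not taken from the released code.

```python
import random
import re

NAME_PATTERN = re.compile(r"\[([^\]]+)\]")  # assumed marking: names appear as [Name]


def anonymize(story, pool_size=20, rng=None):
    """Replace each distinct name with a randomly drawn @entity-k placeholder."""
    rng = rng or random.Random()
    names = list(dict.fromkeys(NAME_PATTERN.findall(story)))  # distinct, in order
    ids = rng.sample(range(pool_size), len(names))            # no placeholder reused
    mapping = {name: f"@entity-{k}" for name, k in zip(names, ids)}
    text = NAME_PATTERN.sub(lambda match: mapping[match.group(1)], story)
    return text, mapping


ANON, MAPPING = anonymize(
    "[Mia] took her son [Tom] to the park. [Tom] called [Eva].",
    rng=random.Random(0),
)
print(ANON)
print(MAPPING)
```

Because placeholders are drawn afresh for every story, a given token carries no stable meaning across examples, which matches the intent that names act purely as placeholders.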

Experimental Setup
Hyperparameters. We selected hyperparameters for all models using an initial grid search on the systematic generalization task (described below). All models were trained for 100 epochs with the Adam optimizer and a learning rate of 0.001. The Appendix provides details on the selected hyperparameters.
Generated datasets. For all experiments, we generated datasets with 10-15k training examples. In many experiments, we report training and testing results on stories with different clause lengths k. (For brevity, we use the phrase "clause length" throughout this section to refer to the value $k = |B_C|$, i.e., the number of steps of reasoning that are required to predict the target query.) In all cases, the training set contains 5,000 training stories per k value, and, during testing, all experiments use 100 test stories per k value. All experiments were run 10 times with different randomly generated stories, and means and standard errors over these 10 runs are reported. As discussed in Section 3.5, during training we hold out 20% of the paraphrases, as well as 10% of the possible logical clauses.
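The hold-out proportions mentioned above can be sketched as a simple random split. The helper name, the four relation labels, and the enumeration of clauses as relation sequences are assumptions for illustration; in particular, a real generator would only enumerate clauses that are valid under the kinship rules.

```python
import itertools
import random


def hold_out(items, fraction, rng):
    """Shuffle the items and reserve roughly `fraction` of them for testing only."""
    pool = sorted(items)           # sort first so the split depends only on the seed
    rng.shuffle(pool)
    n_held = round(fraction * len(pool))
    return pool[n_held:], pool[:n_held]   # (train, held-out)


rng = random.Random(0)

# Hypothetical paraphrase identifiers; 20% are reserved for the test set.
paraphrases = [f"paraphrase-{i}" for i in range(10)]
train_paraphrases, test_paraphrases = hold_out(paraphrases, 0.20, rng)

# Hypothetical clauses of length k = 3, written as sequences of relations; 10% held out.
relations = ["father", "mother", "sister", "brother"]
clauses = list(itertools.product(relations, repeat=3))
train_clauses, test_clauses = hold_out(clauses, 0.10, rng)

print(len(train_paraphrases), len(test_paraphrases))
print(len(train_clauses), len(test_clauses))
```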

Results and Discussion
With our experimental setup in place, we now address the three key questions (Q1-Q3) outlined at the beginning of Section 4.

Q1: Systematic Generalization
We begin by using CLUTRR to evaluate the ability of the baseline models to perform systematic generalization (Q1). In this setting, we consider two training regimes: in the first regime, we train all models with clauses of length k = 2, 3, and in the second regime, we train with clauses of length k = 2, 3, 4. We then test the generalization of these models on test clauses of length k = 2, ..., 10. Figure 5 illustrates the performance of different models on this generalization task. We observe that the GAT model is able to perform near-perfectly on the held-out logical clauses of length k = 3, with the BERT-LSTM being the top performer among the text-based models but still significantly below the GAT. Not surprisingly, the performance of all models degrades monotonically as we increase the length of the test clauses, which highlights the challenge of "zero-shot" systematic generalization (Lake and Baroni, 2018; Sodhani et al., 2018). However, as expected, all models improve their generalization performance when trained on k = 2, 3, 4 rather than just k = 2, 3 (Figure 5, right). The GAT, in particular, achieves the biggest gain from this expanded training.

Q2: The Benefit of Structure
The empirical results on systematic generalization also provide insight into how the text-based NLU systems compare against the graph-based GAT model that has full access to the logical graph structure underlying the stories (Q2). Indeed, the relatively strong performance of the GAT model (Figure 5) suggests that the language-based models fail to learn a robust mapping from the natural language narratives to the underlying logical facts.
To further confirm this trend, we ran experiments with modified train and test splits for the text-based models, where the same set of natural language paraphrases was used to construct the narratives in both the train and test splits (see Appendix 1.3 for details). In this simplified setting, the text-based models must still learn to reason about held-out logical patterns, but the difficulty of parsing the natural language is essentially removed, as the same natural language paraphrases are used during testing and training. We found that the text-based models were competitive with the GAT model in this simplified setting (Appendix Figure 1), confirming that the poor performance of the text-based models on the main task is driven by the difficulty of parsing the unseen natural language narratives.

Q3: Robust Reasoning
Finally, we use CLUTRR to systematically evaluate how various baseline neural language understanding systems cope with noise (Q3). In all the experiments we provide a combination of k = 2 and k = 3 length clauses in training and testing, with noise facts being added to the train and/or test set depending on the setting (Table 2). We use the different types of noise facts defined in Section 3.5. Overall, we find that the GAT baseline outperforms the unstructured text-based models across most testing scenarios (Table 2), which showcases the benefit of a structured feature space for robust reasoning. When training on clean data and testing on noisy data, we observe two interesting trends that highlight the benefits and shortcomings of the various model classes:
1. All the text-based models excluding BERT actually perform better when testing on examples that have supporting or irrelevant facts added. This suggests that these models actually benefit from having more content related to the entities in the story. Even though this content is not strictly useful or needed for the reasoning task, it may provide some linguistic cues (e.g., about entity genders) that the models exploit. In contrast, the BERT-based models do not benefit from the inclusion of this extra content, which is perhaps due to the fact that they are already built upon a strong language model (e.g., one that already adequately captures entity genders).
2. The GAT model performs poorly when supporting facts are added but has no performance drop when disconnected facts are added. This suggests that the GAT model is sensitive to changes that introduce cycles in the underlying graph structure but is robust to the addition of noise that is disconnected from the target entities.
Moreover, when we trained on noisy examples, we found that only the GAT model was able to consistently improve its performance (Table 2). Again, this highlights the performance gap between the unstructured text-based models and the GAT.

Conclusion
In this paper we introduced the CLUTRR benchmark suite to test the systematic generalization and inductive reasoning capabilities of NLU systems. We demonstrated the diagnostic capabilities of CLUTRR and found that existing NLU systems exhibit relatively poor robustness and systematic generalization capabilities, especially when compared to a graph neural network that works directly with symbolic input. These results highlight the gap that remains between machine reasoning models that work with unstructured text and models that are given access to more structured input. We hope that by using this benchmark suite, progress can be made in building more compositional, modular, and robust NLU systems.