SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning

This paper proposes a question-answering (QA) benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior work and is challenging for state-of-the-art language models (LM). We propose a distant supervision method to improve on this task. Specifically, we design grammar and reasoning rules to automatically generate a spatial description of visual scenes and corresponding QA pairs. Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs’ capability on spatial understanding, which in turn helps to better solve two external datasets, bAbI and boolQ. We hope that this work can foster investigations into more sophisticated models for spatial reasoning over text.


Introduction
Spatial reasoning is a cognitive process based on the construction of mental representations for spatial objects, relations, and transformations (Clements and Battista, 1992), which is necessary for many natural language understanding (NLU) tasks such as natural language navigation (Chen et al., 2019; Roman Roman et al., 2020; Kim et al., 2020), human-machine interaction (Landsiedel et al., 2017; Roman Roman et al., 2020), dialogue systems (Udagawa et al., 2020), and clinical analysis (Datta and Roberts, 2020).
Modern language models (LM), e.g., BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and XLNet (Yang et al., 2019), have seen great successes in natural language processing (NLP). However, there has been limited investigation into spatial reasoning capabilities of LMs. To the best of our knowledge, bAbI (Weston et al., 2015) (Fig 9) is the only dataset with direct textual spatial question answering (QA) (Task 17), but it is synthetic and overly simplified: (1) The underlying scenes are spatially simple, with only three objects and relations only in four directions. (2) The stories for these scenes are two short, templated sentences, each describing a single relation between two objects. (3) The questions typically require up to two-step reasoning due to the simplicity of those stories.
* Work was done while at the Allen Institute for AI.
To address these issues, this paper proposes a new dataset, SPARTQA (see Fig. 1). Specifically, (1) SPARTQA is built on NLVR's (Suhr et al., 2017) images containing more objects with richer spatial structures (Fig. 1b). (2) SPARTQA's stories are more natural, have more sentences, and are richer in spatial relations in each sentence. (3) SPARTQA's questions require deeper reasoning and have four types: find relation (FR), find blocks (FB), choose object (CO), and yes/no (YN), which allow for more fine-grained analysis of models' capabilities.
We showed annotators random images from NLVR, and instructed them to describe objects and relationships naturally rather than exhaustively (Sec. 3). In total, we obtained 1.1k unique QA pair annotations on spatial reasoning, evenly distributed among the aforementioned types. Similar to bAbI, we keep this dataset at a relatively small scale and suggest using as little training data as possible. Experiments show that modern LMs (e.g., BERT) do not perform well in this low-resource setting.
This paper thus proposes a way to obtain distant supervision signals for spatial reasoning (Sec. 4). As spatial relationships are rarely mentioned in existing corpora, we take advantage of the fact that spatial language is grounded to the geometry of visual scenes. We are able to automatically generate stories for NLVR images (Suhr et al., 2017) via our newly designed context free grammars (CFG) and context-sensitive rules. In the process of story generation, we store the information about all objects and relationships, such that QA pairs can also be generated automatically. In contrast to bAbI, we use various spatial rules to infer new relationships in these QA pairs, which requires more complex reasoning capabilities. Hereafter, we call this automatically-generated dataset SPARTQA-AUTO, and the human-annotated one SPARTQA-HUMAN.
Experiments show that, by further pretraining on SPARTQA-AUTO, we improve LMs' performance on SPARTQA-HUMAN by a large margin.[2] The spatially-improved LMs also show stronger performance on two external QA datasets, bAbI and boolQ (Clark et al., 2019): BERT further pretrained on SPARTQA-AUTO only requires half of the training data to achieve 99% accuracy on bAbI as compared to the original BERT; on boolQ's development set, this model shows better performance than BERT, with 2.3% relative error reduction.[3]
[2] Further pretraining LMs has become a common practice and baseline method for transferring knowledge between tasks (Phang et al., 2018; Zhou et al., 2020). We leave more advanced methods for future work.
[3] To the best of our knowledge, the test set or leaderboard of boolQ has not been released yet.

STORY:
We have three blocks, A, B and C. Block B is to the right of block C and it is below block A. Block A has two black medium squares. Medium black square number one is below medium black square number two and a medium blue square. It is touching the bottom edge of this block. The medium blue square is below medium black square number two. Block B contains one medium black square. Block C contains one medium blue square and one medium black square. The medium blue square is below the medium black square.

QUESTIONS:
FB: Which block(s) has a medium thing that is below a black square? A, B, C
FB: Which block(s) doesn't have any blue square that is to the left of a medium square? A, B
FR: What is the relation between the medium black square which is in block C and the medium square that is below a medium black square that is touching the bottom edge of a block? Left
CO: Which object is above a medium black square? the medium black square which is in block C or medium black square number two? medium black square number two
YN: Is there a square that is below medium square number two above all medium black squares that are touching the bottom edge of a block? Yes

Our contributions can be summarized as follows. First, we propose the first human-curated benchmark, SPARTQA-HUMAN, for spatial reasoning with richer spatial phenomena than the prior synthetic dataset bAbI (Task 17).
Second, we exploit the scene structure of images and design novel CFGs and spatial reasoning rules to automatically generate data (i.e., SPARTQA-AUTO) to obtain distant supervision signals for spatial reasoning over text.
Third, SPARTQA-AUTO proves to be a rich source of spatial knowledge that improves the performance of LMs on SPARTQA-HUMAN as well as on different data domains such as bAbI and boolQ.

Related work
Question answering is a useful format to evaluate machines' capability of reading comprehension (Gardner et al., 2019) and many recent works have been implementing this strategy to test machines' understanding of linguistic formalisms: He et al. and Cardie (2020). An important advantage of QA is using natural language to annotate natural language, thus having the flexibility to get annotations on complex phenomena such as spatial reasoning. However, spatial reasoning phenomena have been covered minimally in the existing works.
To the best of our knowledge, Task 17 of the bAbI project (Weston et al., 2015) is the only QA dataset focused on textual spatial reasoning (examples in Appendix F). However, bAbI is synthetic and does not reflect the complexity of the spatial reasoning in natural language. Solving Task 17 of bAbI typically does not require sophisticated reasoning, which is an important capability emphasized by more recent works (e.g., Dua et al. Spatial reasoning is arguably more prominent in multi-modal QA benchmarks, e.g., NLVR (Suhr et al., 2017), VQA (Antol et al., 2015), GQA (Hudson and Manning, 2019), CLEVR (Johnson et al., 2017). However, those spatial reasoning phenomena are mostly expressed naturally through images, while this paper focuses on studying spatial reasoning on natural language. Some other works on visual-spatial reasoning are based on geographical information inside maps and diagrams (Huang et al., 2019) and navigational instructions (Chen et al., 2019;Anderson et al., 2018).
As another approach to evaluate spatial reasoning capabilities of models, a dataset proposed in Ghanimifard and Dobnik (2017) generates a synthetic training set of spatial sentences and evaluates the models' ability to generate spatial facts and sentences containing composition and decomposition of relations on grounded objects.

SPARTQA-HUMAN
To mitigate the aforementioned problems of Task 17 of bAbI, i.e., simple scenes, stories, and questions, we describe the data annotation process of SPARTQA-HUMAN, and explain how those problems were addressed in this section.
First, we randomly selected a subset of NLVR images, each of which has three blocks containing multiple objects (see Fig 1b). The scenes shown by these images are more complicated than those described by bAbI because (1) there are more objects in NLVR images; (2) the spatial relationships in NLVR are not limited to just four relative directions as objects are placed arbitrarily within blocks.
Second, two student volunteers produced textual descriptions of those objects and their corresponding spatial relationships based on these images. Since the blocks are always horizontally aligned in each NLVR image, to allow for more flexibility, annotators could also rearrange these blocks (see Fig. 1a). Relationships between objects within the same block can take the forms of relative direction (e.g., left or above), qualitative distance (e.g., near or far), and topological relationship (e.g., touching or containing).
Figure 2: For "A blue circle is above a big triangle. To the left of the big triangle, there is a square," if the question is: "Is the square to the left of the blue circle?", the answer is neither Yes nor No. Thus, the correct answer is "Do not Know" (DK) in our setting.
However, we instructed the annotators not to describe all objects and relationships, (1) to avoid unnecessarily verbose stories, and (2) to intentionally miss some information to enable more complex reasoning later. Therefore, annotators describe only a random subset of blocks, objects, and relationships.
To query more interesting phenomena, annotators were then encouraged to write questions requiring detecting relations and reasoning over them using multiple spatial rules. A spatial rule can be one of the transitivity (A → B, B → C ⇒ A → C), symmetry (A → B ⇒ B → A), converse ((A, R, B) ⇒ (B, reverse(R), A)), inclusion (obj1 in A), and exclusion (obj1 not in B) rules.
There are four types of questions (Q-TYPE).
(1) FR: find relation between two objects. (2) FB: find the block that contains certain object(s). (3) CO: choose between two objects mentioned in the question that meets certain criteria. (4) YN: a yes/no question that tests if a claim on spatial relationship holds.
FB, FR, and CO questions are formulated as multiple-choice questions and receive a list of candidate answers, while the answer to a YN question is chosen from Yes, No, or "DK" (Do not Know). The "DK" option is due to the open-world assumption of the stories, where if something is not described in the text, it is not considered as false (see Fig. 2).
Finally, annotators were able to create 1.1k QA pairs on spatial reasoning on the generated descriptions, distributed among the aforementioned types. We intentionally keep this data at a relatively small scale for two reasons. First, there has been some consensus in our community that modern systems, given their sufficiently large model capacities, can easily find shortcuts and overfit a dataset if provided with a large amount of training data (Gardner et al., 2020; Sen and Saffari, 2020). Second, collecting spatial reasoning QAs is very costly: the two annotators spent 45-60 mins on average to create a single story with 8-16 QA pairs. We estimate that SPARTQA-HUMAN cost about 100 human hours in total. The expert performance on 100 examples of SPARTQA-HUMAN's test set, measured by accuracy in answering the questions, is 92% on average across the four Q-TYPEs, indicating its high quality.
Since human annotations are costly, it is important to investigate ways to generate distant supervision signals for spatial reasoning. However, unlike conventional distant supervision approaches (e.g., Mintz et al. (2009);Zeng et al. (2015); Zhou et al. (2020)) where distant supervision data can be selected from large corpora by implementing specialized filtering rules, spatial reasoning does not appear often in existing corpora. Therefore, similar to SPARTQA-HUMAN, we take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SPARTQA-AUTO, and below we describe its generation process in detail.
Story generation. Since NLVR comes with structured descriptions of the ground truth locations of those objects, we were able to choose random blocks and objects from each image programmatically. The benefit is two-fold. First, a random selection of blocks and objects allows us to create multiple stories for each image; second, this randomness also creates spatial reasoning opportunities with missing information.
Once we decide on a set of blocks and objects to be included, we determine their relationships: Those relationships between blocks are generated randomly; as for those between objects, we refer to the ground truth of these images to determine them. Now we have a scene containing a set of blocks and objects and their associated relationships. To produce a story for this scene, we design CFGs to produce natural language sentences that describe those blocks/objects/relationships in various expressions (see Fig. 3 for two portions of our CFG describing relative and nested relations between objects).
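The CFG-based sentence generation described above can be illustrated with a minimal sketch. The grammar below is a hypothetical toy, loosely modeled on a production of the form S → <Article> <Object> is <Relation> <Article> <Object>; its symbols, vocabulary, and the helper names `expand` and `generate_sentence` are assumptions made for illustration, not the grammar used to build SPARTQA-AUTO. In particular, this sketch picks entities and relations at random, whereas the actual generation grounds them in the chosen scene.

```python
import random

# Hypothetical toy grammar (an assumption for illustration only).
GRAMMAR = {
    "S": [["Article", "Object", "is", "Relation", "Article", "Object", "."]],
    "Article": [["the"]],
    "Object": [["Size", "Color", "Shape"], ["Color", "Shape"], ["Size", "Shape"]],
    "Size": [["big"], ["medium"], ["small"]],
    "Color": [["black"], ["blue"], ["yellow"]],
    "Shape": [["square"], ["circle"], ["triangle"], ["shape"]],
    "Relation": [["above"], ["below"], ["to the left of"], ["to the right of"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate_sentence(rng):
    """Generate one sentence from the start symbol S."""
    words = expand("S", rng)
    text = " ".join(words[:-1]) + words[-1]  # attach the final period
    return text[0].upper() + text[1:]


rng = random.Random(0)
sentence = generate_sentence(rng)
print(sentence)
```

Adding alternative productions for a non-terminal is what yields varied surface expressions for the same underlying relation.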

Figure 3 (excerpt): S → <Article> <Object> is <Relation> <Article> <Object>. Example: The big black shape is above the medium triangle.
Being grounded to visual scenes guarantees spatial coherency in a story, and using CFGs helps to produce grammatically correct sentences and various expressions. We also design context-sensitive rules to limit the options for each CFG variable based on the chosen entities (e.g., black circle), or on what is described in the previous sentences (e.g., Block A has a circle. The circle is below a triangle.).
Question generation. To generate questions based on a passage, there are rule-based systems (Heilman and Smith, 2009; Labutov et al., 2015), neural networks (Du et al., 2017), and their combinations (Dhole and Manning, 2020). However, in our approach, while generating each story, the program stores the information about the entities and their relationships. Thus, without processing the raw text, which is error-prone, we generate questions by only looking at the stored data. The question generation operates based on four primary functionalities, Choose-objects, Describe-objects, Find-all-relations, and Find-similar-objects. These modules are responsible for controlling the logical consistency, correctness, and the number of steps required for reasoning in each question.
Choose-objects randomly chooses up to three objects from the set of possible objects in a story under a set of constraints such as preventing selection of similar objects, or excluding objects with relations that are directly mentioned in the text.
Describe-objects generates a mention phrase for an object using parts of its full name (presented in the story). The generated phrase points either to a unique object or to a group of objects, such as "the big circle" or "big circles." To describe a unique object, it chooses an attribute or a group of attributes that apply to a unique object among others in the story. To increase the steps of reasoning, the description may include the relationship of the object to other objects instead of using a direct unique description, for example, "the circle which is above the black triangle."
Find-all-relations completes the relationship graph between objects by applying a set of spatial rules such as transitivity, symmetry, converse, inclusion, and exclusion on top of the direct relations described in the story. As shown in Fig. 4, it does an exhaustive search over all combinations of the relations that link two objects to each other.
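The idea of completing the relationship graph can be sketched as a fixed-point computation over (subject, relation, object) triples. This is a minimal illustration under stated assumptions: the relation vocabulary, the `CONVERSE` table, and the function name `find_all_relations` are hypothetical, and only the converse and transitivity rules are covered; symmetry, inclusion, and exclusion are left out for brevity.

```python
from itertools import product

# Assumed relation vocabulary and converses (illustrative subset).
CONVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}


def find_all_relations(direct_facts):
    """Close a set of (subject, relation, object) triples under
    the converse and transitivity rules until nothing new is derived."""
    facts = set(direct_facts)
    changed = True
    while changed:
        changed = False
        new = set()
        # Converse: (A, R, B) => (B, reverse(R), A)
        for (a, r, b) in facts:
            new.add((b, CONVERSE[r], a))
        # Transitivity: (A, R, B) and (B, R, C) => (A, R, C)
        for (a, r1, b), (c, r2, d) in product(facts, repeat=2):
            if b == c and r1 == r2 and a != d:
                new.add((a, r1, d))
        if not new <= facts:
            facts |= new
            changed = True
    return facts


story = {("circle", "above", "triangle"), ("triangle", "above", "square")}
closure = find_all_relations(story)
```

Here the two stated facts yield four additional ones, including ("circle", "above", "square"), which is never mentioned directly and therefore needs a two-step inference.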
Find-similar-objects finds all the mentions matching a description from the question to objects in the story. For instance, for the question "is there any blue circle above the big blue triangle?", this module finds all the mentions in the story matching the description "a blue circle".
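A simple way to picture Find-similar-objects is attribute matching between a description and the stored objects. The sketch below assumes that each object is stored as a dictionary of size, color, and shape; the data layout, `VOCAB`, and both function names are hypothetical and chosen only for illustration.

```python
# Assumed attribute vocabulary (illustrative only).
VOCAB = {
    "size": {"big", "medium", "small"},
    "color": {"black", "blue", "yellow"},
    "shape": {"square", "circle", "triangle"},
}


def parse_description(phrase, vocab):
    """Turn a phrase like 'a blue circle' into {attribute: value} constraints."""
    constraints = {}
    for word in phrase.lower().split():
        for attribute, values in vocab.items():
            if word in values:
                constraints[attribute] = word
    return constraints


def find_similar_objects(phrase, story_objects, vocab):
    """Return the ids of all stored objects that satisfy every constraint."""
    constraints = parse_description(phrase, vocab)
    return [
        obj["id"]
        for obj in story_objects
        if all(obj.get(attr) == value for attr, value in constraints.items())
    ]


objects = [
    {"id": "o1", "size": "big", "color": "blue", "shape": "circle"},
    {"id": "o2", "size": "small", "color": "blue", "shape": "circle"},
    {"id": "o3", "size": "big", "color": "blue", "shape": "triangle"},
]
matches = find_similar_objects("a blue circle", objects, VOCAB)
```

An underspecified description such as "a blue circle" matches several objects, which is what makes quantified questions ("all", "any") require aggregation.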
Similar to SPARTQA-HUMAN, we provide four Q-TYPEs: FR, FB, CO, and YN. To generate FR questions, we choose two objects using the Choose-objects module and question their relationships. The YN Q-TYPE is similar to FR, but the question specifies one relationship of interest, chosen from all relations extracted by the Find-all-relations module, to be questioned about the objects. Since Yes/No questions are simpler problems most of the time, we make this question type more complex by adding quantifiers ("all" and "any"). These quantifiers help evaluate the models' capability to aggregate relations between more than two objects in the story and to reason over all found relations to find the final answer. In the FB Q-TYPE, we mention an object by its indirect relation to another object using the nested relation in the Describe-objects module and ask to find the blocks containing or not containing this object. Finally, the CO question selects an anchor object (Choose-objects) and specifies a relationship (using Find-all-relations) in the question. Two other objects are chosen as candidates to check whether the specified relationship holds between them and the anchor object. We bias the algorithm toward choosing candidates that have at least one relationship to the anchor object. For more details about the different question templates, see Table 7 in the Appendix.
Answer generation. We compute all direct and indirect relationships between objects using the Find-all-relations function and, based on the Q-TYPE, generate the final answer.
For instance, in the YN Q-TYPE, if the asked relation exists in the found relations, the answer is "Yes"; if the inverse relation exists, it must be "No"; otherwise, it is "DK".
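The three-way YN decision above can be written as a short lookup. This sketch assumes that `known_relations` already holds the full set of inferred triples and that `OPPOSITE` lists the inverse of each relation; both names, and the function `answer_yes_no`, are hypothetical. Quantified questions ("all", "any") are not handled here.

```python
# Assumed inverse-relation table (illustrative subset).
OPPOSITE = {"above": "below", "below": "above", "left": "right", "right": "left"}


def answer_yes_no(subject, relation, obj, known_relations):
    """Open-world decision: Yes if the relation is known, No if its inverse
    is known, and DK (Do not Know) when the story supports neither."""
    if (subject, relation, obj) in known_relations:
        return "Yes"
    if (subject, OPPOSITE[relation], obj) in known_relations:
        return "No"
    return "DK"


# Facts from the example "A blue circle is above a big triangle.
# To the left of the big triangle, there is a square."
known = {
    ("blue circle", "above", "big triangle"),
    ("square", "left", "big triangle"),
}
print(answer_yes_no("square", "left", "blue circle", known))
```

For the question "Is the square to the left of the blue circle?", neither the relation nor its inverse can be derived, so the sketch returns "DK".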

Corpus Statistics
We generate the train, dev, and test set splits based on the same splits of the images in the NLVR dataset. On average, each story contains 9 sentences (Min: 3, Max: 22) and 118 tokens (Min: 66, Max: 274). Also, the average number of tokens per question (over all Q-TYPEs) is 23 (Min: 6, Max: 57).
While the answer to a YN question is a single label chosen from Yes, No, and DK, FR questions can have multiple correct answers. Therefore, we treat each candidate answer to FR as an independent binary classification problem, and take the union as the final answer. As for YN, we choose the label with the highest confidence (Fig 8b).
As the candidate answers to FB and CO are not fixed and depend on each story and its question, the input sequences to these Q-TYPEs are concatenated with each candidate answer. Since the model defined for YN and FR has moderately less accurate results on FB and CO Q-TYPEs, we add an LSTM (Hochreiter and Schmidhuber, 1997) layer to improve it. Hence, to find the final answer, we run the model with each candidate answer and then apply an LSTM layer on top of all token representations. Then, we use the last vector of the LSTM outputs for classification (Fig 8a). The final answers are selected based on Eq. (1).
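One form of Eq. (1) that is consistent with the description above and with the symbol definitions that follow is sketched here. The intermediate symbols t, h_i, and y_i, the use of LM for the underlying language model, and the exact layout are assumptions introduced for illustration and may differ from the original notation.

```latex
\begin{align}
x_i &= [s, c_i, q] \\
(t^i_1, \dots, t^i_{m_i}) &= \mathrm{LM}(x_i) \\
h_i &= \mathrm{LSTM}(t^i_1, \dots, t^i_{m_i})_{m_i} \\
y_i &= \mathrm{softmax}(h_i^{\top} W) \\
\text{Answer} &= \{\, c_i \mid \arg\max_j \, y_i^j = 1 \,\} \tag{1}
\end{align}
```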
where s is the story, c_i is the candidate answer, q is the question, [ ] indicates the concatenation of the listed vectors, and m_i is the number of tokens in x_i. The parameter vector, W, is shared for all candidates.

Training and Inference
We train the models based on the summation of the cross-entropy losses of all binary classifiers in the architecture. For FR and YN Q-TYPEs, there are multiple classifiers, while there is only one classifier used for CO and FB Q-TYPEs. We remove inconsistent answers in postprocessing for FR and YN Q-TYPEs during inference phase. For instance on FR, left and right relations between two objects cannot be valid at the same time. For YN, as there is only one valid answer amongst the three candidates, we select the candidate with the maximal predicted probability of being the true answer.
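The inference-time postprocessing can be sketched as follows. The text states that contradictory FR answers are removed and that YN takes the most probable label; how a contradiction is resolved is not specified, so keeping the more confident relation, the 0.5 threshold, the `CONTRADICTORY` pairs, and the function names are all assumptions of this sketch.

```python
# Assumed pairs of relations that cannot hold simultaneously.
CONTRADICTORY = [("left", "right"), ("above", "below")]


def decode_fr(probabilities, threshold=0.5):
    """probabilities: {relation: P(relation holds)} from independent
    binary classifiers. Returns the union of accepted relations after
    dropping the weaker member of each contradictory pair."""
    selected = {rel for rel, p in probabilities.items() if p >= threshold}
    for first, second in CONTRADICTORY:
        if first in selected and second in selected:
            weaker = first if probabilities[first] < probabilities[second] else second
            selected.discard(weaker)
    return selected


def decode_yn(probabilities):
    """Pick the single label (Yes, No, or DK) with the highest probability."""
    return max(probabilities, key=probabilities.get)


fr_answer = decode_fr({"left": 0.9, "right": 0.6, "above": 0.7, "below": 0.1})
yn_answer = decode_yn({"Yes": 0.40, "No": 0.35, "DK": 0.25})
```

In this example "right" passes the threshold but is discarded because "left" is more confident, so the FR answer is the set containing "left" and "above".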

Experiments
As fine-tuning LMs has become a common baseline approach to knowledge transfer from a source dataset to a target task, including but not limited to Phang et al. (2018); Zhou et al. (2020); He et al. (2020b), we study the capability of spatial reasoning of modern LMs, specifically BERT, ALBERT, and XLNet, after fine-tuning them on SPARTQA-AUTO. This fine-tuning process is also known as further pretraining, to distinguish it from the fine-tuning process on one's target task. It is an open problem to find better transfer learning techniques than simple further pretraining, as suggested in He et al. (2020a); Khashabi et al. (2020), which is beyond the scope of this work. All experiments use the models proposed in Sec. 5. We use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 2 × 10^-6 and Focal Loss (Lin et al., 2017) with γ = 2 for training all the models.

Further pretraining on SPARTQA-AUTO improves spatial reasoning
Table 2 shows performance on SPARTQA-HUMAN in a low-resource setting, where 0.6k QA pairs from SPARTQA-HUMAN are used for fine-tuning these LMs and 0.5k for testing (see Table 1 for information on this split).[7]
Table 2: Further pretraining BERT on SPARTQA-AUTO improves accuracies on SPARTQA-HUMAN. All systems are fine-tuned on the training data of SPARTQA-HUMAN, but Systems 3-5 are also further pretrained in different ways. System 3: further pretrained on the stories from SPARTQA-AUTO as a masked language model (MLM) task. System 4: further pretrained on both stories and QA annotations as MLM. System 5: the proposed model that is further pretrained on SPARTQA-AUTO as a QA task. Avg: The micro-average on all four Q-TYPEs.
[7] During our annotation, we found that the description of "near to" and "far from" varies largely between annotators. Therefore, we ignore these two relations from FR Q-TYPE in our evaluations.
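For reference, the Focal Loss (Lin et al., 2017) used in training with γ = 2 scales the cross-entropy of each example by (1 − p_t)^γ, so that confidently correct predictions contribute little. The sketch below is a plain-Python version for a single binary prediction; it omits the optional class-balancing weight α, and the function name and the clamping constant are assumptions of this illustration rather than details of the experimental code.

```python
import math


def binary_focal_loss(p, y, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted probability of the positive class, y: gold label (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p      # probability assigned to the gold label
    p_t = min(max(p_t, 1e-12), 1.0)     # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = binary_focal_loss(0.9, 1)   # well-classified example, heavily down-weighted
hard = binary_focal_loss(0.1, 1)   # misclassified example, close to cross-entropy
```

With γ = 0 the expression reduces to the ordinary cross-entropy loss.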
In Table 2, System 5, BERT (SPARTQA-AUTO), is the proposed method of further pretraining BERT on SPARTQA-AUTO. We can see that System 2, the original BERT, performs consistently lower than System 5, indicating that having SPARTQA-AUTO as a further pretraining task improves BERT's spatial understanding.
Table 3: Switching from accuracy in Table 2 to F1 shows that the models are all performing better than the majority baseline on YN Q-TYPE.
In addition, we implement another two baselines. System 3, BERT (Stories only; MLM): further pretraining BERT only on the stories of SPARTQA-AUTO as a masked language model (MLM) task; System 4, BERT (SPARTQA-AUTO; MLM): we convert the QA pairs in SPARTQA-AUTO into textual statements and further pretrain BERT on the text as an MLM (see Fig. 5 for an example conversion).
To convert each question and its answer into a sentence, we use static templates for each question type that remove the question words and rearrange the other parts into a sentence.
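A template-based conversion of this kind can be sketched for the FR question type. The regular expression, the function name `fr_to_masked_statement`, and the choice to mask the relation word are assumptions for illustration; the templates actually used for each question type are not reproduced here.

```python
import re

# Hypothetical static template for FR questions.
FR_PATTERN = re.compile(
    r"^What is the relation between (?P<a>.+?) and (?P<b>.+?)\?$"
)


def fr_to_masked_statement(story, question, answer):
    """Drop the question words and rearrange the rest into a statement
    whose relation slot is masked; the answer becomes the MLM target."""
    match = FR_PATTERN.match(question.strip())
    if match is None:
        raise ValueError("question does not fit the FR template")
    subject = match.group("a")
    statement = f"{subject[0].upper()}{subject[1:]} is [MASK] {match.group('b')}."
    return f"{story} {statement}", answer


text, target = fr_to_masked_statement(
    "A big circle is above a triangle. A blue square is below the triangle.",
    "What is the relation between the circle and the blue object?",
    "Above",
)
print(text)
```

The resulting text ends with "The circle is [MASK] the blue object." and the target fills the masked slot.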
We can see that System 3 slightly improves over System 2, an observation consistent with many prior works that seeing more text generally helps an LM (e.g., Gururangan et al. (2020)). The significant gap between System 3 and the proposed System 5 indicates that supervision signals come more from our annotations in SPARTQA-AUTO rather than from seeing more unannotated text. System 4 is another way to make use of the annotations in SPARTQA-AUTO, but it is shown to be not as effective as further pretraining BERT on SPARTQA-AUTO as a QA task.
Figure 5: Convert a triplet of (paragraph, question, answer) into a single piece of text for the MLM task.
QA form: A big circle is above a triangle. A blue square is below the triangle. What is the relation between the circle and the blue object? Answer: Above
MLM form: A big circle is above a triangle. A blue square is below the triangle. The circle is [MASK] the blue object. Answer: Above
While the proposed System 5 overall performs better than the other three baseline systems, one exception is its accuracy on YN, which is lower than that of System 3. Since all systems' YN accuracies are also lower than the majority baseline, we hypothesize that this is due to imbalanced data. To verify this, we compute the F1 score for YN Q-TYPE in Table 3, where we see all systems effectively achieve better scores than the majority baseline. However, further pretraining BERT on SPARTQA-AUTO still does not beat other baseline systems, which implies that straightforward pretraining is not necessarily helpful in capturing the complex reasoning phenomena required by YN questions.
The human performance is evaluated on 100 random questions from each of the SPARTQA-AUTO and SPARTQA-HUMAN test sets. The respondents are graduate students who were trained on some examples of the dataset before answering the final questions. We can see from Table 2 that all systems' performances fall behind human performance by a large margin. We expand on the difficulty of SPARTQA in the next subsection.
Table 4: Spatial reasoning is challenging. We further pretrain three transformer-based LMs, BERT, ALBERT, and XLNet, on SPARTQA-AUTO, and test their accuracy in three ways: Seen and Unseen are both from SPARTQA-AUTO, where Unseen has applied minor modifications to its vocabulary; to get those Human columns, all models are fine-tuned on SPARTQA-HUMAN's training data. Human performance on Seen and Unseen is the same since the changes applied to Unseen do not affect human reasoning.

SPARTQA is challenging
In addition to BERT, we continue to test another two LMs, ALBERT and XLNet (Table 4). We further pretrain these LMs on SPARTQA-AUTO, and test them on SPARTQA-HUMAN (the numbers of BERT are copied from Table 2) and two held-out test sets of SPARTQA-AUTO, Seen and Unseen. Note that when a system is tested against SPARTQA-HUMAN, it is fine-tuned on SPARTQA-HUMAN's training data following its further pretraining on SPARTQA-AUTO. We use the Unseen set to test to what extent the baseline models use shortcuts in the language surface. This set applies minor modifications randomly on a number of stories and questions to change the names of shapes, colors, sizes, and relationships in the vocabulary of the stories, which do not influence the reasoning steps (more details in Appendix C.1). All models perform worst in YN across all Q-TYPEs, which suggests that YN presents a more complex phenomenon, probably due to additional quantifiers in the questions. XLNet performs the best on all Q-TYPEs except its accuracy on SPARTQA-HUMAN's YN section. However, the drops on the Unseen and Human test sets suggest overfitting on the training vocabulary. The low accuracies on the Human test set from all models show that solving this benchmark is still a challenging problem and requires more sophisticated methods, such as spatial role and relation extraction (Kordjamshidi et al., 2010; Rahgooy et al., 2018), to understand stories and questions better.
To evaluate the reliability of the models, we also provide two extra test sets, consistency and contrast. The consistency set is made by changing a part of the question in a way that seeks the same information (Hudson and Manning, 2019; Suhr et al., 2019). Given a pivot question and answer of a specific consistency set, answering other questions in the set does not need extra reasoning over the story.
The contrast set is made by minimally modifying a question to change its answer (Gardner et al., 2020). For contrast sets, there is a need to go back to the story to find the new answer for the question's minor variations (see Appendix C.2 for examples). The consistency and contrast sets are evaluated only on the correctly predicted questions to check whether actual understanding and reasoning occur, which tests the reliability of the models. Table 5 shows the result of this evaluation on four Q-TYPEs of SPARTQA-AUTO, where we can see, once again, that the high scores on the Seen test set are likely due to overfitting on training data rather than correct detection of spatial terms and reasoning over them.
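Evaluating variants only for correctly predicted questions amounts to a conditional accuracy. The sketch below is one way to compute it under assumed data structures: a mapping from pivot question ids to whether the model answered them correctly, and a list of (pivot id, variant correct) pairs. The function name and the layout are hypothetical.

```python
def conditional_accuracy(pivot_correct, variant_results):
    """Accuracy on consistency or contrast variants, restricted to variants
    whose pivot (original) question was answered correctly.

    pivot_correct: {pivot_id: bool}
    variant_results: list of (pivot_id, bool) pairs, one per variant question
    """
    kept = [ok for pivot_id, ok in variant_results if pivot_correct.get(pivot_id, False)]
    if not kept:
        return None  # no pivot was answered correctly, so nothing to score
    return sum(kept) / len(kept)


pivot_correct = {"q1": True, "q2": False, "q3": True}
variants = [("q1", True), ("q1", False), ("q2", True), ("q3", True)]
score = conditional_accuracy(pivot_correct, variants)
```

In this example the variant of "q2" is ignored because its pivot was answered incorrectly, leaving two correct variants out of three.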

Extrinsic evaluation
In this subsection, we take BERT as an example to show that, once further pretrained on SPARTQA-AUTO, BERT can achieve better performance on two extrinsic evaluation datasets, namely bAbI and boolQ.
We draw the learning curve on bAbI, using the original BERT as a baseline and BERT further pretrained on SPARTQA-AUTO (Fig. 6). Although both systems achieve perfect accuracy given large enough training data (i.e., 5k and 10k), BERT (SPARTQA-AUTO) shows better scores given less training data. Specifically, to achieve an accuracy of 99%, BERT (SPARTQA-AUTO) requires 1k training examples, while BERT requires twice as many. We also notice that BERT (SPARTQA-AUTO) converges faster in our experiments.
As another evaluation dataset, we chose boolQ for two reasons. First, we needed a QA dataset with Yes/No questions. To our knowledge, boolQ is the only available one used in recent work. Second, although SPARTQA and boolQ are from different domains, boolQ needs multi-step reasoning, and we wanted to see whether SPARTQA helps with it. Table 6 shows that further pretraining BERT on SPARTQA-AUTO yields a better result than the original BERT and those reported numbers in Clark et al. (2019), which also tested on various distant supervision signals such as SQuAD (Rajpurkar et al., 2016), Google's Natural Question dataset NQ (Kwiatkowski et al., 2019), and QNLI from GLUE (Wang et al., 2018).
We observe that many of the boolQ examples answered correctly by the BERT further pretrained on SPARTQA-AUTO require multi-step reasoning. Our hypothesis is that since solving SPARTQA-AUTO questions needs multi-step reasoning, finetuning BERT on SPARTQA-AUTO generally improves this capability of the base model.

Conclusion
Spatial reasoning is an important problem in natural language understanding. We propose the first human-created QA benchmark on spatial reasoning, and experiments show that state-of-the-art pretrained language models (LM) do not have the capability to solve this task given limited training data, while humans can solve those spatial reasoning questions reliably. To improve LMs' capability on this task, we propose to use hand-crafted grammar and spatial reasoning rules to automatically generate a large corpus of spatial descriptions and corresponding question-answer annotations; further pretraining LMs on this distant supervision dataset significantly enhances their spatial language understanding and reasoning. We also show that a spatially-improved LM can have better results on two extrinsic datasets (bAbI and boolQ).

Table 7 shows the templates used to create questions in SPARTQA-AUTO. The "<object>" is a variable replaced by objects from the story (using Choose-objects and Describe-objects modules), and the "<relation>" variable can be replaced by the chosen relations between objects (using Find-all-relations module). The articles and the indefinite pronouns in each template play an essential role in understanding the question's objective. For example, "Are all blue circles near to a triangle?" is different from "Are there any blue circles near to a triangle?", and "Are there any blue circles near to all triangles?". Therefore, we check the uniqueness of the object definition, using "a" or "the" in proper places and randomly place the terms "any" or "all" in the YN questions to generate different questions.

Table 8 shows the percentage of correct labels in train and test sets. In multi-choice Q-TYPEs, more than one label can be true.

Table 10 shows some generated sentences in SPARTQA-AUTO with some specific features that challenge models to understand different forms of relation description in spatial language.

C Additional Evaluation Sets
Here we describe three extra evaluation sets provided with this dataset in more detail, including unseen test, consistency, and contrast sets.

C.1 Unseen Evaluation Set
We propose an unseen test set alongside the seen test of SPARTQA-AUTO to check whether a model is using shortcuts in the language surface by describing objects and relations with new vocabularies in the samples. This set has minor modifications that should not affect the performance of a consistent and reliable model. The modifications are randomly applied on a number of generated stories and questions and include changing names of shapes, colors, sizes, and relationships' names (describing relationships using different language expressions). The modification choices are described in Table 9.
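The vocabulary modification can be pictured as a word-level substitution over stories and questions. The following is a minimal sketch under stated assumptions: the mapping pairs "small", "medium", and "big" with the size terms "little", "midsize", and "large", and both the pairing and the function name `to_unseen` are hypothetical rather than taken from the released generation code.

```python
import re

# Assumed size replacements; the exact pairing is a guess for illustration.
UNSEEN_VOCAB = {"small": "little", "medium": "midsize", "big": "large"}


def to_unseen(text, vocab=UNSEEN_VOCAB):
    """Replace whole words found in vocab, leaving all other text unchanged."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, vocab)) + r")\b")
    return pattern.sub(lambda match: vocab[match.group(1)], text)


original = "A small blue circle is near to the big circle."
modified = to_unseen(original)
print(modified)
```

A model that relies on the meaning of the description rather than on memorized surface forms should answer the modified sample as it answers the original one.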

C.2 Contrast and Consistency Evaluation
For probing the consistency and semantic sensitivity of models, we provide two extra evaluation test sets, Consistency and Contrast.
Consistency set: This set is made by changing parts of the question in a way that it still asks about the same information (Hudson and Manning, 2019; Suhr et al., 2019). For instance, for the question "What is the relation between the blue circle and the big shape? Left," we create a similar question in the form of "What is the relation between the big shape and the blue circle? Right". Given the answer to the main (pivot) question, a human can answer these variants without any extra reasoning over the story. Hence, the evaluation on this set shows whether models understand the real underlying semantics rather than overfitting to the structure of the questions.
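The consistency example above amounts to swapping the two arguments and inverting the answer. A minimal sketch follows, assuming only four directional relations; the `INVERSE` table and the function name are hypothetical and do not reproduce the dataset's generation code.

```python
# Assumed inverse table for four directional relations.
INVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}


def consistency_variant(first_object, second_object, answer):
    """Swap the two arguments of a relation question and invert its answer."""
    question = f"What is the relation between {second_object} and {first_object}?"
    return question, INVERSE[answer]


variant_question, variant_answer = consistency_variant(
    "the blue circle", "the big shape", "left"
)
print(variant_question, variant_answer)
```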
Contrast set: This set is made by minor changes in a question that change the answer (Gardner et al., 2020). For instance, from the question "Is the blue circle below the black triangle? Yes," we create a contrast question "Is the blue circle below all triangles? No" by changing "the black triangle" to "all triangles". The evaluation on this set shows the robustness of the model and its sensitivity to semantic changes when there are only minor changes in the language surface.

D Extra Annotations
Alongside the main stories and questions of SPARTQA-AUTO, we provide some extra annotations to help models understand the spatial language better.

D.1 Detailed Annotation and Scene-Graphs
Providing in-depth human annotations is quite expensive and time-consuming. In SPARTQA-AUTO, we generate a fine-grained scene-graph based on the story. This scene-graph contains the blocks' descriptions, their relations, and the objects' attributes alongside their direct relations with each other. Models can use the scene-graphs to access all spatial relations directly mentioned in the textual context. Figure 7 shows an example of this scene-graph. The scene-graph can provide strong supervision for question answering challenges and can be used to evaluate models based on their steps of reasoning and decisions.
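To make the content of such a scene-graph concrete, the sketch below stores blocks, object attributes, and direct relations in a nested dictionary and reads the directly stated relations back out. The layout and every field name here are assumptions for illustration; the actual format shown in Figure 7 may differ.

```python
# Hypothetical scene-graph layout; all field names are assumptions.
scene_graph = {
    "blocks": {
        "A": {
            "objects": {
                "o1": {"size": "small", "color": "blue", "shape": "circle"},
                "o2": {"size": "big", "color": "black", "shape": "triangle"},
            },
            # Direct relations as (subject id, relation, object id).
            "relations": [("o1", "below", "o2")],
        },
        "B": {
            "objects": {
                "o3": {"size": "medium", "color": "yellow", "shape": "square"},
            },
            "relations": [],
        },
    },
    # Relations between blocks as (block, relation, block).
    "block_relations": [("A", "above", "B")],
}


def describe(graph, block, object_id):
    """Build a textual description of an object from its attributes."""
    attrs = graph["blocks"][block]["objects"][object_id]
    return f'{attrs["size"]} {attrs["color"]} {attrs["shape"]}'


def direct_relations(graph, block):
    """List the relations stated directly between objects of one block."""
    return [
        f"The {describe(graph, block, subj)} is {rel} the {describe(graph, block, obj)}."
        for subj, rel, obj in graph["blocks"][block]["relations"]
    ]


sentences = direct_relations(scene_graph, "A")
print(sentences)
```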

D.2 SpRL Annotation
We also provide spatial annotations for each sentence and question, based on the Spatial Role Labeling (SpRL) annotation scheme (Kordjamshidi et al., 2010) (Fig. 11). This annotation is generated by hand-crafted rules during the main data generation. SpRL is used for recognizing spatial expressions and arguments in a sentence. This annotation is useful for applications that need to detect and reason about spatial expressions and arguments.

E QA Language Models for Spatial Reasoning over Text
Figures 8a and 8b depict the architecture used for further fine-tuning language models on SPARTQA described in section 5. Figure 9 shows an example of the bAbI dataset (Weston et al., 2015) task 17.

F bAbI and boolQ Datasets
To solve task 17 of bAbI, we implement two models: an SpRL+rule-based model and a neural network model. The former extracts spatial relation triplets (Landmark, Spatial-indicator, Trajector) for each fact in a story, then applies spatial rules over these extracted triplets and reports all possible relations between the two asked objects. Finally, it checks whether the asked relation exists among the relations found. This model solves task 17 of bAbI with 100% accuracy.

To implement the neural network approach, we use the huggingface implementation of pre-trained BERT (Devlin et al., 2019). We apply a boolean classifier on the output of the "[CLS]" token from the last layer of the BERT model for each of the "Yes" and "No" answers (the same model as used on the YN question types). We use the AdamW (Loshchilov and Hutter, 2017) optimizer with a learning rate of 2e-6 and a negative log-likelihood loss objective, and train the model on 10k, 5k, 2k, 1k, 500, and 100 portions of bAbI's training questions. The model yields 100% accuracy on 10k and 5k, and 99% accuracy on 2k and 1k training samples.

Figure 10 shows an example of the boolQ dataset. To answer the questions of this dataset, we use the same setting as the neural network model on bAbI to further fine-tune BERT on boolQ.

Table 8: The percentage of each correct label in all samples. *The candidate answers for the FB Q-TYPE can vary based on the story. **CO can be considered as a multiple choice or single choice question. E.g., in "which object is above the triangle? the blue circle or the black circle?" you can consider two labels with boolean classification on each of "blue circle" and "black circle", or consider it as a four-label classification: "blue circle," "black circle," "both of them," and "None of them." ***DK, None, [], all mean none of the actual labels are correct.

Table 9: Modifications on the unseen set (surviving table entries: Little, Midsize, Large).
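The rule-based approach to bAbI task 17 described in this section can be sketched as follows. This is not the authors' rule set: it assumes the triplets have already been extracted (here written as trajector, indicator, landmark), and it swaps in a simple stand-in for the spatial rules by placing objects on a grid with unit offsets. That simplification is adequate for the short two-fact stories of task 17 but is not a sound reasoner for longer stories. All names are hypothetical.

```python
# Unit offsets of the trajector relative to the landmark (assumed encoding).
OFFSETS = {"left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1)}


def place_objects(triplets):
    """Assign grid coordinates so that each (trajector, indicator, landmark) fact holds."""
    coords = {triplets[0][2]: (0, 0)}  # anchor the first landmark at the origin
    changed = True
    while changed:
        changed = False
        for trajector, indicator, landmark in triplets:
            dx, dy = OFFSETS[indicator]
            if landmark in coords and trajector not in coords:
                x, y = coords[landmark]
                coords[trajector] = (x + dx, y + dy)
                changed = True
            elif trajector in coords and landmark not in coords:
                x, y = coords[trajector]
                coords[landmark] = (x - dx, y - dy)
                changed = True
    return coords


def relations_between(coords, first, second):
    """Report every directional relation of `first` with respect to `second`."""
    (ax, ay), (bx, by) = coords[first], coords[second]
    relations = set()
    if ax < bx:
        relations.add("left")
    if ax > bx:
        relations.add("right")
    if ay > by:
        relations.add("above")
    if ay < by:
        relations.add("below")
    return relations


def answer(triplets, first, relation, second):
    """Check whether the asked relation is among the derived relations."""
    coords = place_objects(triplets)
    return "yes" if relation in relations_between(coords, first, second) else "no"


facts = [
    ("triangle", "above", "pink rectangle"),
    ("blue square", "left", "triangle"),
]
print(answer(facts, "pink rectangle", "right", "blue square"))
```

In the example story, the pink rectangle ends up below and to the right of the blue square, so a question about either of those relations is answered "yes" and any other direction "no".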

| Examples | Features |
| Block A is above Block C and B. | Using conjunction to describe relation between more than two blocks. |
| The small circle is above the yellow square and the big black shape. | Using conjunction to describe relationships between more than two objects. |
| The yellow square number one is to the right of and above the blue circle. | Using conjunction for more than one relation. |
| Block B has two medium yellow squares and two blue circles. | Describing a group of objects with the same properties. In the next sentences, they are mentioned by an assigned number. For example, the blue circle number two. |
| The blue circle is below the object which is to the right of the big square. | Using nested relations between objects in their description. |
| A small blue circle is near to the big circle. It is to the left of the medium yellow square. | Using coreferences for an entity described in the previous sentences. |
| There is a block named A. One small yellow square is touching the bottom edge of this block. | The verb matches the number of the subject. |
| What is the relation between black object and a big circle? | Using shape, object, and thing, which are a general description of an object. It could be the "black triangle" or the "black circle" mentioned in the story. |