Teaching Machine Comprehension with Compositional Explanations

Advances in machine reading comprehension (MRC) rely heavily on the collection of large-scale human-annotated examples in the form of (question, paragraph, answer) triples. In contrast, humans are typically able to generalize with only a few examples, relying on deeper underlying world knowledge, linguistic sophistication, and/or simply superior deductive powers. In this paper, we focus on “teaching” machines reading comprehension, using a small number of semi-structured explanations that explicitly inform machines why answer spans are correct. We extract structured variables and rules from explanations and compose neural module teachers that annotate instances for training downstream MRC models. We use learnable neural modules and soft logic to handle linguistic variation and overcome sparse coverage; the modules are jointly optimized with the MRC model to improve final performance. On the SQuAD dataset, our proposed method achieves 70.14% F1 score with supervision from 26 explanations, comparable to plain supervised learning using 1,100 labeled instances, yielding a 12x speedup.


Introduction
Recent advances in neural sequence learning and pre-trained language models yield strong (human-level) performance on several reading comprehension datasets (Lan et al., 2020; Raffel et al., 2019). However, state-of-the-art results mainly rely on large-scale annotated corpora, which are often time-consuming and costly to collect (Rajpurkar et al., 2016). This often leads to a large gap between methods in research settings and practical use cases, as large amounts of annotated data rarely exist for a new task or a low-resource domain (Linzen, 2020). To reduce this dependency on annotation efforts, we seek to improve the efficiency of obtaining and applying human supervision.

Table 1: Key elements in the proposed work. Semi-structured explanations characterize why an answer is correct and summarize the human's deductive process. Strictly and softly matched instances are automatically generated from explanations and provide supervision for training MRC models.
One strength of human cognition is the ability to generalize from relatively few examples; shown only a few instances of a problem and solution, humans often deduce patterns more readily than a machine, typically by bringing to bear a wealth of background information about what "really matters" in each example (DeJong and Mooney, 1986; Goldwasser and Roth, 2014; Lake et al., 2019). This ability to quickly abstract "deduction rules" is the inspiration for this work, and we aim to gather these rules in the form of semi-structured explanations.
In this paper, we focus on the extractive machine reading comprehension (MRC) task, where the system is given a query and is asked to identify an answer span from a particular paragraph. Previous work soliciting explanations as part of the annotation process has been limited to classification tasks (Hancock et al., 2018). However, MRC is more challenging, since it lacks explicit anchor words (e.g., subject and object in relation extraction), has no pre-defined set of labels, and offers sparser coverage for each explanation.

Q: What is the atomic number for Zinc?
C: … Zinc is a chemical element with symbol Zn and atomic number 30. …
A: 30
Explanation: X is atomic number. Y is Zinc. The question contains "number" so the answer should be a number. The answer is directly after X. "for" is directly before Y and directly after X in the question.

Figure 1: Overview of the proposed work. We first collect a small set of semi-structured explanations, from which we extract key information such as variables and rules. These structured results are formulated into programs called neural module teachers (NMTeachers), which we use to curate supervision for training machine reading comprehension models.
To tackle these challenges, we propose the concept of a Neural Module Teacher (NMTeacher), an executable program constructed from human-provided, semi-structured explanations that is (1) dynamically composed of modules based on the explanation; (2) capable of taking sequential steps and combinatorial search; and (3) capable of fuzzy matching using softened constraints. Fig. 1 shows an overview of our approach. We first use a Combinatory Categorial Grammar parser (Zettlemoyer and Collins, 2005) to turn explanations into structured variables and rules (Sec. 3.2). A neural module teacher is constructed from basic learnable modules (Sec. 3.1) based on the parsing results and functions as a weak model for the specific type of question described in the explanation (Sec. 3.3). All neural module teachers act together and identify strictly- and softly-matched instances from an unlabeled corpus, which are used to train a downstream "student" MRC model (Sec. 4.2). It is important to note that while this work is tied to the particular task of MRC, we believe it can be extended to a wide range of NLP tasks.
We evaluated our approach on two datasets in the MRC setting: SQuAD v1.1 (Rajpurkar et al., 2016) and Natural Questions (Kwiatkowski et al., 2019). Experimental results show the efficiency of the proposed approach in extremely low-resource scenarios. Using 26 explanations gathered in 65 minutes, NMTeacher achieves 56.74% exact match and 70.14% F1 score on the SQuAD dataset, compared to 9.71% EM and 16.37% F1 for traditional answer-only annotation given the same amount of annotation time. Moreover, our analysis shows that explanations continue to improve model performance when a medium-sized annotated dataset is readily available.

Problem Formulation
Our goal is to efficiently train an extractive MRC model F, which takes as input a tuple (q, c) of question q and context c, and extracts an answer span a within the context c. We assume a low-resource situation where a large set S of (q, c) pairs (without answer annotation) already exists, but we are limited in time and can annotate only a small subset S_o (< 200 instances) of S.

Overview and Notations.
We provide an overview of our proposed method in Fig. 1. First, we collect an answer a_i and an explanation e_i for each (q_i, c_i) instance in S_o, resulting in an updated S_o = {(q, c, a, e)}. A neural module teacher G_i will be constructed from each explanation e_i, enabling it to answer questions similar to (q_i, c_i). All neural module teachers acting together can be viewed as an ensemble teacher G. We then apply G to unannotated (q, c) pairs in S, getting S_a = {(q, c, a)}, a strictly-labeled dataset that G can directly answer. The remaining unmatched instances are denoted as S_u = {(q, c)}. After softening the constraints in each G_i, we get a noisily-labeled dataset S_p = {(q, c, a, z)} from S_u, where z is a confidence score given by G. Notably, we will refer to the (q_i, c_i, a_i) part in (q_i, c_i, a_i, e_i) ∈ S_o as the "reference instance" for explanation e_i, since we will frequently check (q_i, c_i, a_i) "for reference" when we apply G_i to new, unseen instances.
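To make the notation concrete, the following minimal sketch shows one way the four data splits could be represented; the Instance class and its field names are illustrative assumptions for this paper's setting, not the authors' implementation.

```python
# Illustrative data structures for the notation above (S_o, S_a, S_p, S_u);
# names are assumptions made for this sketch.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Instance:
    question: str                       # q
    context: str                        # c
    answer: Optional[str] = None        # a: None for unlabeled (q, c) pairs in S_u
    explanation: Optional[str] = None   # e: only for the small annotated subset S_o
    confidence: Optional[float] = None  # z: only for softly-matched instances in S_p

S_o: List[Instance] = []  # (q, c, a, e) -- small, human-annotated, with explanations
S_a: List[Instance] = []  # (q, c, a)    -- strictly matched by the ensemble teacher G
S_p: List[Instance] = []  # (q, c, a, z) -- softly matched, with confidence scores
S_u: List[Instance] = []  # (q, c)       -- remaining unmatched, unlabeled pairs
```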
S_a and S_p are significantly larger than S_o and thus provide more abundant supervision. We use S_a and S_p to train a downstream MRC model F. We denote this method as NMTeacher-DA. We further explore several variants, such as (1) leveraging S_u with semi-supervised methods; and (2) joint training of G and F. We construct our final model NMTeacher-Joint by incorporating these variants. Note that our approach is model-agnostic, so F can take any trainable form.

Neural Module Teacher
A neural module teacher (NMTeacher) acts as a program that tries to answer questions following an explanation. In this section, we introduce the basic modules used for rule execution (Sec. 3.1), discuss how variables and rules are obtained from explanations (Sec. 3.2), and present how a neural module teacher derives answers (Sec. 3.3).

Atomic Modules
We define four types of atomic modules that can be composed to create neural module teachers: FILL, FIND, COMPARE and LOGIC. Each can support strict and softened matching criteria as a part of generating training instances for downstream use. We summarize their usage in Table 2 and introduce them in detail in the following.
FILL. When humans encounter a new question, they can detect structural similarities to previous questions. For example, humans will note that "How is hunting regulated?" is structurally similar to "How is packet switching characterized?", enabling them to infer that answers to both might have a similar structure (e.g., by doing sth...). To mimic this human intuition, we design a FILL module: given a sentence s_ref and a span of interest p_ref, FILL predicts analogous spans p in a new sentence s.
The strict version of FILL outputs spans p whose named entity type, dependency parse structure, or constituent parse structure matches that of p_ref. We encourage over-generation, since the rule execution step later on will verify each candidate. When strict matching produces nothing, we employ softened matching techniques. Here, we first produce a contextualized phrase representation e' for p_ref. We rank each candidate constituent p in sentence s according to the similarity between e' and an analogous phrase representation e for p. We return the top k such constituents along with their scores.
To generate phrase representations, we first encode the sentence with the BERT-base model (Devlin et al., 2019) and obtain representations [h_1, h_2, ..., h_m] for each token. We then apply pooling over all tokens in span p to get the phrase representation e. We considered both mean pooling and attentive pooling (Bahdanau et al., 2014). The similarity score between e and e' can be calculated using either cosine similarity or a learned bilinear similarity, i.e., Sim(e, e') = tanh(eAe' + b), where A is a learnable matrix. We discuss pre-training and design choices for the softened FILL module in Sec. 4.1.
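As a concrete illustration, the sketch below implements the mean-pooling plus cosine-similarity variant of softened FILL scoring, assuming candidate spans (e.g., constituents from a parser) are given as character offsets; the function names and span format are our own, not the paper's code.

```python
# A minimal sketch of softened FILL scoring (mean pooling + cosine similarity).
# Candidate spans are assumed to be (start, end) character offsets; names are
# illustrative, not the authors' implementation.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def phrase_embedding(sentence, span):
    """Mean-pool BERT token representations over the character span (start, end)."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]            # (num_tokens, 2) char offsets
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (num_tokens, hidden_dim)
    # keep tokens inside the span of interest (excludes zero-width special tokens)
    mask = (offsets[:, 0] >= span[0]) & (offsets[:, 1] <= span[1]) & (offsets[:, 1] > offsets[:, 0])
    return hidden[mask].mean(dim=0)

def fill_candidates(s_ref, p_ref, s, candidate_spans, top_k=5):
    """Rank candidate spans in sentence s by similarity to the reference span p_ref."""
    e_ref = phrase_embedding(s_ref, p_ref)
    scored = [(span, torch.cosine_similarity(phrase_embedding(s, span), e_ref, dim=0).item())
              for span in candidate_spans]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
```

The bilinear variant would replace the cosine step with tanh(eAe' + b) for a learnable matrix A, trained on the matched pairs described in Sec. 4.1.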
FIND. The FILL module finds a span p that plays the same role as p_ref in its containing sentence. In contrast, FIND looks for a span p that has the same meaning as p_ref. For instance, if a query mentions the explosion, we might want to identify exploded as its counterpart in the paragraph being searched for an answer. This module is similar in motivation to the find module in Jiang and Bansal (2019), but we design ours as a ranking-based module with discrete boundaries, so that its output fits into the search procedure in Sec. 3.3.
The strict version of the FIND module directly looks for exact matches of the key p_ref. To account for synonyms, co-reference, and morphological/spelling variation, we also build a softened version using the same model structure as the FILL module. We discuss the design choices and training for the softened FIND module in Sec. 4.1.
COMPARE. In our annotation guidelines, we encourage annotators to describe the relative location of spans in their explanations, e.g., X is within 3 words after Y. The COMPARE module executes such distance comparisons. The strict version requires the condition to be met exactly: P(d_1 ≤ d_0) = 1 when d_1 ≤ d_0, and P(d_1 ≤ d_0) = 0 otherwise. The softened version instead indicates how close d_1 ≤ d_0 is to being true:

P(d_1 ≤ d_0) = min(1, max(0, 1 − (d_1 − d_0)/4)).   (1)

As an example, P(1 ≤ 0) = 0.75 (a near miss) but P(5 ≤ 0) = 0 (due to the max in Eq. (1)).
LOGIC. The logic operations "and" and "or" often appear in explanations. A single explanation may also contain multiple sentences, requiring a logical AND to aggregate them. In the strict version of LOGIC, only boolean outputs of True (1) and False (0) are allowed. In the softened version, we use soft logic to aggregate two probabilities, i.e., AND(p_1, p_2) = max(p_1 + p_2 − 1, 0) and OR(p_1, p_2) = min(p_1 + p_2, 1).
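To illustrate how the softened COMPARE and LOGIC modules combine, here is a minimal sketch; the linear decay slope of 1/4 follows the reconstructed Eq. (1) above and should be read as an assumption rather than the authors' exact constant.

```python
# A minimal sketch of softened COMPARE and LOGIC.

def soft_leq(d1, d0, slope=0.25):
    """Soft truth value of the constraint d1 <= d0 (softened COMPARE)."""
    return min(1.0, max(0.0, 1.0 - slope * (d1 - d0)))

def soft_and(p1, p2):
    """Lukasiewicz conjunction used to aggregate rule scores (softened LOGIC)."""
    return max(p1 + p2 - 1.0, 0.0)

def soft_or(p1, p2):
    """Lukasiewicz disjunction (softened LOGIC)."""
    return min(p1 + p2, 1.0)

# Example: "X is within 3 words after Y" nearly holds (distance 4), and a second
# rule holds with score 0.9; the conjoined score degrades gracefully.
print(soft_and(soft_leq(4, 3), soft_or(0.9, 0.0)))   # 0.65
```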

Parsing Explanations to Executable Rules
When soliciting explanations, we encourage annotators to think of each explanation as a collection of variables and rules. This framing allows us to effectively transform these explanations into executable forms. We formally define the terms here.

Variables are phrases that may be substituted in a question or answer when generalizing to unseen instances. In Table 1, underlined and colored phrases are all considered variables. Annotators are guided to mark these spans explicitly, e.g., X is funeral. Y is held. X is within 5 words of Y. Variables are closely related to the design of the FILL module, since FILL aims to propose potential assignments to these variables when it is given unseen instances.
Rules are statements that describe the characteristics of variables and the relationships between them. When all variables in a rule are assigned, execution of the rule outputs either True or False (strict) or a score between 0 and 1 (softened). Following previous work (Srivastava et al., 2017), we first use a Combinatory Categorial Grammar (CCG) based semantic parser P (Zettlemoyer and Collins, 2005) to transform explanations into logical forms (e.g., from e to p_j in Table 3). We build a domain-specific lexicon for common expressions used in explanations. We then implement the operation for each supported predicate (e.g., "@Is", "@Direct", "@Left"), which may internally call the atomic modules described in Sec 3.1. These predicate implementations, together with the inherent λ-calculus hierarchy from CCG, yield the final executable function f_j as shown in Table 3.
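For example, a rule such as "'since' is directly before the answer" (from the example in Appendix A) might ground into the atomic modules roughly as follows; the token-index span format, helper names, and worked example are illustrative assumptions, not the authors' lexicon or code.

```python
# A hypothetical sketch of one executable rule f_j: "'since' is directly before
# the answer", composed from FIND (locate the key) and COMPARE (test the distance).
# Spans are (start, end) token indices; names are assumptions for illustration.

def soft_leq(d1, d0, slope=0.25):          # softened COMPARE, as in the earlier sketch
    return min(1.0, max(0.0, 1.0 - slope * (d1 - d0)))

def rule_since_before_answer(context_tokens, answer_span, soften=False):
    # FIND (strict): exact-match occurrences of the key "since" in the context
    positions = [i for i, tok in enumerate(context_tokens) if tok.lower() == "since"]
    if not positions:
        return 0.0
    # "directly before the answer" means the key sits at answer_start - 1
    best = min(abs(answer_span[0] - 1 - p) for p in positions)
    if not soften:
        return 1.0 if best == 0 else 0.0   # strict COMPARE: boolean output
    return soft_leq(best, 0)               # softened COMPARE: graded score

tokens = "The Film Fest New Haven has been held annually since 1995 .".split()
print(rule_since_before_answer(tokens, (10, 11)))  # "1995" starts at token 10 -> 1.0
```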

Extracting Answer Spans
Rules introduced in Sec 3.2 can be executed to verify whether variable assignments are correct. In other words, given a (q, c, a) triple, executing all rules will give a boolean value (strict) or a confidence score (softened) indicating the triple's correctness. To actively output an answer, we need to re-formulate the problem so that each neural module teacher G_i takes (q, c) as input and gives an answer span a and confidence score z as output.
To apply explanation e_i to a new question, candidates for each variable are first proposed by the FILL module. We then look for the combination of variable assignments that achieves the highest confidence when evaluated using the rules generated from e_i. As a minimal example, if FILL proposes {x_1, x_2} as potential assignments to variable X, and {a_1, a_2} to ANS, we evaluate the four possible combinations {(x_1, a_1), (x_2, a_1), (x_1, a_2), (x_2, a_2)} by applying e_i and select the one achieving the highest confidence score. As the number of combinations grows rapidly with the number of variables and their candidates, we solve this problem with beam search, progressively filling each variable and keeping the most promising combinations at each step (see Figure 6 and Algorithm 2 in the Appendix for more details). By doing so, we complete our construction of a neural module teacher G_i from one semi-structured explanation e_i. We use G_i(q, c) = (a, z) to denote that given question q and context c, neural module teacher G_i identifies the answer span a with a confidence score z. Multiple neural module teachers G_i are ensembled into G by collecting the answer spans output by each G_i and selecting the one with the highest z.
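A minimal sketch of this search procedure is shown below; propose_candidates (backed by FILL) and score_assignment (executing the rules of e_i) are assumed callables, and the interface is ours rather than a verbatim rendering of Algorithm 2.

```python
# A minimal sketch of beam search over variable assignments (cf. Algorithm 2).
# propose_candidates(var, partial) returns candidate spans from FILL;
# score_assignment(assignment) executes the rules of e_i and returns a confidence.

def beam_search_assignments(variables, propose_candidates, score_assignment,
                            beam_width=5, threshold=0.0):
    beams = [({}, 1.0)]                                  # (partial assignment, score)
    for var in variables:                                # fill one variable per step
        next_beams = []
        for assignment, _ in beams:
            for candidate in propose_candidates(var, assignment):
                new_assignment = {**assignment, var: candidate}
                z = score_assignment(new_assignment)     # confidence from rule execution
                if z > threshold:
                    next_beams.append((new_assignment, z))
        beams = sorted(next_beams, key=lambda x: -x[1])[:beam_width]
        if not beams:                                    # no candidate survives
            return None, 0.0
    return beams[0]                                      # best full assignment and its z
```

In this sketch, the span assigned to the ANS variable in the best surviving assignment becomes the predicted answer a, and its score becomes the confidence z.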

Pre-training the FILL and FIND Modules
The softened FILL module is pre-trained with pairs of (positive) matches (q_ref, s_ref, q, s) drawn from the strictly-matching results S_a, comprising 99,153 questions and 55,202 contexts, split 70%/10%/20% into train, dev, and test sets. We use random constituents in the sentence as negative training examples. For the FILL module, we evaluated various model designs; evaluation results are reported in Appendix B.
The softened FIND module assesses the semantic similarity of phrases. We tried various datasets as proxies for pre-training this ability, including co-reference resolution results on the SQuAD corpus (produced by Stanford CoreNLP; Manning et al., 2014) and a paraphrase dataset (PPDB; Pavlick et al., 2015). We manually evaluated FIND module performance on S_o and observed that using mean pooling and cosine similarity without any pre-training yields the best performance. We conjecture this may be caused by data bias (the training data not aligning with the purpose of the module). Therefore, we use untrained BERT-base as our FIND module to capture semantic similarities. Manual evaluation results are provided in Appendix B.

Training the MRC Model F
Our learning framework (Algorithm 1) uses our ensemble neural module teacher G to answer each (q, c) instance in S, resulting in three splits of data instances: a strictly-matched set S_a, a softly-matched set S_p, and an unlabeled set S_u. We use these three sets to jointly learn our downstream MRC model and NMTeacher, as described below.
Learning from Strictly-matched Data S_a. We start by treating S_a as a labeled dataset and first train the downstream MRC model F with traditional supervised learning. We compare different MRC models in our experiments. For simplicity, we denote MRC_Loss(B^(i)) as the loss term defined in these MRC models for the i-th instance in batch B. In each step, we sample a batch B_a from S_a and update the model with the loss term

L(B_a) = (1/|B_a|) Σ_i MRC_Loss(B_a^(i)).   (2)

Learning from Softly-matched Data S_p. The softly-matched set S_p is significantly larger than S_a and may contain useful information for training F. We blend in supervision from S_p by adding a weighted loss term to the original loss L(B_a). That is, we simultaneously sample a batch B_a from S_a and a batch B_p from S_p. The loss term for B_p is weighted by instance weights w_i, obtained by normalizing the confidence scores z from NMTeacher G within the batch,

w_i = exp(z_i / θ_t) / Σ_j exp(z_j / θ_t),   L(B_p) = Σ_i w_i · MRC_Loss(B_p^(i)),   (3)

where θ_t in Eq. 3 is a temperature that controls the normalization intensity. We then aggregate the loss terms from S_p and S_a with coefficient β, i.e., L_ap = L(B_a) + βL(B_p). We denote the method up to this step as NMTeacher-DA.
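As a minimal sketch, the two batch losses can be combined as follows, assuming the softmax-with-temperature normalization reconstructed in Eq. (3); the function names are ours, and per_instance_loss stands for whichever MRC loss the downstream model defines.

```python
# A minimal sketch of the confidence-weighted loss on a softly-matched batch B_p
# and the combined objective L_ap; the softmax normalization follows the
# reconstruction above and is an assumption, not the authors' exact code.
import torch.nn.functional as F

def softly_matched_loss(per_instance_loss, confidence, temperature=1.0):
    """per_instance_loss: (batch,) MRC losses; confidence: (batch,) scores z from G."""
    weights = F.softmax(confidence / temperature, dim=0)   # w_i, Eq. (3)
    return (weights * per_instance_loss).sum()

def l_ap(loss_batch_a, loss_batch_p, beta=0.3):
    """L_ap = L(B_a) + beta * L(B_p)."""
    return loss_batch_a + beta * loss_batch_p
```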
Learning from Unlabeled Data S_u. We further learn from the unlabeled data in S_u by integrating existing semi-supervised methods. In brief, pseudo labeling (PL) samples a batch B_u from S_u, annotates it with the current MRC model F, and calculates the loss term on this pseudo-labeled batch B_u. The overall loss term thus becomes L_au = L(B_a) + βL(B_u). To mix in supervision from unlabeled data, we formulate an (r + 1)-step rotation between sampling an unlabeled batch B_u and a softly-matched batch B_p; we update the MRC model F for r steps using the semi-supervised method and loss term L_au, and then update the model for one step using softly-matched data and the loss term L_ap.
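The (r + 1)-step rotation can be sketched as a simple scheduling loop; the samplers and update functions below are placeholders standing in for the batch sampling and optimizer steps described above.

```python
# A minimal sketch of the (r + 1)-step rotation between semi-supervised updates
# (loss L_au on B_a and B_u) and softly-matched updates (loss L_ap on B_a and B_p).
# All samplers and update functions are assumed placeholders.

def rotation_training(num_steps, r, sample_a, sample_p, sample_u, update_au, update_ap):
    for step in range(num_steps):
        batch_a = sample_a()
        if step % (r + 1) < r:
            update_au(batch_a, sample_u())  # r steps: L_au = L(B_a) + beta * L(B_u)
        else:
            update_ap(batch_a, sample_p())  # 1 step:  L_ap = L(B_a) + beta * L(B_p)
```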
Joint Training. The instance weight w_i (Eq. 3) for each softly-labeled instance in batch B_p is calculated with NMTeacher G, so we further allow gradients to backpropagate into the trainable FILL and FIND modules in G when optimizing the loss term L_ap. We fix G at first and enable joint training after the training of F converges. This helps form a consensus between NMTeacher G and the learned downstream MRC model F, which we believe helps denoise and refine the final MRC model. We denote this final method as NMTeacher-Joint.

Experiment Setup

Datasets. We experiment on SQuAD v1.1 (Rajpurkar et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019). To keep consistent with our settings, we assume "the long answer is given, and a short answer is known to exist" and preprocess NQ into the same format as SQuAD. We discard instances whose (1) long answer is not free-form text (e.g., table, list); or (2) short answer contains multiple short spans.
Evaluation. Use of the official SQuAD and NQ test sets is restricted, so we construct our own dev and test sets by splitting the official dev sets in half. Hyper-parameters and the best checkpoint are selected on the dev set. We use the official SQuAD evaluation script and report Exact Match (EM) and F1 score on both the dev set (in the Appendix) and the test set (in Sec 5.2). Note that this differs from the long-/short-answer metrics used in the official NQ evaluation. We report the 3-run average and standard deviation using 3 different random seeds.
MRC Models. Importantly, our approach is model-agnostic. We test our framework using three MRC models F: BERT-base (BERT-b), BERT-large (BERT-l), and ALBERT-base (ALBERT-b).

Semi-supervised Methods. We compare and enhance NMTeacher with the following semi-supervised methods: (1) Self Training (ST) (Rosenberg et al., 2005), which iteratively annotates the unlabeled instances with maximal confidence in each epoch; (2) Pseudo Labeling (PL) (Lee, 2013), which first trains a weak model on labeled data and then annotates unlabeled batches as supervision; and (3) Mean Teacher (MT) (Tarvainen and Valpola, 2017), which introduces a consistency loss between a student model and a teacher model (the exponential moving average of student models from previous steps).
Explanation Collection. On average, annotators take 151 seconds to annotate both an explanation and an answer, roughly 3.5x longer than annotating an answer alone.

Table 6: Performance comparison on NQ using 18/36/54 explanations. Similar trends as in Table 4 can be observed.

Performance Comparison
Main Results. Tables 4 and 6 show results for different MRC models, with different numbers of explanations used. The baseline for each model is trained on the strictly-matched instances (S_a) generated from the explanations. For all models, performance then improves when we include the softly-matched instances (S_p). Fig. 2 shows that this pattern largely continues even as we further increase the number of explanations: noisy labels are of highest value in low-resource settings but still continue to provide value as training sizes increase. In most cases, performance improves further when trained with semi-supervised learning and S_u. Finally, performance is best when we make full use of S_a, S_p, and S_u, and jointly train F and G (NMTeacher-Joint).
Efficiency Study. We demonstrate NMTeacher's efficiency by controlling annotation time. Given a fixed amount of time t, we denote S

Performance Analysis
Matching Noise/Bias. Our proposed method hypothesizes new training examples, which may be noisy even when "strictly matched". The matched instances may also be more similar than desired to the reference instances. To assess the impact of these two factors, we look at the strictly-matched set S_a and the softly-matched set S_p generated with 52 SQuAD explanations. We define S_a* and S_p*, versions of these sets with human-annotated answers (i.e., no noise). We then train an ALBERT-b model with supervision in the following six settings: (1) S_a; (2) S_a and S_p; (3) S_a*; (4) S_a* and S_p*; (5) S_r, a set of randomly sampled SQuAD training instances of size |S_a|; (6) S_r of size |S_a| + |S_p|. Results are listed in Table 7. Comparing (1) and (3), we observe a 6.16% F1 gap caused by noise in strict matching; comparing (2) and (4), the gap widens further, since there is more noise in softly-matched data. Comparing (3) and (5), we see a 7.15% F1 gap mainly caused by bias in the instances matched by NMTeachers. We believe addressing these two issues will improve model performance, and we leave this as future work.

Figure 3: Study on Annotation Efficiency. We compare model performance when annotation time is held constant; NMTeacher-Joint consistently outperforms the baseline without explanations (e.g., 70.14% vs. 16.37% F1 score with 65 minutes of annotation). BERT-l is used as the MRC model.
Medium and High Resource Scenarios. Going beyond low-resource scenarios, we examine NMTeacher's capability in medium- and high-resource scenarios. Similar to the few-shot evaluation in Lewis et al. (2019), we randomly sample different numbers of human-annotated instances from SQuAD as S_r. The size of S_r ranges from 100 to 80k. We train a BERT-l MRC model using S_r along with the S_a and S_p generated from 52 explanations, and compare with training the MRC model using S_r only. Fig. 4 shows that when a medium-sized S_r is readily available (|S_r| < 5k), augmenting it with NMTeacher is still beneficial. In practice, this could be particularly useful when a defect is observed in the trained model (e.g., a certain type of question is answered poorly): a small set of explanations could be collected rapidly and used by NMTeacher to remedy the defect. The benefits brought by NMTeacher become marginal when the labeled dataset grows larger (|S_r| > 10k).
Ablation Study on Modules. To evaluate the effect of softened module execution, we progressively turn on the softened versions of FIND, FILL, and COMPARE in the NMTeacher matching process, use the matched data to train the downstream MRC model F in the NMTeacher-DA setting, and report the final performance. The evaluation results are presented in Fig. 5. Results show that softening each module contributes to the performance improvement.
Additional Analysis. We refer readers to Appendix B for additional matching quality analysis, and manual evaluation of trainable modules.

Discussion
Assumptions on Unlabeled Data. In Sec. 2 we assumed that a large set S of (q, c) pairs (without answer annotation) is readily available. We acknowledge that annotators for the SQuAD dataset are shown only the context c and then asked to provide (q, a) pairs, so (q, c) pairs are not free. However, the curation of Natural Questions starts with users' information-seeking questions and relies on information retrieval to obtain (q, c) pairs; in this case our method has practical value. We consider SQuAD a testbed for our approach, while NQ fits the assumptions better.
Design efforts and time cost. Our approach highlights efficiency during annotation, while the effort spent on design is not taken into account. We agree this effort is non-trivial, yet it is hard to quantify. We are optimistic about efficiency since this effort is amortized when our approach is reused or extended to other datasets or tasks. In our study, we started by building lexicons and modules for SQuAD, and no additional effort was needed when we adapted to NQ, which demonstrates flexibility across datasets. To extend our work to new tasks, some components of our study may be reused, and we hope users can learn from our experience to expedite their customization.
Results with 36/54 explanations on NQ. On the NQ dataset (Table 6), using 36 and 54 explanations both achieve about 41% F1 score. We conjecture this is partly due to the random subsampling of explanations from a larger pool, since (1) each explanation has different representational power and generalization ability; and (2) subsampled explanations may describe similar things and lack diversity. Our discussion of matching quality/bias (Sec. 5.3) may also account for this. Ensuring diversity during explanation collection and enforcing instance weighting during training may help alleviate these issues; we leave this as future work.

Related Work
Learning with Explanations. Srivastava et al. (2017) first proposed to use explanations as feature functions in concept learning. Hancock et al. (2018) proposed BABBLELABBLE for training classifiers with explanations in a data programming setting, using explanations to provide labels instead of features. Later work proposed NExT to improve the generalization of explanations with softened rule execution. Both BABBLELABBLE and NExT highlight annotation efficiency in low-resource settings. To the best of our knowledge, we are the first to study soliciting explanations for MRC, which is intrinsically more challenging than the classification tasks in existing work. Concurrent with our work, Lamm et al. (2020) proposed QED, a linguistically-grounded framework for QA explanations, which decomposes the process of answering a question into discrete steps associated with linguistic phenomena. Related to our work, Dua et al. (2020) collect context spans that "should be aggregated to answer a question" and use these annotations as auxiliary supervision.
Learning from Unlabeled Data. A notable line of work enforces consistency on unlabeled data by regularizing model predictions to be invariant to noise-augmented inputs (Xie et al., 2019; Yu et al., 2018). Consistency can also be enforced through temporal ensembling (Laine and Aila, 2017; Tarvainen and Valpola, 2017). Another line of work uses bootstrapping: first training a weak model with labeled data, then using model predictions on unlabeled data as supervision (Carlson et al., 2009; Yang et al., 2018a). Our proposed method does not conflict with these semi-supervised strategies, and we enhance NMTeacher with them to achieve the best performance.
Neural Module Networks. Neural module networks (NMNs) are dynamically composed of individual modules with different capabilities. They were first proposed for VQA tasks (Andreas et al., 2016b,a; Hu et al., 2017). More recently, reading comprehension datasets that require reasoning (Yang et al., 2018b; Dua et al., 2019; Amini et al., 2019) have been proposed and widely studied in the NLP community. Recent works (Jiang and Bansal, 2019; Gupta et al., 2020) generally adopt a parser that produces a sequence of operations to derive the final answer. Our work differs in that (1) operations are constructed from explanations instead of questions; and (2) NMTeacher provides supervision, instead of serving as the final MRC model trained in a fully-supervised manner. We limit our scope to SQuAD-style MRC tasks in this paper and leave other challenging tasks as future work.

Efficiency in NLP. Efficiency can be viewed from several perspectives, including (2) Learning Efficiency: a model learns quickly with minimal supervision (Radford et al., 2019; Chan et al., 2019); and (3) Annotation Efficiency: creating a dataset efficiently within a time limit or budget; our work falls into this last category. We believe these perspectives are non-conflicting with each other. It would be interesting to see whether and how methods from these perspectives can be integrated, and we leave this as future work.

Conclusion
In this paper, we propose to teach extractive MRC with explanations, with a focus on annotation efficiency. We believe explanations that state "why" and justify the "deduction process" open up a new way to communicate humans' generalization abilities to MRC model training. We begin with a small set of semi-structured explanations and compose NMTeachers to augment the training data. NMTeachers are modularized functions in which each module has a strict and a softened form, enabling broader coverage from each explanation. Extensive experiments on different datasets and MRC models demonstrate the efficiency of our system. Having achieved encouraging results for MRC, we look forward to extending this framework to tasks such as non-fact-based QA and multi-hop reasoning.

A Case Study
Strictly-matched instances. Table 8 shows two examples of strictly-matched instances. In the first example, the explanation specifies how to answer questions similar to "In what year did X (sth.) begin". Intuitively, the answer should be a year number right after "since", and the entity before "begin" should be a keyword. In the second example, questions following the pattern "when was X (sth.) Y (done)" are explained, and the answer is typically a date after "on". Also, the verb "done" should be directly before "on" and the answer.
Softly-matched instances. Table 9 shows two examples of softly-matched instances. In the first example, the distance between Y and Z in the question is three, while the explanation specifies that Z should be within two words of Y. With the COMPARE module, the correct answer is found with a high confidence of 97.22%. In the second example, the explanation specifies Y to be an adjective phrase. With the FILL module, a verb in the past tense, "purified", is also listed as a potential fit for variable Y, and this gives the correct answer "a secret lake" with a confidence score of 72.48%.

Table 8: Examples of strictly-matched instances.

Example 1.
Reference Instance
Q: In what year did Film Fest New Haven begin?
C: ... The Film Fest New Haven has been held annually since 1995. ...
A: 1995
Semi-structured Explanation: X is "Film Fest New Haven". The question starts with "In what year", so the answer should be a year. "begin" is in the question. X is directly after "did" and directly before "begin" in the question. "since" is directly before the answer.

Example 2.
Semi-structured Explanation: X is "funeral". Y is "held". In the question X is within 4 words after "when was" and Y is directly after X. "on" is directly before the answer. Y is within 2 words before the answer. X is within 3 words left of Y. The question starts with "when", so the answer should be a date.

Table 9: Examples of softly-matched instances.

Example 1.
Semi-structured Explanation: ... Y is "... against". Z is "1343". In the question, Y is directly after X and Z is within 2 words after Y. Z is a year. The answer directly follows Y. X is within 3 words before Y.
Softly-matched Instance
Q: The Slavs appeared on whose borders around the 6th century?
C: ... Around the 6th century, Slavs appeared on Byzantine borders in great numbers. ...
A: Byzantine borders (confidence z = 97.22%)
Note: Z ("the 6th century") is 3 words after Y ("appeared on") in the question, which slightly breaks the constraint "Z is within 2 words after Y". This is captured by the COMPARE module.

Example 2.
Reference Instance
Q: Where is hydrogen highly soluble?
C: ... Hydrogen is highly soluble in many rare earth and transition metals and is soluble in both nanocrystalline and amorphous metals. ...
A: many rare earth and transition metals
Semi-structured Explanation: X is "hydrogen". Y is "highly soluble". Y is directly after X and X is directly after "where is" in the question. X is within 5 words before Y. Y is within 2 words before the answer. "in" is directly before the answer. "is" is between X and Y.
Softly-matched Instance
Q: Where is the divinity herself purified?
C: ... Afterwards the car, the vestments, and, if you like to believe it, the divinity herself, are purified in a secret lake. ...
A: a secret lake (confidence z = 72.48%)
Note: In the reference instance, Y ("highly soluble") is expected to be an adjective phrase. In the new instance, the FILL module suggested "purified" as a promising candidate for variable Y.

B Additional Performance Analysis
Performance of FILL and FIND modules. The FILL module is evaluated on the test split of strictly-matched question pairs and context pairs, as described in Sec. 4.1. The FIND module is evaluated through manual inspection of the model's predictions on instances in S_o. For each sentence in the test set, we enumerate all possible constituents and let the model rank these spans. We take the top-n (n = 1, 3, 5, 10 for the FILL module and n = 1 for the FIND module) spans as output. We use recall at n, r_n = p/q, as the evaluation metric, where p is the number of correct spans found in the top-n outputs and q is the number of all correct spans. Evaluation results for the FILL and FIND modules are shown in Table 10. As n grows, the top-n outputs from the FILL module identify most of the correct spans.

Matching quality and bias. The explanations we collect cover a wide range of question types. Yet, the distribution of input data has far more aspects than question heads. Our current implementation and design may not explain complex questions that require multi-step reasoning abilities, and this may result in strong biases in S_a and S_p.
To examine labeling accuracy, we directly evaluate the annotations obtained with the neural module teachers G against human annotations. On SQuAD with 52 explanations, 72.19% EM and 83.35% F1 are achieved on the 766 strictly-matched instances in S_a. Noise in annotations generated by the neural module teachers G also degrades the performance of the final model F, and thus denoising matched instances would help improve performance. Joint training may partially resolve this by encouraging consensus between G and the MRC model F; meanwhile, we encourage future research in this direction.

C Beam Search Algorithm for Neural Module Teacher
In Sec. 3.3 we mentioned the use of a beam search algorithm to find the best combination of variable assignments. We provide the details in Algorithm 2.

D Reproducibility
Computing Infrastructure. Based on GPU availability, we train our models on either a Quadro RTX 6000, a GeForce RTX 2080 Ti, or a GeForce GTX 1080 Ti. All of our models are trained on a single GPU. NMTeacher-Joint requires optimizing both the NMTeacher modules and the MRC model, so we use the Quadro RTX 6000 for the related experiments.

Algorithm 2 (excerpt): beam search over variable assignments.
  V ← next unfilled variable
  for CANDIDATE in (candidates for V) do
      fill V in STATE with CANDIDATE
      z ← confidence score of evaluating STATE with G_i
      if z > t then CURRSTATES.append(STATE)
  sort (descending) CURRSTATES by z
  PREVSTATES ← top w states in CURRSTATES
  return CURRSTATES

Number of Parameters. The two trainable modules (FILL and FIND) adopt BERT-base as the backbone, using 110 million parameters each. We use several downstream MRC models in our experiments, of which BERT-large is the largest (340 million). In total, NMTeacher-Joint uses at most 560 million parameters.
Hyper-parameters. We use Adam with linear warmup as our optimizer and tuned the learning rate over {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}. We set the warmup steps to either 100 or 500. We tuned the loss-balancing coefficient β (in L_ap and L_au, see Sec. 4.2) over {0.1, 0.2, 0.3, 0.4, 0.5}. We adopt a greedy tuning strategy: first select the best learning rate and fix it, then select the best coefficient β. We select parameters based on the F1 score on the dev set.
We set the rotation interval r (see Sec. 4.2) to 8. We use a batch size of 12 for BERT-l and 16 for BERT-b and ALBERT-b. Gradient accumulation is adopted to achieve these batch sizes under GPU memory constraints.
Datasets. We download both datasets from their official websites. SQuAD: https://rajpurkar.github.io/SQuAD-explorer/; Natural Questions: https://ai.google.com/research/NaturalQuestions/download. Note that we customize the settings of the NQ dataset, as we limit our scope to the MRC task. We aim to analyze the capability of NMTeacher in different scenarios, and thus choose not to use the official test set due to submission constraints (e.g., one attempt per week). We create our own dev and test sets (see Sec. 5.1).

Development Set Performance. Tables 4 and 6 in the main paper list test set performance; the corresponding development set performance can be found in Tables 11 and 12.

E Explanation Collection
Our interface for collecting semi-structured explanations with Amazon Mechanical Turk is shown in Figure 7. Annotators are required to first read a short paragraph of high-level instructions and then read five provided examples. After that, they write an explanation for a provided, answered (q, c, a) triple in a single text input box, using suggested expressions from a provided table.
Finally, annotators are required to double-check their explanation before they submit. The reward for each accepted explanation is $0.5. We automatically rejected responses that did not follow the instructions (e.g., not mentioning any variables, or quoting words that do not appear in the context). Statistics of the collected explanations on the SQuAD and NQ datasets are shown in Table 5. We constructed and modified our parser concurrently with the explanation collection process. The accuracy of semantic parsing is 91.93%, as measured by manual inspection of 35 parsed explanations (161 sentences).