Unsupervised Question Decomposition for Question Answering

We aim to improve question answering (QA) by decomposing hard questions into easier sub-questions that existing QA systems can answer. Since collecting labeled decompositions is cumbersome, we propose an unsupervised approach to produce sub-questions. Specifically, by leveraging >10M questions from Common Crawl, we learn to map from the distribution of multi-hop questions to the distribution of single-hop sub-questions. We answer sub-questions with an off-the-shelf QA model and incorporate the resulting answers in a downstream, multi-hop QA system. On a popular multi-hop QA dataset, HotpotQA, we show large improvements over a strong baseline, especially on adversarial and out-of-domain questions. Our method is generally applicable and automatically learns to decompose questions of different classes, while matching the performance of decomposition methods that rely heavily on hand-engineering and annotation.


Introduction
Question answering (QA) systems have become remarkably good at answering simple, single-hop questions but still struggle with compositional, multi-hop questions (Yang et al., 2018; Hudson & Manning, 2019). In this work, we examine if we can answer hard questions by leveraging our ability to answer simple questions. Specifically, we approach QA by breaking a hard question into a series of sub-questions that can be answered by a simple, single-hop QA system. The system's answers can then be given as input to a downstream QA system to answer the hard question, as shown in Fig. 1. Our approach thus answers the hard question in multiple, smaller steps, which can be easier than answering the hard question all at once. For example, it may be easier to answer "What profession do H. L. Mencken and Albert Camus have in common?" when given the answers to its sub-questions.

Code will be available soon. 1 Facebook AI Research, 2 New York University, 3 University College London, 4 CIFAR Azrieli Global Scholar. Correspondence to: Ethan Perez <perez@nyu.edu>.

Figure 1. Overview: Using unsupervised learning, we decompose a multi-hop question into single-hop sub-questions, whose predicted answers are given to a downstream question answering model.
Prior work in learning to decompose questions into sub-questions has relied on extractive heuristics, which generalize poorly to different domains and question types, and require human annotation (Talmor & Berant, 2018; Min et al., 2019b). In order to scale to arbitrary questions, we would require sophisticated natural language generation capabilities, which often rely on large quantities of high-quality supervised data. Instead, we find that it is possible to learn to decompose questions without supervision.
Specifically, we learn to map from the distribution of hard questions to the distribution of simpler questions. First, we automatically construct a noisy, "pseudo-decomposition" for each hard question by retrieving relevant sub-question candidates based on their similarity to the given hard question. We retrieve candidates from a corpus of 10M simple questions that we extracted from Common Crawl. Second, we train neural text generation models on that data with (1) standard sequence-to-sequence learning and (2) unsupervised sequence-to-sequence learning. The latter has the advantage that it can go beyond the noisy pairing between questions and pseudo-decompositions (see Fig. 2).
Figure 2. Unsupervised Decomposition. Step 1: We create a corpus of pseudo-decompositions D by finding candidate sub-questions from a simple question corpus S which are similar to a multi-hop question in Q. Step 2: We learn to map multi-hop questions to decompositions using Q and D as training data, via either standard or unsupervised sequence-to-sequence learning.
We use decompositions to improve multi-hop QA. We first use an off-the-shelf single-hop QA model to answer decomposed sub-questions. We then give each sub-question and its answer as additional input to a multi-hop QA model. We test our method on HOTPOTQA (Yang et al., 2018), a popular multi-hop QA benchmark.
Our contributions are as follows. First, QA models relying on decompositions improve accuracy over a strong baseline by 3.1 F1 on the original dev set, 11 F1 on the multi-hop dev set from Jiang & Bansal (2019a), and 10 F1 on the out-of-domain dev set from Min et al. (2019b). Our most effective decomposition model is a 12-block transformer encoder-decoder (Vaswani et al., 2017) trained using unsupervised sequence-to-sequence learning, involving masked language modeling, denoising, and back-translation objectives (Lample & Conneau, 2019). Second, our method is competitive with state-of-the-art methods SAE (Tu et al., 2020) and HGN (Fang et al., 2019) which leverage strong supervision. Third, we show that our approach automatically learns to generate useful decompositions for all 4 question types in HOTPOTQA, highlighting the general nature of our approach. In our analysis, we explore how sub-questions improve multi-hop QA, and we provide qualitative examples that highlight how question decomposition adds a form of interpretability to black-box QA models. Our ablations show that each component of our pipeline contributes to QA performance. Overall, we find that it is possible to successfully decompose questions without any supervision and that doing so improves QA.

Method
We now formulate the problem and overview our high-level approach, with details in the following section. We aim to leverage a QA model that is accurate on simple questions to answer hard questions, without using supervised question decompositions. Here, we consider simple questions to be "single-hop" questions that require reasoning over one paragraph or piece of evidence, and we consider hard questions to be "multi-hop." Our aim is then to train a multi-hop QA model M to provide the correct answer a to a multi-hop question q about a given context c (e.g., several paragraphs). Normally, we would train M to maximize log p_M(a | c, q). To help M, we leverage a single-hop QA model that may be queried with sub-questions s_1, ..., s_N, whose "sub-answers" a_1, ..., a_N may be provided to the multi-hop QA model. M may then instead maximize the (potentially easier) objective log p_M(a | c, q, [s_1, a_1], ..., [s_N, a_N]).
Supervised decomposition models learn to map each question q ∈ Q to a decomposition d = [s_1; ...; s_N] of N sub-questions s_n ∈ S using annotated (q, d) examples. In this work, we do not assume access to strong (q, d) supervision.
To leverage the single-hop QA model without supervision, we follow a three-stage approach: 1) map a question q into sub-questions s_1, ..., s_N via unsupervised techniques, 2) find sub-answers a_1, ..., a_N with the single-hop QA model, and 3) provide s_1, ..., s_N and a_1, ..., a_N to help predict a.

Unsupervised Question Decomposition
To train a decomposition model, we need appropriate training data. We assume access to a hard question corpus Q and a simple question corpus S. Instead of using supervised (q, d) training examples, we design an algorithm that constructs pseudo-decompositions d' to form (q, d') pairs from Q and S using an unsupervised approach ( §2.1.1). We then train a model to map q to a decomposition. We explore learning to decompose with standard and unsupervised sequence-to-sequence learning ( §2.1.2).

CREATING PSEUDO-DECOMPOSITIONS
For each q ∈ Q, we construct a pseudo-decomposition set d' = {s_1, ..., s_N} by retrieving simple questions s from S. We concatenate all N simple questions in d' to form the pseudo-decomposition used downstream. N may be chosen based on the task or vary based on q. To retrieve useful simple questions for answering q, we face a joint optimization problem. We want sub-questions that are both (i) similar to q according to some metric f and (ii) maximally diverse:

d'* = argmax_{d' ⊂ S} [ Σ_{s_i ∈ d'} f(q, s_i) − Σ_{s_i, s_j ∈ d', i ≠ j} f(s_i, s_j) ]   (1)

LEARNING TO DECOMPOSE
Having now retrieved relevant pseudo-decompositions, we examine different ways to learn to decompose (with implementation details in the following section):

No Learning We use pseudo-decompositions directly, employing retrieved sub-questions in downstream QA.
Sequence-to-Sequence (Seq2Seq) We train a Seq2Seq model with parameters θ to maximize log p_θ(d' | q).
Unsupervised Sequence-to-Sequence (USeq2Seq) We start with paired (q, d') examples but do not learn from the pairing, because the pairing is noisy. We use unsupervised sequence-to-sequence learning to learn a q → d' mapping instead of training directly on the noisy pairing.

Answering Sub-Questions
To answer the generated sub-questions, we use an off-the-shelf QA model. The QA model may answer sub-questions using any free-form text (i.e., a word, phrase, sentence, etc.). Any QA model is suitable, so long as it can accurately answer simple questions in S. We thus leverage good accuracy on questions in S to help QA models on questions in Q.

QA using Decompositions
Downstream QA systems may use sub-questions and sub-answers in various ways. We add sub-questions and sub-answers as auxiliary input for a downstream QA model to incorporate in its processing. We now describe the implementation details of our approach outlined above.
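As a minimal sketch of "auxiliary input," sub-questions and sub-answers can simply be concatenated onto the QA model's input sequence. The separator token and ordering below are illustrative assumptions, not the exact input format used in our experiments:

```python
def build_qa_input(question, subqas, context, sep=" [SEP] "):
    """Append each (sub-question, sub-answer) pair to the original
    question before the context. NOTE: the separator token and field
    ordering are assumptions for illustration only."""
    pairs = sep.join(f"{sq} {sa}" for sq, sa in subqas)
    return f"{question}{sep}{pairs}{sep}{context}"

example = build_qa_input(
    "What profession do H. L. Mencken and Albert Camus have in common?",
    [("What was H. L. Mencken's profession?", "He was a journalist."),
     ("Who was Albert Camus?", "Camus was a French novelist.")],
    "H. L. Mencken was an American journalist and essayist. ...",
)
```

The downstream model then conditions on q, the sub-question/sub-answer pairs, and c in a single forward pass.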

Question Answering Task
We test unsupervised decompositions on HOTPOTQA (Yang et al., 2018), a standard benchmark for multi-hop QA. We use HOTPOTQA's "Distractor Setting," which provides 10 context paragraphs from Wikipedia. Two (or more) paragraphs contain question-relevant sentences called "supporting facts," and the remaining paragraphs are irrelevant, "distractor paragraphs." Answers in HOTPOTQA are either yes, no, or a span of text in an input paragraph. Accuracy is measured with F1 and Exact Match (EM) scores between the predicted and gold spans.

QUESTION DATA
We use HOTPOTQA questions as our initial multi-hop, hard question corpus Q. We use SQUAD 2 questions as our initial single-hop, simple question corpus S. However, our pseudo-decomposition corpus should be large, as the corpus will be used to train neural Seq2Seq models, which are data hungry. A larger |S| will also improve the relevance of retrieved simple questions to the hard question. Thus, we take inspiration from work in machine translation on parallel corpus mining (Xu & Koehn, 2017; Artetxe & Schwenk, 2019) and in unsupervised QA (Lewis et al., 2019). We augment Q and S by mining more questions from Common Crawl. We choose sentences which start with common "wh"-words and end with "?". Next, we train a FastText classifier to classify between 60K questions sampled from Common Crawl, SQUAD 2, and HOTPOTQA. Then, we classify Common Crawl questions, adding questions classified as SQUAD 2 questions to S and questions classified as HOTPOTQA questions to Q. Question mining greatly increases the number of single-hop questions (130K → 10.1M) and multi-hop questions (90K → 2.4M). Thus, our unsupervised approach allows us to make use of far more data than supervised counterparts.
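The first mining step above is a simple surface filter. A sketch of that filter (the exact wh-word list is an assumption; the subsequent FastText classification step is not shown):

```python
# Heuristic candidate filter for mining questions from raw sentences:
# keep sentences starting with a common wh-word and ending with "?".
WH_WORDS = ("what", "who", "where", "when", "which", "why", "how", "whose", "whom")

def is_candidate_question(sentence: str) -> bool:
    """Return True if the sentence looks like a question worth keeping.
    Survivors would then be labeled single- vs. multi-hop by a trained
    FastText classifier (not shown here)."""
    s = sentence.strip()
    return s.endswith("?") and s.lower().startswith(WH_WORDS)

sents = ["Who wrote Dubliners?", "The sky is blue.", "How old is K2?!",
         "Which river is longer, the Nile or the Amazon?"]
candidates = [s for s in sents if is_candidate_question(s)]
```

In the full pipeline, this filter runs over Common Crawl sentences before classification.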

CREATING PSEUDO-DECOMPOSITIONS
To create pseudo-decompositions, we set the number N of sub-questions per question to 2, as questions in HOTPOTQA usually involve two reasoning hops. In Appendix §A.1, we discuss how our method works when N varies per question.
Similarity-based Retrieval To retrieve question-relevant sub-questions, we embed any text t into a vector v_t by summing the FastText vectors for words in t. We use cosine similarity as our similarity metric f. Let q be a multi-hop question used to retrieve pseudo-decomposition (s*_1, s*_2), and let v̂ denote the unit vector of v. Since N = 2, Eq. 1 reduces to:

(s*_1, s*_2) = argmax_{s_1, s_2 ∈ S} [ v̂_q · v̂_{s_1} + v̂_q · v̂_{s_2} − v̂_{s_1} · v̂_{s_2} ]   (2)

The last term requires O(|S|^2) comparisons, which is expensive as |S| is large (>10M). Instead of solving Eq. 2 exactly, we approximate it by first retrieving a smaller set of candidate sub-questions with high cosine similarity to v_q, then searching over pairs within that candidate set.
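The pruned pair search can be sketched as follows. Vectors are assumed to be unit-normalized bag-of-word embeddings (e.g., summed FastText vectors); `top_k` is an illustrative pruning parameter, not the value used in our experiments:

```python
import numpy as np

def retrieve_pair(v_q, cand_vecs, top_k=100):
    """Approximate the N = 2 objective: among the top_k candidates most
    similar to the question, pick the pair (s1, s2) maximizing similarity
    to q minus similarity to each other. Pruning to top_k avoids the
    O(|S|^2) pair search over the full corpus."""
    sims = cand_vecs @ v_q                  # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:top_k]         # prune to top_k candidates
    best, best_score = None, -np.inf
    for a in range(len(top)):
        for b in range(a + 1, len(top)):
            i, j = top[a], top[b]
            # similarity to q, regularized by pairwise similarity (diversity)
            score = sims[i] + sims[j] - cand_vecs[i] @ cand_vecs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

In practice the candidate embeddings would be precomputed and the top-k lookup done with an approximate nearest-neighbor index.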

Random Retrieval
For comparison, we test random pseudo-decompositions, where we randomly retrieve s_1, ..., s_N by sampling from S. USeq2Seq trained on random d' = [s_1; ...; s_N] should at minimum learn to map q to multiple simple questions.
Editing Pseudo-Decompositions Since the sub-questions are retrieval-based, the sub-questions are often not about the same entities as q. As a post-processing step, we replace entities in (s_1, s_2) with entities from q. We find all entities in (s_1, s_2) that do not appear in q using spaCy (Honnibal & Montani, 2017). We replace these entities with a random entity from q with the same type (e.g., "Date" or "Location") if and only if one exists. We use entity replacement on pseudo-decompositions from both random and similarity-based retrieval.
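A minimal sketch of the entity-replacement step. Here entity spans and types are hand-specified for illustration; in the full pipeline they come from spaCy's NER:

```python
import random

def replace_entities(sub_questions, sub_entities, question_entities, seed=0):
    """Swap each entity in a sub-question that does not appear in the
    multi-hop question for a random same-typed entity from the question,
    if one exists. Entities are given as (text, type) pairs."""
    rng = random.Random(seed)
    q_texts = {text for text, _ in question_entities}
    by_type = {}
    for text, etype in question_entities:
        by_type.setdefault(etype, []).append(text)
    edited = []
    for sq, ents in zip(sub_questions, sub_entities):
        for text, etype in ents:
            if text not in q_texts and by_type.get(etype):
                sq = sq.replace(text, rng.choice(by_type[etype]))
        edited.append(sq)
    return edited
```

If the multi-hop question has no entity of the required type, the retrieved sub-question is left unchanged.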

UNSUPERVISED DECOMPOSITION MODELS
Pre-training Pre-training is a key ingredient for unsupervised Seq2Seq methods (Artetxe et al., 2018; Lample et al., 2018), so we initialize all decomposition models with the same pre-trained weights, regardless of training method (Seq2Seq or USeq2Seq). We warm-start our pre-training with the pre-trained, English Masked Language Model (MLM) from Lample & Conneau (2019), a 12-block decoder-only transformer model (Vaswani et al., 2017) trained to predict masked-out words on Toronto Books Corpus (Zhu et al., 2015) and Wikipedia. We train the model with the MLM objective for one epoch on the augmented corpus Q (2.4M questions), while also training on decompositions D formed via random retrieval from S. For our pre-trained encoder-decoder, we initialize a 6-block encoder with the first 6 MLM blocks, and we initialize a 6-block decoder with the last 6 MLM blocks, randomly initializing the remaining weights as in Lample & Conneau (2019).

Seq2Seq
We fine-tune the pre-trained encoder-decoder using maximum likelihood. We stop training based on validation BLEU (Papineni et al., 2002) between generated decompositions and pseudo-decompositions.
USeq2Seq We follow the approach of Lample & Conneau (2019) to unsupervised translation. Training follows two stages: (1) MLM pre-training on the training corpora (described above), followed by (2) training simultaneously with denoising and back-translation objectives. For denoising, we produce a noisy input d̂ by randomly masking, dropping, and locally shuffling tokens in d ∼ D, and we train a model with parameters θ to maximize log p_θ(d | d̂). We likewise maximize log p_θ(q | q̂). For back-translation, we generate a multi-hop question q̂ for a decomposition d ∼ D, and we maximize log p_θ(d | q̂). Similarly, we maximize log p_θ(q | d̂). To stop training without supervision, we use a modified version of round-trip BLEU (Lample et al., 2018) (see Appendix §B.1 for details). We train with denoising and back-translation on smaller corpora of HOTPOTQA questions (Q) and their pseudo-decompositions (D).
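The denoising corruption can be sketched as below. The masking/dropping rates and shuffle window are illustrative, not the exact hyperparameters we use; the window-limited shuffle follows the noise model of Lample et al. (2018):

```python
import random

def add_noise(tokens, p_drop=0.1, p_mask=0.1, window=3, seed=0):
    """Corrupt a token sequence for the denoising objective: randomly
    drop tokens, replace others with a mask token, then locally shuffle
    so each surviving token moves at most `window` positions."""
    rng = random.Random(seed)
    kept = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                                # token dropout
        kept.append("[MASK]" if r < p_drop + p_mask else tok)
    # local shuffle via noisy sort keys, bounding token displacement
    keys = [i + rng.uniform(0, window) for i in range(len(kept))]
    return [tok for _, tok in sorted(zip(keys, kept), key=lambda kv: kv[0])]
```

The model is then trained to reconstruct the clean sequence from its corrupted version.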

Single-hop Question Answering Model
We train our single-hop QA model following prior work from Min et al. (2019b) on HOTPOTQA.

Model Architecture We fine-tune a pre-trained model to take a question and several paragraphs and predict the answer, similar to the single-hop QA model from Min et al. (2019a). The model computes a separate forward pass on each paragraph (with the question). For each paragraph, the model learns to predict the answer span if the paragraph contains the answer and to predict "no answer" otherwise. We treat yes and no predictions as spans within the passage (prepended to each paragraph), as in Nie et al. (2019) on HOTPOTQA. During inference, for the final softmax, we consider all paragraphs as a single chunk. Similar to Clark & Gardner (2018), we subtract a paragraph's "no answer" logit from the logits of all spans in that paragraph, to reduce or increase span probabilities accordingly. In other words, we compute the probability p(s_p) of each span s_p in a paragraph p ∈ {1, ..., P} using the predicted span logit l(s_p) and "no answer" paragraph logit n(p) as follows:

p(s_p) = exp(l(s_p) − n(p)) / Σ_{p'=1}^{P} Σ_{s_{p'}} exp(l(s_{p'}) − n(p'))

We use ROBERTA LARGE (Liu et al., 2019) as our pre-trained initialization. Later, we also experiment with using the BERT BASE ensemble from Min et al. (2019b).
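The no-answer-adjusted softmax above can be sketched directly; logit values here are toy numbers:

```python
import math

def span_probs(paragraphs):
    """Single softmax over all spans across all paragraphs, after
    subtracting each paragraph's 'no answer' logit n(p) from its span
    logits l(s_p). `paragraphs` is a list of (span_logits, no_answer_logit)
    pairs, one per paragraph."""
    adjusted = [l - n for span_logits, n in paragraphs for l in span_logits]
    m = max(adjusted)                       # max-subtraction for stability
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]

# two paragraphs: the second has a high 'no answer' logit, suppressing its span
probs = span_probs([([2.0, 0.5], 1.0), ([1.5], 3.0)])
```

A confident "no answer" prediction (large n(p)) thus pushes down every span in that paragraph relative to spans elsewhere.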
Training Data and Ensembling Similar to Min et al. (2019b), we train an ensemble of 2 single-hop QA models using data from SQUAD 2 and HOTPOTQA questions labeled as "easy" (single-hop). To ensemble, we average the logits of the two models before predicting the answer. SQUAD is a single-paragraph QA task, so we adapt SQUAD to the multi-paragraph setting by retrieving distractor paragraphs from Wikipedia for each question. We use the TFIDF retriever from DrQA (Chen et al., 2017) to retrieve 2 distractor paragraphs, which we add to the input for one model in the ensemble. We drop words from the question with a 5% probability to help the model handle any ill-formed sub-questions. We use the single-hop QA ensemble as a black-box model once trained, never training the model on multi-hop questions.

Returned Text
We have the single-hop QA model return the sentence containing the model's predicted answer span, alongside the sub-questions. Later, we compare against alternatives, i.e., returning the predicted answer span without its context or not returning sub-questions.

Results on Question Answering
We compare variants of our approach that use different learning methods and different pseudo-aligned training sets. As a baseline, we compare ROBERTA with decompositions to a ROBERTA model that does not use decompositions but is identical in all other respects. We train the baseline for 2 epochs, sweeping over batch size ∈ {64, 128}, learning rate ∈ {1×10^−5, 1.5×10^−5, 2×10^−5, 3×10^−5}, and weight decay ∈ {0, 0.1, 0.01, 0.001}; we choose the hyperparameters that perform best on our dev set. We then use the best hyperparameters for the baseline to train our ROBERTA models with decompositions.
We report results on 3 versions of the dev set: (1) the original version, (2) the multi-hop version from Jiang & Bansal (2019a), which created some distractor paragraphs adversarially to test multi-hop reasoning, and (3) the out-of-domain version from Min et al. (2019b), which retrieved distractor paragraphs using the same procedure as the original version, but excluded paragraphs in the original version.
Main Results Table 1 shows how unsupervised decompositions affect QA. Our ROBERTA baseline performs quite well on HOTPOTQA (77.0 F1), despite processing each paragraph separately, which prohibits inter-paragraph reasoning. The result is in line with prior work which found that a version of our baseline QA model using BERT (Devlin et al., 2019) does well on HOTPOTQA by exploiting single-hop reasoning shortcuts (Min et al., 2019a). We achieve significant gains over our strong baseline by leveraging decompositions from our best decomposition model, trained with USeq2Seq on FastText pseudo-decompositions; we find a 3.1 F1 gain on the original dev set.

Table 3. Ablation Study: QA model F1 when trained with different sub-answers: the sentence containing the predicted sub-answer, the predicted sub-answer span, and a random entity from the context. We also train QA models with or without sub-questions and sub-answers.
Our method is also competitive with the state-of-the-art methods SAE (Tu et al., 2020) and HGN (Fang et al., 2019), which both (unlike our approach) learn from additional, strong supervision about which sentences are necessary to answer the question.

Question Type Breakdown
To understand where decompositions help, we break down QA performance across 4 question types from Min et al. (2019b). "Bridge" questions ask about an entity not explicitly mentioned in the question ("When was Erik Watts' father born?"). "Intersection" questions ask to find an entity that satisfies multiple separate conditions ("Who was on CNBC and Fox News?"). "Comparison" questions ask to compare a property of two entities ("Which is taller, Momhil Sar or K2?"). "Single-hop" questions are likely answerable using single-hop shortcuts or single-paragraph reasoning ("Where is Electric Six from?"). We split the original dev set into the 4 types using the supervised type classifier from Min et al. (2019b).

Ablations We also train the multi-hop QA model with different combinations of sub-question and sub-answer inputs, as shown in Table 3. Sub-answers are crucial to improving QA, as sub-questions with no answers or random answers do not help (76.9 vs. 77.0 F1 for the baseline). Only when sub-answers are provided do we see improved QA, with or without sub-questions (80.1 and 80.2 F1, respectively). It is important to provide the sentence containing the predicted answer span instead of the answer span alone (80.1 vs. 77.8 F1, respectively), though the answer span alone still improves over the baseline (77.0 F1).

How Do Decompositions Help?
Decompositions help to answer questions by retrieving important supporting evidence to answer questions. Fig. 3 shows that multi-hop QA accuracy increases when the subanswer sentences are the "supporting facts" or sentences needed to answer the question, as annotated by HOTPOTQA. We retrieve supporting facts without learning to predict them with strong supervision, unlike many state-of-the-art models (Tu et al., 2020;Fang et al., 2019;Nie et al., 2019).

Example Decompositions
To illustrate how decompositions help QA, Table 4 shows example sub-questions from our best decomposition model with predicted sub-answers.

Table 4. Example sub-questions generated by our model, along with predicted sub-answer sentences (answer span underlined) and final predicted answer.

Unsupervised Decomposition Model
Intrinsic Evaluation of Decompositions We evaluate the quality of decompositions on other metrics aside from downstream QA. To measure the fluency of decompositions, we compute the likelihood of decompositions using the pre-trained GPT-2 language model (Radford et al., 2019). We train a classifier on the question well-formedness dataset of Faruqui & Das (2018), and we use the classifier to estimate the proportion of sub-questions that are well-formed. We measure how abstractive decompositions are by computing (i) the token Levenshtein distance between the multi-hop question and its generated decomposition and (ii) the ratio between the length of the decomposition and the length of the multi-hop question. We compare our best decomposition model against the supervised+heuristic decompositions from DECOMPRC (Min et al., 2019b) in Table 5.
Unsupervised decompositions are both more natural and more well-formed than decompositions from DECOMPRC. Unsupervised decompositions are also closer in edit distance and length to the multi-hop question, consistent with our observation that our decomposition model is largely extractive.
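The two abstractiveness metrics above are straightforward to compute; a sketch using the standard dynamic-programming edit distance over tokens:

```python
def token_levenshtein(a, b):
    """Token-level edit distance between two token lists (standard
    single-row DP: insertion, deletion, substitution all cost 1)."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                     dp[j - 1] + 1,      # insert
                                     prev + (ta != tb))  # substitute
    return dp[len(b)]

def abstractiveness(question, decomposition):
    """Return (token edit distance, decomposition/question length ratio),
    the two metrics described above. Whitespace tokenization is an
    illustrative simplification."""
    q, d = question.split(), decomposition.split()
    return token_levenshtein(q, d), len(d) / len(q)
```

Lower edit distance and a length ratio near 2 (for two sub-questions) indicate a largely extractive decomposition.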

Quality of Decomposition Model
Another way to test the quality of the decomposition model is to test if the model places higher probability on decompositions that are more helpful for downstream QA. We generate N = 5 hypotheses from our best decomposition model using beam search, and we train a multi-hop QA model to use the n-th ranked hypothesis as a question decomposition (Fig. 4, left). QA accuracy decreases as we use lower probability decompositions, but accuracy remains relatively robust, at most decreasing from 80.1 to 79.3 F1. The limited drop suggests that decompositions are still useful if they are among the model's top hypotheses, another indication that our model is trained well for decomposition.

Single-hop Question Answering Model
Sub-Answer Confidence Figure 4 (right) shows that the model's sub-answer confidence correlates with downstream multi-hop QA performance for all HOTPOTQA dev sets. A low confidence sub-answer may be indicative of (i) an unanswerable or ill-formed sub-question or (ii) a sub-answer that is more likely to be incorrect. In both cases, the single-hop QA model is less likely to retrieve the useful supporting evidence to answer the multi-hop question.

Multi-hop Question Answering Model
Varying the Base Model To understand how decompositions impact performance as the multi-hop QA model gets stronger, we vary the base pre-trained model.

Related Work
Answering complicated questions has been a long-standing challenge in natural language processing. To this end, prior work has explored decomposing questions with supervision or heuristic algorithms. IBM Watson (Ferrucci et al., 2010) decomposes questions into sub-questions in multiple ways or not at all. DECOMPRC (Min et al., 2019b) largely frames sub-questions as extractive spans of a multi-hop question, learning to predict span-based sub-questions via supervised learning on human annotations. In other cases, DECOMPRC decomposes a multi-hop question using a heuristic algorithm, or does not decompose at all. Watson and DECOMPRC use special case handling to decompose different questions, while our algorithm is fully automated and requires minimal hand-engineering.
More traditional, semantic parsing methods map questions to compositional programs, whose sub-programs can be viewed as question decompositions in a formal language (Talmor & Berant, 2018; Wolfson et al., 2020). Examples include classical QA systems like SHRDLU (Winograd, 1972) and LUNAR (Woods et al., 1974), as well as neural Seq2Seq semantic parsers (Dong & Lapata, 2016) and neural module networks (Andreas et al., 2015). Such methods usually require strong, program-level supervision to generate programs, as in visual QA (Johnson et al., 2017b) and on HOTPOTQA (Jiang & Bansal, 2019b). Some models use other forms of strong supervision, e.g., predicting the "supporting evidence" to answer a question annotated by HOTPOTQA. Such an approach is taken by SAE (Tu et al., 2020) and HGN (Fang et al., 2019), whose methods may be combined with our approach.
Unsupervised decomposition complements strongly and weakly supervised decomposition approaches. Our unsupervised approach enables methods to leverage millions of otherwise unusable questions, similar to work on unsupervised QA (Lewis et al., 2019). When decomposition examples exist, supervised and unsupervised learning can be used in tandem to learn from both labeled and unlabeled examples. Such semi-supervised methods outperform supervised learning for tasks like machine translation (Sennrich et al., 2016). Other work on weakly supervised question generation uses a downstream QA model's accuracy as a signal for learning to generate useful questions. Weakly supervised question generation often uses reinforcement learning (Nogueira & Cho, 2017;Wang & Lake, 2019;Strub et al., 2017;Das et al., 2017;Liang et al., 2018), where an unsupervised initialization can greatly mitigate the issues of exploring from scratch (Jaderberg et al., 2017).

Conclusion
We proposed an algorithm that decomposes questions without supervision, using 3 stages: (1) learning to decompose using pseudo-decompositions without supervision, (2) answering sub-questions with an off-the-shelf QA system, and (3) answering hard questions more accurately using sub-questions and their answers as additional input. When evaluated on HOTPOTQA, a standard benchmark for multihop QA, our approach significantly improved accuracy over an equivalent model that did not use decompositions. Our approach relies only on the final answer as supervision but works as effectively as state-of-the-art methods that rely on strong supervision, such as supporting fact labels or example decompositions. Qualitatively, we found that unsupervised decomposition resulted in fluent sub-questions whose answers often match the annotated supporting facts in HOTPOTQA. Our unsupervised decompositions are largely extractive, which is effective for compositional, multi-hop questions but not all complex questions, showing room for future work. Overall, this work opens up exciting avenues for leveraging methods in unsupervised learning and natural language generation to improve the interpretability and generalization of machine learning systems.

A. Pseudo-Decompositions
Tables 8-13 show examples of pseudo-decompositions and learned decompositions from various models.

A.1. Variable Length Pseudo-Decompositions
In §3.2.2, we leveraged domain knowledge about the task to fix the pseudo-decomposition length N = 2. A general algorithm for creating pseudo-decompositions should find a suitable N for each question. We find that Eq. 1 in §2.1.1 always results in decompositions of length N = 2, as the regularization term grows quickly with N. Thus, we test another formulation based on Euclidean distance:

d'* = argmin_{d' ⊂ S} || v_q − Σ_{s ∈ d'} v_s ||_2   (4)

We create pseudo-decompositions in a similar way as before, first finding a set of candidate sub-questions S' ⊂ S with high cosine similarity to v_q, then performing beam search up to a maximum value of N. We test pseudo-decomposition formulations by creating synthetic compositional questions by combining 2-3 single-hop questions with "and." We then measure the ranking of the correct decomposition (a concatenation of the single-hop questions). For N = 2, both methods perform well, but Eq. 1 does not work for decompositions where N = 3, whereas Eq. 4 does, achieving a mean reciprocal rank of 30%. However, Eq. 1 outperforms Eq. 4 on HOTPOTQA, e.g., achieving 79.9 vs. 79.4 F1 when using the BERT BASE ensemble from Min et al. (2019b) to answer sub-questions. Eq. 1 is also faster to compute and easier to scale. Moreover, Eq. 4 requires an embedding space where summing sub-question representations is meaningful, whereas Eq. 1 only requires embeddings that encode semantic similarity. Thus, we adopt Eq. 1 for our main experiments. Table 8 contains an example where the variable-length decomposition method mentioned above produces a three-sub-question decomposition whereas the other methods are fixed to two sub-questions.
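The Euclidean-distance beam search can be sketched as below; the candidate set size, beam width, and maximum N are illustrative, not our experimental values:

```python
import numpy as np

def euclidean_beam(v_q, cand_vecs, max_n=3, beam=4):
    """Beam search over candidate sub-question sets, minimizing the
    Euclidean distance between v_q and the sum of the chosen
    sub-question vectors. Returns the indices of the best set found
    (empty if no set beats the empty sum, a simplification)."""
    beams = [((), np.zeros_like(v_q))]          # (indices, running sum)
    best = ((), np.zeros_like(v_q))
    for _ in range(max_n):
        nxt = []
        for idxs, total in beams:
            start = idxs[-1] + 1 if idxs else 0  # enforce increasing indices
            for i in range(start, len(cand_vecs)):
                nxt.append((idxs + (i,), total + cand_vecs[i]))
        if not nxt:
            break
        nxt.sort(key=lambda b: np.linalg.norm(v_q - b[1]))
        beams = nxt[:beam]
        if np.linalg.norm(v_q - beams[0][1]) < np.linalg.norm(v_q - best[1]):
            best = beams[0]
    return list(best[0])
```

Unlike the Eq. 1 pair search, this requires the sum of sub-question embeddings to approximate the question embedding, matching the discussion above.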

A.2. Impact of Question Corpus Size
In addition to our previous results on FastText vs. Random pseudo-decompositions, we found it important to use a large question corpus to create pseudo-decompositions. QA F1 increased from 79.2 to 80.1 when we trained decomposition models on pseudo-decompositions comprised of questions retrieved from Common Crawl (>10M questions) rather than only SQUAD 2 (∼130K questions), using an appropriately larger beam size (100 → 1000).

SAE (Tu et al., 2020): 80.2 / 61.1 / 62.6. HGN (Fang et al., 2019): 82.2 / 78.9 / 76.1.

Table 7. QA F1 scores for all combinations of learning methods and pseudo-decomposition retrieval methods that we tried.

Table 7 shows QA results with pseudo-decompositions retrieved using sum-of-bag-of-words representations from FastText, TFIDF, and BERT LARGE first-layer hidden states. We also vary the learning method and include results for Curriculum Seq2Seq (CSeq2Seq), where we initialize the USeq2Seq approach with the Seq2Seq model trained on the same data.

B.1. Unsupervised Stopping Criterion
To stop USeq2Seq training, we use an unsupervised stopping criterion to avoid relying on a supervised validation set of decompositions. We generate a decomposition d̂ for a multi-hop question q, and we measure BLEU between q and the model-generated question q̂ for d̂, similar to round-trip BLEU in unsupervised translation (Lample et al., 2018). We scale round-trip BLEU score by the fraction of "good" decompositions, where a good decomposition has (1) 2 sub-questions (question marks), (2) no sub-question which contains all words in the multi-hop question, and (3) no sub-question longer than the multi-hop question. Without scaling, decomposition models achieve perfect round-trip BLEU by copying the multi-hop question as the decomposition. We measure scaled BLEU across multi-hop questions in HOTPOTQA dev, and we stop training when the metric does not increase for 3 consecutive epochs.
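The three "good decomposition" checks can be sketched as a predicate; regex word tokenization here is an illustrative simplification:

```python
import re

def good_decomposition(question: str, decomposition: str) -> bool:
    """Return True if the decomposition passes the three checks used to
    scale round-trip BLEU: exactly two sub-questions, no sub-question
    containing every word of the multi-hop question, and no sub-question
    longer than the multi-hop question."""
    subqs = [s for s in (p.strip() for p in decomposition.split("?")) if s]
    if len(subqs) != 2:
        return False
    q_words = re.findall(r"\w+", question.lower())
    for sq in subqs:
        s_words = re.findall(r"\w+", sq.lower())
        if set(q_words) <= set(s_words):     # copies the whole question
            return False
        if len(s_words) > len(q_words):      # longer than the question
            return False
    return True
```

The fraction of dev decompositions passing this predicate multiplies the round-trip BLEU, so a model that merely copies the question scores zero.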
It is possible to stop training the decomposition model  based on downstream QA accuracy. However, training a QA model on each decomposition model checkpoint (1) is computationally expensive and (2) ties decompositions to a specific, downstream QA model. In Figure 5, we show downstream QA results across various USeq2Seq checkpoints when using the BERT BASE single-hop QA ensemble from Min et al. (2019b). The unsupervised stopping criterion does not significantly hurt downstream QA compared to using a weakly-supervised stopping criterion.

B.2. Training Hyperparameters
MLM Pre-training We pre-train our encoder-decoder distributed across 8 DGX-1 machines, each with 8×32GB NVIDIA V100 GPUs interconnected by Infiniband. We pre-train using the largest possible batch size (1536), and we choose the best learning rate (3×10^−5) based on training loss after a small number of iterations. We chose a maximum sequence length of 128. We keep other hyperparameters identical to those from Lample & Conneau (2019) used in unsupervised translation.

USeq2Seq
We train each decomposition model with training distributed across eight 32GB NVIDIA V100 GPUs. We choose the largest possible batch size (256) and then the largest learning rate that results in stable training (3 × 10 −5). Other hyperparameters are the same as in Lample & Conneau (2019).

Seq2Seq
We use a large batch size (1024) and choose the largest learning rate that results in stable training across many pseudo-decomposition training corpora (1 × 10 −4). We keep other training settings and hyperparameters the same as for USeq2Seq.
For BERT BASE, we thus choose a learning rate of 2 × 10 −5 and a batch size of 16; for BERT LARGE, we use the whole-word masking model with a learning rate of 2 × 10 −5 and a batch size of 32. We train all QA models with mixed-precision floating point arithmetic (Micikevicius et al., 2018), distributing training across eight 32GB NVIDIA V100 GPUs.

Here, we use the original bridge/comparison splits from HOTPOTQA, which do not have a one-hop category and categorize intersection questions as bridge. For the original dev set, the improvement with decompositions is greater for comparison questions than for bridge questions. The multi-hop set does not alter comparison questions from the original version, so these scores do not change much.
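For reference, the training settings stated in this appendix can be collected in one place. The dict below is only an illustrative summary of the numbers given in the text; the key names are our own, not a released configuration.

```python
# Training settings as stated in the text (illustrative summary only;
# key names are hypothetical, not from a released config).
TRAIN_CONFIG = {
    "mlm_pretraining": {"batch_size": 1536, "learning_rate": 3e-5,
                        "max_seq_length": 128},
    "useq2seq":        {"batch_size": 256,  "learning_rate": 3e-5},
    "seq2seq":         {"batch_size": 1024, "learning_rate": 1e-4},
    "qa_bert_base":    {"batch_size": 16,   "learning_rate": 2e-5},
    "qa_bert_large":   {"batch_size": 32,   "learning_rate": 2e-5,
                        "whole_word_masking": True},
}
```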

C.3. Improvements across Detailed Question Types
To better understand where decompositions improve QA, we show the improvement across various fine-grained splits of the evaluation sets in Figures 7-11. Improvements by question word vary across dev sets.

Figure 11. Performance difference between when the QA model does vs. does not use decompositions, stratified by whether the gold final answer is in a sub-answer sentence. We find a larger improvement over the baseline when the gold answer is contained in a sub-answer sentence.

Table 8. Various decomposition methods for the question "What is the name of the singer who's song was released as the lead single from the album "Confessions," and that had popular song stuck behind for eight consecutive weeks?" Here, the Variable USeq2Seq model decomposes the question into three sub-questions rather than two.
Q: Are both Coldplay and Pierre Bouvier from the same country?

USeq2Seq + Random
Sub-Q1 why are both coldplay and pierre bouvier from the same country? Sub-A1 Coldplay are a British rock band formed in 1996 by lead vocalist and keyboardist Chris Martin and lead guitarist Jonny Buckland at University College London (UCL).
Sub-Q2 what is the purpose of a speech? Sub-A2 Pierre Charles Bouvier (born 9 May 1979) is a Canadian singer, songwriter, musician, composer and actor who is best known as the lead singer and guitarist of the rock band Simple Plan.

USeq2Seq + FastText
Sub-Q1 where are coldplay and coldplay from? Sub-A1 Coldplay are a British rock band formed in 1996 by lead vocalist and keyboardist Chris Martin and lead guitarist Jonny Buckland at University College London (UCL).
Sub-Q2 what country is pierre bouvier from? Sub-A2 Pierre Charles Bouvier (born 9 May 1979) is a Canadian singer, songwriter, musician, composer and actor who is best known as the lead singer and guitarist of the rock band Simple Plan.

Seq2Seq + Random
Sub-Q1 what is the maximum number of students allowed to take part in the program? Sub-Q2 where are the french alps? Sub-A2 St Pierre is a former parish and hamlet in Monmouthshire, south east Wales, 3 mi south west of Chepstow and adjacent to the Severn estuary.

DecompRC
Sub-Q1 is coldplay from which country? Sub-A1 Coldplay are a British rock band formed in 1996 by lead vocalist and keyboardist Chris Martin and lead guitarist Jonny Buckland at University College London (UCL). Sub-Q2 is pierre bouvier from which country? Sub-A2 Pierre Charles Bouvier (born 9 May 1979) is a Canadian singer, songwriter, musician, composer and actor who is best known as the lead singer and guitarist of the rock band Simple Plan.

USeq2Seq + FastText
Sub-Q1 who are similar musical artists to coldplay? Sub-A1 pierre charles bouvier (born 9 may 1979) is a canadian singer, songwriter, musician, composer and actor who is best known as the lead singer and guitarist of the rock band simple plan. Sub-Q2 where is pierre bouvier from? Sub-A2 pierre charles bouvier (born 9 may 1979) is a canadian singer, songwriter, musician, composer and actor who is best known as the lead singer and guitarist of the rock band simple plan. Table 9. Various decomposition methods for the question "Are both Coldplay and Pierre Bouvier from the same country?"