Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text

This paper focuses on inferential machine comprehension, which aims to fully understand the meaning of a given text in order to answer generic questions, especially those that require reasoning skills. In particular, we first encode the given document, question and options in a context-aware way. We then propose a new network that solves the inference problem by decomposing it into a series of attention-based reasoning steps, where the result of the previous step acts as the context of the next. To ensure that each step can be directly inferred from the text, we design an operational cell with a prior structure. By recursively linking the cells, the inferred results are synthesized into an evidence chain for reasoning, and the reasoning direction can be guided by imposing structural constraints that regulate the interactions among the cells. Moreover, a termination mechanism is introduced to dynamically determine the uncertain reasoning depth, and the network is trained by reinforcement learning. Experimental results on 3 popular data sets, including MCTest, RACE and MultiRC, demonstrate the effectiveness of our approach.


Introduction
Machine comprehension is one of the hottest research topics in natural language processing. It measures a machine's ability to understand the semantics of a given document by answering questions related to that document. Toward this task, many datasets and corresponding methods have been proposed. In most of these datasets, such as CNN/Daily Mail (Hermann et al., 2015) and SQuAD (Rajpurkar et al., 2016), the answer is often a single entity or a text span in the document. As a result, many questions can be solved trivially via word and context matching (Trischler et al., 2016a) instead of genuine comprehension of the text. To alleviate this issue, several datasets have been released, such as MCTest (Richardson et al., 2013), RACE (Lai et al., 2017) and MultiRC (Khashabi et al., 2018), in which the answers are not restricted to text spans in the document; instead, they can be described in any words. Notably, a significant proportion of their questions require reasoning, a sophisticated comprehension ability, to choose the right answers. As shown in Figure 1 (a sample question that requires reasoning skill; the correct answer is marked with an asterisk), the question asks for the reason behind the phenomenon in sentence S5. The answer has to be deduced over the logical relations among sentences S3, S4 and S5, and then entailed from S3 to the correct option B. Crucially, such a deduction chain is not explicitly given but is expressed in the text semantics. Existing methods primarily focus on document-question interaction to capture the context similarity for answer-span matching. They have minimal capability to synthesize supporting facts scattered across multiple sentences into the evidence chain that is crucial for reasoning.
Mainstream methods that support inference can be grouped into three directions. The first converts the unstructured document into formal predicate expressions, on which mathematical deduction is performed via Bayesian networks or first-order logic; however, the conversion lacks adequate robustness to be widely applicable. The second explicitly parses the document into a relation tree, on which answers are generated via hand-crafted rules (Sun et al., 2018b); however, the parser often has to be cascaded with the model, which is difficult to train globally and suffers from error propagation. The third exploits memory networks to imitate reasoning through a multi-layer architecture and an iterative attention mechanism (Weston et al., 2014); nevertheless, the reasoning ability is insufficient due to the lack of prior structural knowledge to guide the inference direction.
We observe that when humans answer an inferential question, they often carefully analyze the question details and comprehend contextual relations to derive an evidence chain step by step. Using the sample in Figure 1 for illustration, humans first investigate the question to find useful details, such as the question type "why" and the aspect asked about, i.e., "some newspapers refused delivery to distant suburbs". Such details often play a critical role in answering. For example, a why question usually expects a causal relation that can indicate the reasoning direction. Based on the question details, they then carefully read the document to identify the content that the question aspect mentions, that is, sentence S5. From that content, they deduce new supporting evidence step by step, guided by the question type and contextual relations, such as the explanatory relation between S5 and S4, and the causal relation between S3 and S4. By considering the options, they decide to stop when the observed information is already adequate to answer the question. For instance, through paraphrase, S3 can entail option B, which is likely the answer. In this process, contextual relations and multi-step deduction are efficient mechanisms for deriving the evidence chain.
Based on the above observations, we propose an end-to-end approach that mimics this human process of deducing the evidence chain. In particular, we first encode the given document, question and options by considering contextual information. We then tackle the inference problem with a novel network that consists of a set of operational cells. Each cell is designed with a structural prior to capture the inner working procedure of an elementary reasoning step, so that the step can be directly inferred from the text without strong supervision. The cell includes a memory and three operating units that work in tandem: the master unit derives a series of attention-based operations from the question; the reader unit extracts document content relevant to the operation; and the writer unit performs the operation to deduce a result and update the memory. The cells are recursively connected, where the result of the previous step acts as the context of the next. The interactions of the cells are restricted by structural constraints, so as to regulate the reasoning direction. With such a structural multi-step design, the network can integrate supporting facts through contextual relations to build an evidence chain in arbitrarily complex acyclic form. Since the reasoning depth is uncertain, a termination mechanism is exploited to adaptively determine the ending. Moreover, a reinforcement learning approach is employed for effective training. Experiments are conducted on 3 popular data sets that contain questions requiring reasoning skills, including MCTest, RACE and MultiRC. The results show the effectiveness of our approach.
The main contributions of this paper include:
• We design a new network that answers inferential questions by recursively deducing the evidence chain from the text.
• We propose an effective termination mechanism that can dynamically determine the uncertain reasoning depth.
• We employ a reinforcement training approach and conduct extensive experiments.
The rest of this paper is organized as follows. Section 2 elaborates our approach and the inferential framework. Section 3 presents the experimental results. Section 4 reviews related work and Section 5 concludes this paper with future work.

Approach
As shown in Figure 2, our approach consists of three components: input representation, an inferential network composed of multiple cells, and output. Next, we define some notations, and then elaborate the details of each component.

Notations and Problem Formulation
Given a document D in unstructured text, the task of machine comprehension is to answer questions according to the semantics of the document. In this paper, multi-choice questions are our major focus. Thus, a set of plausible answer options is assumed to be provided, and the task reduces to selecting the correct option from the given set. Formally, let q = {w_1, ..., w_S} represent the question of length S, where the w_s are the question words, and let O = {o_1, ..., o_L} denote the option set. For a given document and question x = (D, q), a score h(x, y) ∈ R is assigned to each candidate y = o ∈ O, measuring its probability of being the correct answer. The option with the highest score is output as the answer: y* = argmax_{y∈O} h(x, y).
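The selection rule above is just an argmax over per-option scores. A minimal sketch follows, where the toy dictionary of scores is purely illustrative and stands in for the learned scoring function h:

```python
def select_answer(score_fn, x, options):
    # y* = argmax_{y in O} h(x, y): pick the option with the highest score
    return max(options, key=lambda o: score_fn(x, o))

# hypothetical toy scorer standing in for the learned h(x, y)
toy_scores = {"A": 0.10, "B": 0.70, "C": 0.15, "D": 0.05}
answer = select_answer(lambda x, o: toy_scores[o], None, ["A", "B", "C", "D"])
print(answer)  # -> B
```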

Input Representation
We first encode the input text into distributed vector representations by taking account the context.
Question: Encoding proceeds in two stages. (1) We convert the question into a sequence of word embeddings by looking up pre-trained vectors such as GloVe (Pennington et al., 2014). Since the question type can help inference, we additionally customize embeddings to indicate that type via linguistic prior knowledge: the positions of interrogative words are often relatively fixed, and their parts of speech (POS) are mainly adverbs or conjunctions, etc. In practice, we utilize position and POS embeddings (Li et al., 2018b) generated by a word embedding tool; that is, each embedding layer is a matrix W ∈ R^{d×v}, where d is the dimension and v is the number of instances. (2) We concatenate the embeddings of word, position and POS, and feed them into a bi-directional GRU (Bi-GRU) (Cho et al., 2014) to incorporate sequential context. This yields two kinds of representations: (a) contextual words, a series of output states cw_s |_{s=1}^{S} that represent each word in the context of the question, where each cw_s concatenates the s-th hidden states of the backward and forward GRU passes; and (b) the question vector q, the concatenation of the final hidden states.
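The concatenation in stage (1)-(2) can be sketched as follows. The dimensions and the lazily-built lookup tables are assumptions for illustration, not the paper's actual configuration:

```python
import random

random.seed(0)
D = 4  # per-embedding dimension (assumed; the paper uses larger sizes)

word_tab, pos_tab, tag_tab = {}, {}, {}

def lookup(table, key, d=D):
    # return (and cache) a d-dimensional embedding for `key`,
    # standing in for a column of a learned matrix W in R^{d x v}
    if key not in table:
        table[key] = [random.uniform(-0.01, 0.01) for _ in range(d)]
    return table[key]

def encode_token(word, position, pos_tag):
    # concatenate word, position, and POS embeddings before the Bi-GRU
    return (lookup(word_tab, word)
            + lookup(pos_tab, position)
            + lookup(tag_tab, pos_tag))

vec = encode_token("why", 0, "WRB")
```

Repeated lookups of the same token reuse the cached vectors, as a real embedding table would.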
Options: Each option word is embedded using the pre-trained vectors, and each option is then contextually encoded by a Bi-GRU to generate an overall vector.
Document: Encoding proceeds in three steps. (1) We encode each document sentence in context via a Bi-GRU as above, transforming sentence i into an n_i × d matrix, where n_i is the number of words in the sentence and d is the dimension. (2) We apply attention to compress each sentence encoding into a fixed-size representation focused on its important components. Intuitively, a long sentence may contain multiple significant parts, each of which can help inference. For example, two clauses are linked by "or" with a causal relation in the sentence "The oil spill must be stopped or it will spread for miles." The clauses and their contextual relation can assist in answering the question "Why must the oil spill be stopped?" To model such situations, we utilize the structured self-attention technique proposed by Lin et al. (2017), which converts the sentence into a J × d matrix attending to J significant parts of the sentence in a context-aware way.
(3) All sentence matrices are fed into another Bi-GRU to capture the context between sentences. This yields D ∈ R^{H×J×d} = {ds_{h,j} |_{h=1,j=1}^{H,J}}, where H is the number of sentences and ds_{h,j} is the vector of the j-th part of sentence h.
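Step (2) compresses a variable-length sentence into J attended parts. A pure-Python sketch of that attention is given below; in the real model the score matrix comes from learned weights, so here it is simply passed in as an argument:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_parts(tokens, scores):
    # tokens: n x d token encodings of one sentence
    # scores: J x n unnormalized attention scores (learned in the real model)
    # returns a J x d matrix: each row is a convex combination of token vectors
    out = []
    for row in scores:
        a = softmax(row)
        d = len(tokens[0])
        out.append([sum(a[i] * tokens[i][k] for i in range(len(tokens)))
                    for k in range(d)])
    return out

sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                        # n=3, d=2
parts = attend_parts(sent, [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])   # J=2
```

Each of the J rows concentrates on a different part of the sentence, matching the J × d output described above.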

Micro-Infer Cell
Micro-infer is a recurrent cell designed to model the mechanism of an atomic reasoning step. The cell consists of one memory unit and three operational units: the master unit, the reader unit and the writer unit. The memory independently stores the intermediate results. In particular, the master unit analyzes the question details to focus on a certain aspect via self-attention; the reader unit then extracts related content, guided by the question aspect and text context; and the writer unit iteratively integrates that content with the preceding results from the memory to produce a new intermediate result. The interactions between the cell's units are regulated by structural constraints. Specifically, the master outcome can only indirectly guide the integration of relevant content into the memory state through soft-attention maps and gating mechanisms. Moreover, a termination gate is introduced to adaptively determine the ending of the inference. In the following, we detail the formal specifications of the three operational units in the cell.

Master Unit
As presented in Figure 3, this unit consists of two components: the termination mechanism and question analysis.
Termination Mechanism A maximum step count is set to guarantee termination. Since questions differ in complexity, the reasoning depth is uncertain. To adjust to this depth dynamically, a termination gate is designed around two conditions: the correlation between the intermediate result m_{t−1} and the reasoning operation a_{t−1} of the previous step, and the correlation between m_{t−1} and the candidate answer options o_l |_{l=1}^{L}. When both conditions are met, an acceptable answer is highly probable. Technically, the correlations are calculated by Eq.(1), i.e., m_{t−1} ∘ a_{t−1} and m_{t−1} ∘ o_l, respectively. We then combine these two factors to get ta_{t,l}, and use a sigmoid layer to estimate the ending probability for each option. By maximizing over all the options, a termination function f_ts(·; θ_ts) is obtained. Based on this function, a binary random variable t_t is drawn as t_t ∼ p(·|f_ts(·; θ_ts)). If t_t is True, we stop and execute the answer module; otherwise, we continue with the t-th reasoning step.
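A hedged sketch of this termination decision is shown below. It assumes the correlations are element-wise products and that a single linear-plus-sigmoid layer scores each option; the weights here are random placeholders, not learned parameters:

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stop_probability(m_prev, a_prev, options, w, b):
    # ta_{t,l} combines m_{t-1} * a_{t-1} and m_{t-1} * o_l (element-wise);
    # a sigmoid layer scores each option, and we maximize over options
    ma = [mi * ai for mi, ai in zip(m_prev, a_prev)]
    best = 0.0
    for o in options:
        mo = [mi * oi for mi, oi in zip(m_prev, o)]
        feats = ma + mo
        z = sum(wi * fi for wi, fi in zip(w, feats)) + b
        best = max(best, sigmoid(z))
    return best

d = 3
m = [random.random() for _ in range(d)]
a = [random.random() for _ in range(d)]
opts = [[random.random() for _ in range(d)] for _ in range(4)]
w = [random.random() for _ in range(2 * d)]
p_stop = stop_probability(m, a, opts, w, 0.0)
stop = random.random() < p_stop  # draw t_t ~ Bernoulli(p_stop)
```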
Question Analysis We design a soft-attention based mechanism to analyze the question and determine the basic operation performed at each step. Instead of grasping the complex meaning of the whole question at once, the model is encouraged to focus on one question aspect at a time, so that the reasoning operation can be directly inferred from the text. Three stages are performed as follows.
Firstly, we project the question q through a learned linear transformation to derive the aspect related to the t-th reasoning step: q_t = W_{qt}^{d×d} q + b_{qt}^d. Secondly, we use the previously performed operation a_{t−1} and memory result m_{t−1} as the decision base for the t-th reasoning operation. In detail, we validate the previous reasoning result by leveraging the termination conditions in Eq.(1) to obtain a validation vector pa_t. We then integrate q_t with the preceding operation a_{t−1} and the validation pa_t through a linear transformation into aq_t = W_{aq}^{d×3d} [q_t, a_{t−1}, pa_t] + b_{aq}^d. Thirdly, aq_t is regulated by casting it back onto the original question words cw_s |_{s=1}^{S} via the attention in Eq.(2), so as to restrict the space of valid reasoning operations and boost the convergence rate. In particular, we calculate the correlation ac_{t,s} and pass it through a softmax layer to yield a distribution av_{t,s} over the question words. By aggregation, a new reasoning operation a_t is generated, represented in terms of the question words.
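The third stage can be sketched in a few lines. A dot product stands in for the learned correlation scoring of Eq.(2), which is an assumption of this sketch:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def master_attention(aq_t, cw):
    # ac_{t,s}: correlation of the candidate operation with each question word
    # (a dot product here; the model uses a learned scoring layer)
    ac = [sum(a * c for a, c in zip(aq_t, w)) for w in cw]
    av = softmax(ac)  # av_{t,s}: distribution over question words
    d = len(aq_t)
    # a_t: new operation expressed in terms of the question words
    return [sum(av[s] * cw[s][k] for s in range(len(cw))) for k in range(d)]

cw = [[2.0, 0.0], [0.0, 2.0]]   # two contextual question words, d=2
a_t = master_attention([5.0, 0.0], cw)
```

Because a_t is a convex combination of the cw_s, the operation stays anchored to the question words, which is exactly the regulation described above.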
Briefly, the new reasoning operation a_t is modeled by a function f_na(q, a_{t−1}, cw_s; θ_na), where θ_na is the set of parameters introduced above, including W_{qt}^{d×d}, b_{qt}^d, W_{aq}^{d×3d} and b_{aq}^d.

Reader Unit
As shown in Figure 4 (flow chart of the reader unit), the reader unit retrieves the relevant document content required for performing the t-th reasoning operation. The relevance is measured over the content context in a soft-attention manner, taking account of the current reasoning operation and the prior memory. We do not rely on external tools, which facilitates global training.
To support transitive reasoning, we first extract the document content relevant to the preceding reasoning result m_{t−1}, yielding dm_{t,h,j}. This relevance often indicates a contextual relation in the distributed space. For instance, given the question aspect why, contents with a causal relation are highly expected and their relevance scores are likely to be large.
Then, dm_{t,h,j} is independently incorporated with the document content ds_{h,j} to produce dn_{t,h,j}. This allows us to also consider new information that is not directly related to the prior intermediate result, so as to assist parallel and inductive reasoning.
Lastly, we use soft attention to select content that is relevant to the reasoning operation a_t and the candidate options o_l |_{l=1}^{L}. Precisely, we unify a_t and o_l |_{l=1}^{L} by a linear transformation to obtain oa_t = W_{oa}^{d×Ld} [a_t, o_l |_{l=1}^{L}] + b_{oa}^d, where the option count L is fixed and predefined. We then measure the correlation between oa_t and the extracted content dn_{t,h,j}, passing the result through a softmax layer to produce an attention distribution. By taking a weighted average over the distribution, we retrieve the related content ri_t via Eq.(3).
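The final retrieval step can be sketched as follows, with the H × J content vectors flattened into a list and a dot product assumed for the correlation with oa_t:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def retrieve(oa_t, dn):
    # dn: flattened list of the H*J extracted content vectors dn_{t,h,j};
    # score each against the unified operation/option vector oa_t,
    # then take the attention-weighted average to get ri_t
    scores = [sum(q * v for q, v in zip(oa_t, vec)) for vec in dn]
    att = softmax(scores)
    d = len(oa_t)
    return [sum(att[i] * dn[i][k] for i in range(len(dn))) for k in range(d)]

dn = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ri_t = retrieve([4.0, 0.0], dn)
```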
In short, the retrieved content ri_t is formulated by a function f_ri(m_{t−1}, ds_{h,j}, a_t, o_l |_{l=1}^{L}; θ_ri), where θ_ri is a parameter set involving W_{oa}^{d×Ld} and b_{oa}^d, among others.

Writer Unit
As illustrated in Figure 5, the writer unit is responsible for computing the intermediate result of the t-th reasoning step. (1) Motivated by work on relational reasoning (Santoro et al., 2017), we linearly incorporate the retrieved content ri_t, the prior result m_{t−1}, and the question q to get mc_t = W_{mc}^{d×3d} [ri_t, m_{t−1}, q] + b_{mc}^d, so as to measure their correlations.
(2) To support non-sequential reasoning, such as tree- or graph-style reasoning, we refer to all previously memorized results instead of just the preceding one m_{t−1}. Motivated by work on scalable memory networks (Miller et al., 2016), we compute the attention of the current operation a_t against all previous ones a_i |_{i=1}^{t−1}, yielding sa_{t,i} = softmax(W_{sa}^{1×d} [a_t ∘ a_i] + b_{sa}^1). We then average over the previous results m_i |_{i=1}^{t−1} to get the preceding relevant support mp_t = Σ_{i=1}^{t−1} sa_{t,i} · m_i. By combining mp_t with the correlated result mc_t above, we obtain a plausible result mu_t = W_{mp}^{d×d} mp_t + W_{mc}^{d×d} mc_t + b_{mu}^d.
(3) Operations on some question aspects, such as why, need multi-step reasoning and updating, while others do not. To regulate the valid reasoning space, an update gate is introduced to determine whether to refresh the previous result m_{t−1} in the memory with the new plausible result mu_t. The gate α_t is conditioned on the operation a_t through a learned linear transformation and a sigmoid function. If the gate is open, the unit writes the new result to the memory; otherwise, it skips this operation and performs the next one.
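Steps (2) and (3) can be sketched together. The sketch makes several simplifying assumptions: dot-product attention over previous operations, identity maps standing in for W_mp and W_mc, and a soft (interpolating) version of the update gate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def writer_step(a_t, prev_ops, prev_mems, mc_t, w_gate, b_gate):
    # sa_{t,i}: attend the current operation against all previous ones
    sa = softmax([sum(a * b for a, b in zip(a_t, a_i)) for a_i in prev_ops])
    d = len(a_t)
    # mp_t: support from previous memories, enabling non-sequential reasoning
    mp = [sum(sa[i] * prev_mems[i][k] for i in range(len(prev_mems)))
          for k in range(d)]
    # mu_t: plausible new result (identity maps stand in for W_mp, W_mc)
    mu = [mp[k] + mc_t[k] for k in range(d)]
    # update gate alpha_t, conditioned on the operation a_t
    alpha = sigmoid(sum(w * a for w, a in zip(w_gate, a_t)) + b_gate)
    m_prev = prev_mems[-1]
    # gated update: keep m_{t-1} when the gate is (nearly) closed
    return [alpha * mu[k] + (1 - alpha) * m_prev[k] for k in range(d)]

m_t = writer_step([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                  [[0.2, 0.2], [0.4, 0.4]], [0.1, 0.1], [10.0, 0.0], 0.0)
```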
In brief, the new reasoning result m_t is modeled by a function f_nm(m_{t−1}, ri_t, q, a_t; θ_nm), where θ_nm is a parameter set including (W_{mc}^{d×3d}, b_{mc}^d, W_{sa}^{1×d}, b_{sa}^1, W_{mp}^{d×d}, W_{mc}^{d×d}, b_{mu}^d, W_{a}^{1×d}, b_{a}^1).

Output and Training
After the termination condition is met, we obtain the memory state m_{t−1}, which holds the final intermediate result of the reasoning process. For the multi-choice questions focused on in this paper, there is a fixed set of possible answers. We therefore leverage a classifier to predict the answer by referring to the question q and the options o_l |_{l=1}^{L}. Precisely, we first measure the correlations of m_{t−1} against q and o_l |_{l=1}^{L}, obtaining m_{t−1} ∘ q and m_{t−1} ∘ o_l. After concatenation, we pass the outcome through a 2-layer fully-connected softmax network to derive the answer option via Eq.(5), with a ReLU activation function to alleviate over-fitting. In summary, the parameter set θ_ans comprises the weights and biases of this 2-layer network.

Reinforcement Learning
Due to the discreteness of the termination steps, the proposed network cannot be directly optimized by back-propagation. To facilitate training, a reinforcement learning approach is used that views the inference operations as policies, including the reasoning operation flow G_{1:T}, the termination decision flow t_{1:T} and the answer prediction A_T, where T is the reasoning depth. Given the i-th training instance (q_i, D_i, o_i), the expected reward r is defined to be 1 if the predicted answer is correct, and 0 otherwise. The rewards on intermediate steps are 0, i.e., {r_t = 0} |_{t=1}^{T−1}. Each possible value triple (G, t, A) corresponds to an episode, and the set of all possible episodes is denoted A†. Let J(θ) = E_π [Σ_{t=1}^{T} r_t] be the total expected reward, where π(G, t, A; θ) is a policy parameterized by the network parameters θ, involving the encoding matrices θ_W, the question network θ_na, the termination gate θ_ts, the reader network θ_ri, the writer network θ_nm, and the answer network θ_ans. To maximize the reward J, we employ gradient descent optimization with Monte-Carlo REINFORCE (Williams, 1992) estimation, as in Eq.(6).
where b is a critic value function. It is usually set to Σ_{(G,t,A)} π(G, t, A; θ) r (Shen et al., 2016), and (r/b − 1) is often used instead of (r − b) to achieve stability and boost the convergence speed.
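A sketch of the resulting gradient estimate with the scaled advantage (r/b − 1) is shown below. The episodes and per-episode log-probability gradients are toy stand-ins; the sample mean serves as the baseline estimate, which is an assumption of this sketch:

```python
def reinforce_gradient(episodes):
    # episodes: list of (log_prob_grad, reward) pairs sampled from the policy
    rewards = [r for _, r in episodes]
    b = sum(rewards) / len(rewards)  # baseline: estimated expected reward
    if b == 0.0:
        return [0.0 for _ in episodes[0][0]]  # no reward signal in this batch
    d = len(episodes[0][0])
    grad = [0.0] * d
    for g, r in episodes:
        adv = r / b - 1.0  # (r/b - 1) scaled advantage for stability
        for k in range(d):
            grad[k] += adv * g[k]
    return [gk / len(episodes) for gk in grad]

# two toy episodes: gradient of log pi(episode), terminal reward r in {0, 1}
grads = reinforce_gradient([([1.0, -1.0], 1.0), ([-1.0, 1.0], 0.0)])
```

Episodes with above-baseline reward push the policy toward their actions, and below-baseline episodes push away, which is the intended REINFORCE behavior.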

Evaluations
In this section, we extensively evaluate the effectiveness of our approach, including comparisons with state-of-the-arts, and components analysis.

Data and Experimental Setting
As shown in Table 1, experiments were conducted on 3 popular data sets in 9 domains, including MCTest, RACE and MultiRC. Unlike synthetic data sets such as bAbI, the questions in the evaluated data sets are of high quality and reflect real-world applications. Hyper-parameters were set as follows. For question encoding, POS tags were obtained with the OpenNLP toolkit. Multiple cells were connected to form the network, with weights shared across cells; the maximum number of connected cells was 16. The network was optimized via Adam (Kingma and Ba, 2014) with a learning rate of 10^{−4} and a batch size of 64. We used gradient clipping with a clipnorm of 8, and employed early stopping based on validation accuracy. For word embedding, we leveraged 300-dimensional pre-trained word vectors from GloVe; the word embeddings were initialized randomly using a standard uniform distribution and were not updated during training, and out-of-vocabulary words were initialized with zero vectors. The number of hidden units in the GRU was set to 256; the recurrent weights were initialized with random orthogonal matrices, and the other GRU weights were initialized from a uniform distribution between −0.01 and 0.01. We maintained exponential moving averages of the model weights with a decay rate of 0.999 and used them at test time instead of the raw weights. Variational dropout of 0.15 was applied across the network, and the maximum reasoning step was set to 5. Training usually converged within 30 epochs.
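The exponential moving average over model weights mentioned above follows the standard update rule, sketched here for a flat weight vector:

```python
def ema_update(shadow, weights, decay=0.999):
    # shadow_k <- decay * shadow_k + (1 - decay) * w_k, applied after each step;
    # at test time the shadow weights replace the raw weights
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, weights)]

shadow = [0.0, 0.0]
for _ in range(3):
    shadow = ema_update(shadow, [1.0, 2.0])
```

For constant weights w, n updates from a zero shadow give w · (1 − decay^n), so the average warms up slowly toward the raw weights.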

Comparisons with the State-of-the-Arts
We compared our approach with all baselines published at the time of submission on the evaluated data sets. The baselines are summarized as follows. (1) On the RACE data set, six baselines were employed: three introduced in the release of the data set, namely Sliding Window (Richardson et al., 2013), Stanford AR, and GA (Dhingra et al., 2016); and three more recent methods, namely DFN (Xu et al., 2017), BiAttention 250d MRU (Tay et al., 2018), and OFT (Radford et al., 2018).
(2) For the MCTest data set, nine baselines were investigated: four based on lexical matching, i.e., RTE, SWD, RTE+SWD (Richardson et al., 2013) and Linguistic (Smith et al., 2015); two using hidden alignment, i.e., Discourse (Narasimhan and Barzilay, 2015) and Syntax (Wang et al., 2015); and three based on deep learning, i.e., EK, PH (Trischler et al., 2016b), and HV (Li et al., 2018a). (3) For the multi-choice questions in the MultiRC data set, we replace softmax with sigmoid at the answer generation layer, so as to make a prediction for each option. Accordingly, five baselines were used: three from the release of the data set, i.e., IR, SurfaceLR, and LR (Khashabi et al., 2018); and two recently proposed methods, namely OFT (Radford et al., 2018) and Strategies (Sun et al., 2018a).
As shown in Figure 6 (comparisons of our approach against the state of the arts on the RACE, MCTest, and MultiRC data sets; statistically significant with p-values < 0.01 using a two-tailed paired test), our approach outperformed all the baselines on all three data sets 1 . Specifically, on the RACE data set, our approach achieved the best performance, outperforming the second best (i.e., OFT) in average accuracy by over 4.12% and 5.00% on RACE-M and RACE-H, respectively. On the MCTest data set, the margins were 5.55% and 7.14% over the PH baseline, the second best, on MC160-multi and MC500-multi, respectively, where multi is the more difficult subset of the data set that requires understanding multiple sentences to answer. On the MultiRC data set, our approach improved upon the second best (i.e., Strategies) by over 4.06% in macro-average F1, and by over 5.20% and 6.64% in micro-average F1 and exact match accuracy, respectively. These results show that our approach, with its structural multi-step design and context-aware inference, can correctly answer the questions, especially the non-trivial ones requiring reasoning, and thus boost the overall performance.

Ablations Studies
To gain better insight into the relative contributions of the various components of our approach, empirical ablation studies were performed on seven aspects: (1) position and POS aware embedding of the question; (2) structural self-attention in document encoding; (3) two in the master unit, namely guiding the reasoning operation by the previous memory result, and casting back to the original question words; (4) extracting relevant content based on the preceding memory result in the reader unit; and (5) two in the writer unit, namely the non-sequential reasoning and update gate mechanisms. They are denoted as pos aware, doc self att, rsn prior mem, que w reg, prior mem res, non seq rsn, and udt gate, respectively. As displayed in Figure 7 (ablation studies on how various components of our approach affect performance), ablating any of the evaluated components led to a performance drop. The drop was more than 10% for four components: (1) rsn prior mem; lacking the memory guidance, the inferred result from the previous step cannot serve as context for the next. Losing such valuable context may lead to misalignment of the reasoning chain.
(2) prior mem res; discarding the preceding memory result, the relevant content with contextual relations cannot be identified. Such relations are the key to transitive reasoning. (3) que w reg; without casting back to the original question words, the model effectively processes the complex question in one step without identifying the details. Such coarse-grained processing fails to effectively regulate the space of valid reasoning operations, and may confuse the reasoning direction. (4) udt gate; the gate helps balance complex and simple questions, and reduces long-range dependencies in the reasoning process by skipping, which improves performance. These results further convinced us of the significant value of imposing strong structural priors to help the network derive the evidence chain from text.
Furthermore, we evaluated the efficiency of the termination mechanism by replacing it with fixed step counts from 1 up to 5. The results on the RACE and MCTest data sets showed that the replacement led to a drop in average accuracy and a slowdown in convergence. As demonstrated in Figure 8 (evaluation of the termination mechanism), with fixed-size reasoning, more steps performed well at first but soon deteriorated, while the dynamic strategy can adaptively determine the optimal termination point, which may help boost accuracy.

Case Study
Owing to the use of soft attention, the proposed network offers a traceable reasoning path that can interpret the generation of the answer based on the attended words. To better understand the reasoning behavior, we plotted the attention maps over the document, question and options in Figure 9 with respect to the sample in Figure 1. From the sequence of maps, we observed that the network adaptively decided which part of the input question should be analyzed at each hop. For example, it first focused on the question aspect "some newspapers refused delivery to distant suburbs." It then generated evidence attended at S5 with regard to the focused aspect by similarity. Subsequently, the aspect "why" was focused on and evidence attended at S4 was identified. We may infer that since S4 and the previous intermediate result S5 share an explanatory relation, they would most likely be correlated in the distributed space under sentence-level context-aware encoding. Later, "why" was re-focused and the evidence attended at S3 was derived. Finally, option B was attended to and the process ended, likely because the termination unit was triggered. These results show that the network can derive the answer by capturing the underlying semantics of the question and sequentially traversing the relations in the document based on context.

Related Work
Earlier studies on machine comprehension mainly focused on text-span selection questions. The task is often transformed into a similarity matching problem and solved by feature-engineering-based methods (Smith et al., 2015) or deep neural networks. Classical features include lexical features (e.g., word overlap, N-grams, POS tagging) (Richardson et al., 2013), syntactic features (Wang et al., 2015), discourse features (Narasimhan and Barzilay, 2015), etc. Typical networks include Stanford AR, AS Reader (Kadlec et al., 2016), BiDAF (Seo et al., 2016), Match-LSTM (Wang and Jiang, 2017), etc., which use distributed vectors rather than discrete features to better compute contextual similarity.
To support inference, existing models can be classified into three categories: predicate-based methods (Richardson and Domingos, 2006); rule-based methods relying on an external parser (Sun et al., 2018b) or a pre-built tree (Yu et al., 2012); and multi-layer memory networks (Hill et al., 2015), such as the gated-attention network (Dhingra et al., 2016) and the double-sided attention network (Cui et al., 2016). These models either lack an end-to-end design for global training or lack the prior structure to subtly guide the reasoning direction. On the topic of multi-hop reasoning, current models often rely on a predefined graph constructed by external tools, such as the interpretable network over a knowledge graph (Zhou et al., 2018). The graph plainly links the facts, from which the intermediate result of the next hop can be directly derived. In this paper, however, the evidence graph is not explicitly given but embodied in the text semantics.
Another line of related work is Visual QA, which aims to answer compositional questions about a given image, such as "What color is the matte thing to the right of the sphere in front of the tiny blue block?" In particular, Santoro et al. (2017) proposed a relation net, yet it was restricted to relational questions, such as comparisons. Later, Hudson and Manning (2018) introduced an iterative network that separated memory and control to improve interpretability. Our work leverages this separated design. Different from previous research, we are dedicated to inferential machine comprehension, where the question may not be compositional, such as a why question, but requires reasoning over an unknown evidence chain of uncertain depth. The chain has to be inferred from the text semantics. To the best of our knowledge, no previous studies have investigated an end-to-end approach to address this problem.

Conclusions and Future Works
We have proposed a network to answer generic questions, especially those that require reasoning. We decomposed the inference problem into a series of atomic steps, each executed by an operational cell designed with a prior structure. Multiple cells were recursively linked to produce an evidence chain in a multi-hop manner. In addition, a termination gate was presented to dynamically determine the uncertain reasoning depth, and a reinforcement learning method was used to train the network. Experiments on 3 popular data sets demonstrated the effectiveness of the approach. The approach currently applies mainly to multiple-choice questions. In the future, we will extend it to support text-span selection questions by using the relation type rather than the option as the termination condition. For example, given a why question, the reasoning process should stop when an unrelated relation, such as a transitional relation, is met.