Discern: Discourse-Aware Entailment Reasoning Network for Conversational Machine Reading

Document interpretation and dialog understanding are the two major challenges for conversational machine reading. In this work, we propose Discern, a discourse-aware entailment reasoning network to strengthen the connection and enhance the understanding for both document and dialog. Specifically, we split the document into clause-like elementary discourse units (EDUs) using a pre-trained discourse segmentation model, and we train our model in a weakly-supervised manner to predict whether each EDU is entailed by the user feedback in a conversation. Based on the learned EDU and entailment representations, we either reply to the user with our final decision "yes/no/irrelevant" on the initial question, or generate a follow-up question to inquire for more information. Our experiments on the ShARC benchmark (blind, held-out test set) show that Discern achieves state-of-the-art results of 78.3% macro-averaged accuracy on decision making and 64.0 BLEU1 on follow-up question generation. Code and models are released at https://github.com/Yifan-Gao/Discern.


Introduction
Conversational Machine Reading (CMR) is challenging because the rule text may not contain the literal answer, but instead provides a procedure to derive it through interactions (Saeidi et al., 2018). In this case, the machine needs to read the rule text, interpret the user scenario, clarify the user's unknown background by asking questions, and derive the final answer. Taking Figure 1 as an example, to tell the user whether he is suitable for the loan program, the machine needs to interpret the rule text to know what the requirements are, understand from the user scenario that he meets "American small business", ask follow-up clarification questions about "for-profit business" and "not get financing from other resources", and finally conclude the answer "Yes" to the user's initial question.

Rule Text: 7(a) loans are the most basic and most used type loan of the Small Business Administration's (SBA) business loan programs. It's name comes from section 7(a) of the Small Business Act, which authorizes the agency to provide business loans to American small businesses. The loan program is designed to assist for-profit businesses that are not able to get other financing from other resources.
User Scenario: I am a 34 year old man from the United States who owns their own business. We are an American small business.
User Question: Is the 7(a) loan program for me?
Follow-up Q1: Are you a for-profit business?
Follow-up A1: Yes.
Follow-up Q2: Are you able to get financing from other resources?
Follow-up A2: No.
Final Answer: Yes. (You can apply the loan.)
Figure 1: An example dialog from the ShARC (Saeidi et al., 2018) dataset. The machine answers the user question by reading the rule text, interpreting the user scenario, and keeping asking follow-up questions to clarify the user's background until it concludes a final answer. Requirements in the rule text are bold.
Existing approaches (Zhong and Zettlemoyer, 2019; Sharma et al., 2019; Gao et al., 2020) decompose this problem into two sub-tasks. Given the rule text, user question, user scenario, and dialog history (if any), the first sub-task is to make a decision among "Yes", "No", "Inquire" and "Irrelevant". "Yes/No" directly answers the user question, and "Irrelevant" means the user question is unanswerable by the rule text. If the user-provided information (user scenario, previous dialogs) is not enough to determine the user's fulfillment or eligibility, an "Inquire" decision is made and the second sub-task is activated. The second sub-task is to capture the underspecified condition from the rule text and generate a follow-up question to clarify it. Zhong and Zettlemoyer (2019) adopt BERT (Devlin et al., 2019) to reason out the decision, and propose an entailment-driven extracting and editing framework to extract a span from the rule text and edit it into the follow-up question. The current state-of-the-art model EMT (Gao et al., 2020) uses a Recurrent Entity Network (Henaff et al., 2017) with explicit memory to track the fulfillment of rules at each dialog turn for decision making and question generation.
In this problem, document interpretation requires identification of conditions and determination of logical structures because rules can appear in the format of bullet points, in-line conditions, conjunctions, disjunctions, etc. Hence, correctly interpreting rules is the first step towards decision making. Another challenge is dialog understanding. The model needs to evaluate the user's fulfillment over the conditions, and jointly consider the fulfillment states and the logical structure of rules for decision making. For example, disjunctions and conjunctions of conditions have completely different requirements over the user's fulfillment states. However, existing methods have not considered condition-level understanding and reasoning.
In this work, we propose DISCERN: Discourse-Aware Entailment Reasoning Network. To better understand the logical structure of a rule text and to extract conditions from it, we first segment the rule text into clause-like elementary discourse units (EDUs) using a pre-trained discourse segmentation model (Li et al., 2018). Each EDU is treated as a condition of the rule text, and our model estimates its entailment confidence scores over three states: ENTAILMENT, CONTRADICTION or NEUTRAL by reading the user scenario description and existing dialog. Then we map the scores to an entailment vector for each condition, and reason out the decision based on the entailment vectors and the logical structure of rules. Compared to previous methods that do little entailment reasoning (Zhong and Zettlemoyer, 2019) or use it as multi-task learning (Gao et al., 2020), DISCERN is the first method to explicitly build the dependency between entailment states and decisions at each dialog turn.
DISCERN achieves new state-of-the-art results on the blind, held-out test set of ShARC (Saeidi et al., 2018). In particular, DISCERN outperforms the previous best model EMT (Gao et al., 2020) by 3.8% in micro-averaged decision accuracy and 3.5% in macro-averaged decision accuracy. Specifically, DISCERN performs well on simple in-line conditions and conjunctions of rules while still needing improvement on understanding disjunctions. Finally, we conduct comprehensive analyses to unveil the limitations of DISCERN and the current challenges for the ShARC benchmark. We find that one of the biggest bottlenecks is user scenario interpretation, in which various types of reasoning are required.

DISCERN Model
DISCERN answers the user question through a three-step process shown in Figure 2:
1. First, DISCERN segments the rule text into individual conditions using discourse segmentation.
2. Taking the user-provided information including the user question, user scenario and dialog history as inputs, DISCERN predicts the entailment state and maps it to an entailment vector for each segmented condition. Then it reasons out the decision by considering the logical structure of the rule text and the fulfillment of each condition.
3. Finally, if the decision is "Inquire", DISCERN generates a follow-up question to clarify the underspecified condition in the rule text.

Rule Segmentation
The goal of rule segmentation is to understand the logical structure of the rule text and parse it into individual conditions for the ease of entailment reasoning. Ideally, each segmented unit should contain at most one condition. Otherwise, it will be ambiguous to determine the entailment state for that unit. Determining conditions is easy when they appear as bullet points, but in most cases (65% of the samples in the ShARC dataset), one rule sentence may contain several in-line conditions as exemplified in Figure 2. To extract these in-line conditions, we find discourse segmentation in discourse parsing to be useful. In the Rhetorical Structure Theory or RST (Mann and Thompson, 1988) of discourse parsing, texts are first split into a sequence of clause-like units called elementary discourse units (EDUs). We utilize an off-the-shelf discourse segmenter (Li et al., 2018) to break the rule text into a sequence of EDUs. The segmenter uses a pointer network and achieves 92.2% F-score with GloVe vectors and 95.55% F-score with ELMo embeddings on the standard RST benchmark test set, which is close to the human agreement of 98.3% F-score (Joty et al., 2015; Lin et al., 2019b).

Rule Text: If a worker has taken more leave than they're entitled to, their employer must not take money from their final pay unless it's been agreed beforehand in writing.
Discourse Segmentation: [If a worker has taken more leave than they're entitled to,] EDU1 [their employer must not take money from their final pay] EDU2 [unless it's been agreed beforehand in writing.] EDU3
Figure 2: The overall diagram of our proposed DISCERN. DISCERN first segments the rule text into several elementary discourse units (EDUs) as conditions (Section 2.1). Then, taking the segmented conditions, user question, user scenario, and dialog history as inputs, DISCERN reasons out the decision among "Yes", "No", "Irrelevant" and "Inquire" (Section 2.2). If the decision is "Inquire", the question generation model asks a follow-up question (Section 2.3). (Best viewed in color)

As exemplified in Step 1 of Figure 2, the rule sentence is broken into three EDUs, in which two conditions ("If a worker has taken more leave than they're entitled to", "unless it's been agreed beforehand in writing") and the outcome ("their employer must not take money from their final pay") are split out precisely. For rule texts which contain bullet points, we directly treat these bullet points as conditions.
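The routing between bullet points and in-line conditions can be sketched as follows. This is a minimal illustration, not the released implementation: `extract_conditions` and the `segment_edus` callable are hypothetical names, the bullet markers (`*`, `-`, `•`) are an assumption about how bullets are written, and sending non-bullet lines through the segmenter is likewise an assumption. The actual segmenter of Li et al. (2018) would be plugged in as `segment_edus`.

```python
import re
from typing import Callable, List

# Assumption: a bullet item is a line that starts with "*", "-" or "•".
BULLET = re.compile(r"^\s*[*\-\u2022]\s+(.*\S)\s*$")


def extract_conditions(rule_text: str,
                       segment_edus: Callable[[str], List[str]]) -> List[str]:
    """Return the list of conditions for one rule text.

    Bullet items are taken directly as conditions; every other line is
    passed to `segment_edus`, a stand-in for a discourse segmenter that
    maps a string to a list of EDU strings.
    """
    conditions: List[str] = []
    for line in rule_text.splitlines():
        line = line.strip()
        if not line:
            continue
        bullet = BULLET.match(line)
        if bullet:
            conditions.append(bullet.group(1))
        else:
            conditions.extend(segment_edus(line))
    return conditions
```

For a quick check without a real segmenter, an identity function such as `lambda s: [s]` can be passed as `segment_edus`, in which case non-bullet lines are kept whole.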

Decision Making via Entailment Reasoning
Encoding. As shown in Step 2 of Figure 2, inputs to DISCERN include the segmented conditions (EDUs) in the rule text, user question, user scenario, and follow-up question-answer pairs in dialog history, each of which is a sequence of tokens. In order to get the sentence-level representations for all individual sequences, we insert an external [CLS] symbol at the start of each sequence, and add a [SEP] symbol at the end of every type of input. Then, DISCERN concatenates all sequences together, and uses RoBERTa to encode the concatenated sequence. The encoded [CLS] token represents the sequence that follows it. In this way, we extract sentence-level representations of conditions (EDUs) as $e_1, e_2, \dots, e_N$, and also the representations of the user question $u_Q$, user scenario $u_S$, and $M$ turns of dialog history $u_1, \dots, u_M$. All these vectorized representations are of $d$ dimensions (768 for RoBERTa-base).
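To make the input layout concrete, the following sketch builds the concatenated token sequence and records where each [CLS] marker sits, since those positions are where the sentence-level representations would be read off after encoding. It is only an illustration under simplifying assumptions: tokens are obtained by whitespace splitting rather than by the RoBERTa subword tokenizer, the literal strings `[CLS]` and `[SEP]` stand in for the encoder's own special tokens, and `build_input` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

CLS, SEP = "[CLS]", "[SEP]"  # placeholders for the encoder's special tokens


def build_input(edus: Sequence[str],
                question: str,
                scenario: str,
                history: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[int]]:
    """Concatenate all inputs and return (tokens, cls_positions).

    One [CLS] precedes every sequence; one [SEP] closes each input type
    (conditions, question, scenario, dialog history).
    """
    groups = [
        list(edus),
        [question],
        [scenario],
        [f"{q} {a}" for q, a in history],  # one sequence per dialog turn
    ]
    tokens: List[str] = []
    cls_positions: List[int] = []
    for group in groups:
        if not group:  # e.g. no dialog history yet
            continue
        for sequence in group:
            cls_positions.append(len(tokens))
            tokens.append(CLS)
            tokens.extend(sequence.split())  # assumption: whitespace tokens
        tokens.append(SEP)
    return tokens, cls_positions
```

With N conditions and M dialog turns, `cls_positions` has N + 2 + M entries, in the order conditions, question, scenario, history.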
Entailment Prediction. In order to reason out the correct decision for the user question, it is necessary to figure out the fulfillment of conditions in the rule text. We propose to formulate the fulfillment prediction of conditions as a multi-sentence entailment task. Given a sequence of conditions (premises) and a sequence of user-provided information (hypotheses), a system should output ENTAILMENT, CONTRADICTION or NEUTRAL for each condition listed in the rule text. In this context, NEUTRAL indicates that the condition has not been mentioned in the user information.
We utilize an inter-sentence transformer encoder (Vaswani et al., 2017) to predict the entailment states for all conditions simultaneously. Taking all sentence-level representations $[e_1; e_2; \dots; e_N; u_Q; u_S; u_1; \dots; u_M]$ as inputs, the $L$-layer transformer encoder makes each condition attend to all the user-provided information to predict whether the condition is entailed or not. We also allow all conditions to attend to each other to understand the logical structure of the rule text.
Let $\tilde{e}_i$ denote the transformer encoder output of the $i$-th condition. We use a linear transformation, with learnable parameters $W_c$ and $b_c$, to predict its entailment state:

$$c_i = W_c \tilde{e}_i + b_c,$$

where $c_i = [c_{E,i}, c_{C,i}, c_{N,i}] \in \mathbb{R}^3$ contains the confidence scores of the three entailment states ENTAILMENT, CONTRADICTION, and NEUTRAL for the $i$-th condition in the rule text.
Since there are no ground truth entailment labels for individual conditions, we adopt a heuristic approach similar to Gao et al. (2020) to get noisy supervision signals. Given the rule text, we first collect all associated follow-up questions in the dataset. Each follow-up question is matched to the segmented condition (EDU) in the rule text which has the minimum edit distance. For conditions in the rule text which are mentioned by follow-up questions in the dialog history, we label the entailment state of a condition as Entailment if the answer to its follow-up question is Yes, and as Contradiction if the answer is No. The remaining conditions not covered by any follow-up question are labeled as Neutral. Let $r$ indicate the correct entailment state. The entailment prediction is weakly supervised by the following cross-entropy loss, normalized by the total number $K$ of conditions in a batch:

$$\mathcal{L}_{entail} = -\frac{1}{K} \sum_{i=1}^{K} \log \mathrm{softmax}(c_i)_r$$

Decision Making. After knowing the entailment state for each condition in the rule text, the remaining challenge for decision making is to perform logical reasoning over different rule types such as disjunction, conjunction, and conjunction of disjunctions. To achieve this, we first design three entailment vectors $V_E$ (Entailment), $V_C$ (Contradiction), and $V_N$ (Neutral), and map the predicted entailment confidence scores of each condition to its vectorized entailment representation:

$$V_{EDU,i} = \sum_{k \in \{E, C, N\}} c_{k,i} V_k$$

These entailment vectors are randomly initialized and then learned during training. Finally, DISCERN jointly considers the logical structure of rules $\tilde{e}_i$ and the entailment representations $V_{EDU,i}$ of conditions to make a decision:

$$\alpha_i = w_\alpha^\top [V_{EDU,i}; \tilde{e}_i] + b_\alpha$$
$$\tilde{\alpha}_i = \mathrm{softmax}(\alpha)_i$$
$$g = \sum_i \tilde{\alpha}_i [V_{EDU,i}; \tilde{e}_i]$$
$$z = W_z g + b_z$$

where $[V_{EDU,i}; \tilde{e}_i]$ denotes the vector concatenation, and $\alpha_i$ is the attention weight for the $i$-th condition that determines whether the $i$-th condition should be taken into consideration for the final decision. $z \in \mathbb{R}^4$ contains the predicted scores for all four possible decisions "Yes", "No", "Inquire" and "Irrelevant".
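The decision step can be illustrated with a small, dependency-free sketch: each condition's three confidence scores weight the three learned state vectors, the result is concatenated with the condition representation, attention pools the conditions into one vector, and a linear layer yields the four decision scores. All names here (`decide`, `state_vectors`, `w_alpha`, `W_z`, and so on) are hypothetical, the parameters are passed in as plain lists rather than learned, and using the raw confidence scores as mixing weights is an assumption of this sketch.

```python
import math
from typing import List, Sequence, Tuple

DECISIONS = ["Yes", "No", "Inquire", "Irrelevant"]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def decide(cond_scores: Sequence[Sequence[float]],    # per condition: [c_E, c_C, c_N]
           cond_reprs: Sequence[Sequence[float]],     # per condition: encoder output
           state_vectors: Sequence[Sequence[float]],  # [V_E, V_C, V_N]
           w_alpha: Sequence[float], b_alpha: float,
           W_z: Sequence[Sequence[float]], b_z: Sequence[float]
           ) -> Tuple[List[float], List[float]]:
    """Return (decision scores z, attention weights over conditions)."""
    dim = len(state_vectors[0])
    feats = []
    for c, e in zip(cond_scores, cond_reprs):
        # Entailment representation: score-weighted sum of the state vectors.
        v_edu = [sum(c[k] * state_vectors[k][j] for k in range(3))
                 for j in range(dim)]
        feats.append(v_edu + list(e))  # concatenation [V_EDU,i ; e_i]
    # Attention over conditions, then pooling into a single vector g.
    alpha = softmax([dot(w_alpha, f) + b_alpha for f in feats])
    g = [sum(a * f[j] for a, f in zip(alpha, feats))
         for j in range(len(feats[0]))]
    # Linear layer producing one score per decision.
    z = [dot(row, g) + b for row, b in zip(W_z, b_z)]
    return z, alpha
```

The predicted decision is then `DECISIONS[max(range(4), key=z.__getitem__)]`.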
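Let $l$ indicate the correct decision. $z$ is supervised by the following cross-entropy loss:

$$\mathcal{L}_{dec} = -\log \mathrm{softmax}(z)_l$$

The overall loss for the Step 2 decision making is the weighted sum of the decision loss and the entailment prediction loss, with $\lambda$ as the weight of the entailment prediction loss:

$$\mathcal{L} = \mathcal{L}_{dec} + \lambda \mathcal{L}_{entail}$$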
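The noisy entailment labels that feed the entailment prediction loss could be derived along the following lines. This is a simplified sketch of the heuristic described in this section, not the authors' code: it matches only the follow-up questions in the current dialog history, uses character-level Levenshtein distance on lowercased text (the granularity of the edit distance is an assumption), and `weak_entailment_labels` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def weak_entailment_labels(edus: Sequence[str],
                           history: Sequence[Tuple[str, str]]) -> List[str]:
    """Assign a noisy entailment label to every condition (EDU).

    Each follow-up question is matched to the EDU with minimum edit
    distance; a "Yes" answer marks it ENTAILMENT, a "No" answer
    CONTRADICTION. Unmatched EDUs stay NEUTRAL.
    """
    labels = ["NEUTRAL"] * len(edus)
    for question, answer in history:
        idx = min(range(len(edus)),
                  key=lambda i: edit_distance(question.lower(), edus[i].lower()))
        answer = answer.strip().lower()
        if answer == "yes":
            labels[idx] = "ENTAILMENT"
        elif answer == "no":
            labels[idx] = "CONTRADICTION"
    return labels
```

Because the matching is purely lexical, the resulting labels are noisy, which is why the supervision is described as weak.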

Follow-up Question Generation
If the predicted decision is "Inquire", the follow-up question generation model is activated, as shown in Step 3 of Figure 2. It extracts an underspecified span from the rule text which is not covered by the user's feedback, and rephrases it into a well-formed question. Existing approaches put considerable effort into extracting the underspecified span, such as entailment-driven extracting and ranking (Zhong and Zettlemoyer, 2019) or coarse-to-fine reasoning (Gao et al., 2020). However, we find that such sophisticated modeling may not be necessary, and we propose a simple but effective approach here. We split the rule text into sentences and concatenate the rule sentences and user-provided information into a sequence. Then we use RoBERTa to encode them into vectors grounded to tokens, as here we want to predict the position of a span within the rule text. Let $[t_{1,1}, \dots, t_{1,s_1}; t_{2,1}, \dots, t_{2,s_2}; \dots; t_{N,1}, \dots, t_{N,s_N}]$ be the encoded vectors for tokens from $N$ rule sentences. We follow the BERTQA approach (Devlin et al., 2019) to learn a start vector $w_s \in \mathbb{R}^d$ and an end vector $w_e \in \mathbb{R}^d$ to locate the start and end positions, under the restriction that the start and end positions must belong to the same rule sentence:

$$(i, j, k) = \arg\max_{i, j, k} \left( w_s^\top t_{k,i} + w_e^\top t_{k,j} \right)$$

where $i, j$ denote the start and end positions of the selected span, and $k$ is the sentence which the span belongs to. The training objective is the sum of the log-likelihoods of the correct start and end positions. To supervise the span extraction process, the noisy supervision of spans is generated by selecting the span which has the minimum edit distance with the to-be-asked question. Lastly, following Gao et al. (2020), we concatenate the rule text and span as the input sequence, and finetune UniLM (Dong et al., 2019), a pre-trained language model, to rephrase it into a question.
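The same-sentence restriction on span selection can be made concrete with a short sketch. It assumes the per-token start and end scores (the dot products with the start and end vectors) have already been computed and grouped by rule sentence. The function name `best_span` is hypothetical, and requiring the start not to come after the end is a common convention assumed here rather than something stated in the text.

```python
from typing import List, Optional, Sequence, Tuple


def best_span(start_scores: Sequence[Sequence[float]],
              end_scores: Sequence[Sequence[float]]) -> Tuple[int, int, int]:
    """Return (k, i, j): sentence index, start position, end position.

    start_scores[k][i] and end_scores[k][j] are the start and end scores
    of tokens i and j in rule sentence k. The span must lie inside one
    sentence, with i <= j (assumed).
    """
    best: Optional[Tuple[float, int, int, int]] = None
    for k, (s_row, e_row) in enumerate(zip(start_scores, end_scores)):
        best_i = 0  # best start seen so far within sentence k
        for j in range(len(e_row)):
            if s_row[j] > s_row[best_i]:
                best_i = j
            score = s_row[best_i] + e_row[j]
            if best is None or score > best[0]:
                best = (score, k, best_i, j)
    if best is None:
        raise ValueError("no tokens to select a span from")
    return best[1], best[2], best[3]
```

Keeping a running best start per sentence makes the search linear in the number of tokens, and it never pairs a start in one sentence with an end in another, even when that pair would have the highest unconstrained score.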

Experimental Setup
Dataset. The ShARC dataset (Saeidi et al., 2018) is the current benchmark to test entailment reasoning in conversational machine reading. The dataset contains 948 rule texts crawled from 10 government websites, of which 65% are plain text with in-line conditions while the remaining 35% contain bullet-point conditions. Each rule text is associated with a dialog tree (follow-up QAs) that considers all possible fulfillment combinations of conditions. In the data annotation stage, parts of the dialogs are paraphrased into the user scenario. These parts of dialogs are marked as evidence which should be extracted (entailed) from the user scenario, and are not provided as inputs for evaluation. The inputs to the system are the rule text, user question, user scenario, and dialog history (if any). The output is the answer among Yes, No, Irrelevant, or a follow-up question. The train, development, and test dataset sizes are 21890, 2270, and 8276, respectively.
Evaluation Metrics. The decision making sub-task uses macro- and micro-accuracy of four classes "Yes", "No", "Irrelevant", "Inquire" as metrics. For the question generation sub-task, we evaluate models under both the official end-to-end setting (Saeidi et al., 2018) and the recently proposed oracle setting (Gao et al., 2020). In the official setting, the BLEU score (Papineni et al., 2002) is calculated only when both the ground truth decision and the predicted decision are "Inquire", which makes the score dependent on the model's "Inquire" predictions. For the oracle question generation setting, models are asked to generate a question when the ground truth decision is "Inquire".
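The two accuracy metrics and the official filtering of question pairs can be sketched as below. This is not the official evaluation script: it assumes macro-accuracy is the unweighted mean of per-class accuracies over the gold classes that occur, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def micro_macro_accuracy(gold: Sequence[str],
                         pred: Sequence[str]) -> Tuple[float, float]:
    """Micro-accuracy: fraction of correct predictions overall.
    Macro-accuracy (assumed): mean of the per-class accuracies."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    micro = sum(correct.values()) / len(gold)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


def official_bleu_pairs(gold_decisions: Sequence[str],
                        pred_decisions: Sequence[str],
                        gold_questions: Sequence[str],
                        pred_questions: Sequence[str]) -> List[Tuple[str, str]]:
    """Keep only (reference, hypothesis) question pairs where both the
    gold and the predicted decision are "Inquire", as in the official
    end-to-end setting."""
    return [(gq, pq)
            for gd, pd, gq, pq in zip(gold_decisions, pred_decisions,
                                      gold_questions, pred_questions)
            if gd == "Inquire" and pd == "Inquire"]
```

A model that always predicts the majority class shows why both numbers are reported: with three gold "Yes" and one gold "No", predicting "Yes" everywhere gives a micro-accuracy of 0.75 but a macro-accuracy of 0.5.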
Implementation Details. For the decision making sub-task, we finetune the RoBERTa-base model (Wolf et al., 2019) with the Adam optimizer (Kingma and Ba, 2015) for 5 epochs with a learning rate of 5e-5, a warm-up rate of 0.1, a batch size of 16, and a dropout rate of 0.35. The number of inter-sentence transformer layers L and the loss weight λ for entailment prediction are hyperparameters. We try 1, 2, 3 for L and 1.0, 2.0, 3.0, 4.0, 5.0 for λ, and find the best combination is L = 2, λ = 3.0, based on the development set results. For the question generation sub-task, we train a RoBERTa-base model to extract spans under the same training scheme as above, and finetune UniLM (Dong et al., 2019) for 20 epochs for question rephrasing with a batch size of 16, a learning rate of 2e-5, and a beam size of 10 for decoding in the inference stage. We repeat all experiments on the development set 5 times with different random seeds and report the average results along with their standard deviations. Training takes two hours on a 4-core server with an Nvidia GeForce GTX Titan X GPU.
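The hyperparameter search over L and λ amounts to a small exhaustive grid, which can be sketched as follows. The `dev_score` callable is hypothetical: it stands for training a model with the given setting and returning its development-set score, which is not reproduced here.

```python
from itertools import product
from typing import Callable, Tuple

LAYER_CHOICES = [1, 2, 3]                    # candidate values for L
LAMBDA_CHOICES = [1.0, 2.0, 3.0, 4.0, 5.0]   # candidate values for the loss weight


def grid_search(dev_score: Callable[[int, float], float]) -> Tuple[int, float]:
    """Return the (L, lambda) pair with the highest development score.

    `dev_score(L, lam)` is a stand-in for a full train-and-evaluate run.
    """
    return max(product(LAYER_CHOICES, LAMBDA_CHOICES),
               key=lambda cfg: dev_score(*cfg))
```

The grid contains 15 settings, so with five random seeds per setting the search implies 75 training runs if every setting is repeated.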

Results
Decision Making Sub-task. The decision making results in macro- and micro-accuracy on the blind, held-out test set of ShARC are shown in Table 1. DISCERN outperforms the previous best model EMT (Gao et al., 2020) by 3.8% in micro-averaged accuracy and 3.5% in macro-averaged accuracy. We further analyze the class-wise decision prediction accuracy on the development set of ShARC in Table 2, and find that DISCERN makes far better predictions than all existing approaches whenever a decision on the user's fulfillment is needed ("Yes", "No", "Inquire"). This is because the predicted decisions from DISCERN are made upon the predicted entailment states, while previous approaches do not build the connection between them.

Table 2: Class-wise decision prediction accuracy among "Yes", "No", "Inquire" and "Irrelevant" on the development set of ShARC.

Question Generation Sub-task. DISCERN outperforms existing methods under both the official end-to-end setting (Table 1) and the recently proposed oracle setting (Table 3). Because the comparison among models is only fair under the oracle question generation setting (Gao et al., 2020), we compare DISCERN with $E^3$ (Zhong and Zettlemoyer, 2019), $E^3$+UniLM (Gao et al., 2020), EMT (Gao et al., 2020), and our ablation DISCERN (BERT) in Table 3. Interestingly, we find that, in this oracle setting, our proposed simple approach is even better than previous sophisticated models such as $E^3$ and EMT, which jointly learn question generation and decision making via multi-task learning. From our results and investigations, we believe the decision making sub-task and the follow-up question generation sub-task do not share many commonalities, so multi-task training does not improve the results of either task. On the other hand, our question generation model is easy to optimize because it is trained separately from the decision making one, which means there is no need to balance the performance between these two sub-tasks.
Besides, the RoBERTa backbone performs comparably with its BERT counterpart. In our detailed analyses, we find DISCERN can locate the next questionable sentence with 77.2% accuracy, which means DISCERN utilizes the user scenario and dialog history well to locate the next underspecified condition. We try adding entailment prediction supervision to help DISCERN locate the unfulfilled condition, but it does not help. We also try to simplify our approach by directly finetuning UniLM to learn the mapping between concatenated input sequences and the follow-up clarification questions. However, the poor result (around 40 for BLEU1) suggests that this direction requires further investigation.

RoBERTa vs. BERT. DISCERN (BERT) replaces the RoBERTa backbone with BERT while the other modules remain the same. The better performance of the RoBERTa backbone matches findings from Talmor et al. (2019), which indicate that RoBERTa can capture negations and handle conjunctions of facts better than BERT.

Ablation Study
Discourse Segmentation vs. Sentence Splitting. DISCERN (w/o EDU) replaces the discourse segmentation based rule parsing with simple sentence splitting, and we observe a 1.63% drop in micro-accuracy. This is intuitive because 65% of the rule texts in the training set contain in-line conditions. To better understand the effect of discourse segmentation, we also [...]

Presumably, the condition representations account for the logical forms of rule texts and entailment vectors contain the fulfillment states for these conditions.

Analysis of Logical Structure of Rules
To see how DISCERN understands the logical structure of rules, we evaluate the decision making accuracy according to the logical types of rule texts.
Here we define four logical types: "Simple", "Conjunction", "Disjunction", "Other", which are inferred from the associated dialog trees. "Simple" means there is only one requirement in the rule text, while "Other" denotes that the rule text has a complex logical structure, for example, a conjunction of disjunctions or a disjunction of conjunctions. Table 5 shows decision prediction results categorized by different logical structures of rules. DISCERN achieves the best performance on the "Simple" logical type, which only requires determining whether the single condition is satisfied or not. On the other hand, DISCERN does not perform well on rules in the format of disjunctions. We conduct further analysis on this category and find that the error comes from user scenario interpretation: the user has already provided his fulfillment in the user scenario but DISCERN fails to extract it. Detailed analyses are further conducted in the following section.

How Far Has the Problem Been Solved?
In order to figure out the limitations of DISCERN, and the current challenges of ShARC CMR, we disentangle the challenges of scenario interpretation and dialog understanding in ShARC by selecting different subsets, and evaluate decision making and entailment prediction accuracy on them.
Baseline. Because the classification for unanswerable questions ("Irrelevant" class) is nearly solved (99.3% in [...] Table 6 ("Scenario Subset") show that interpreting scenarios to extract the entailment information within is exactly the current bottleneck of DISCERN. We analyze 100 error cases on this subset and find that various types of reasoning are required for scenario interpretation, including numerical reasoning (15%), temporal reasoning (12%), and implication over common sense and external knowledge (46%). Besides, DISCERN still fails to extract the user's fulfillment when the scenarios paraphrase the rule texts (27%). Examples for each type of error are shown in Figure 3. Among the three classes of entailment states, we find that DISCERN fails to predict ENTAILMENT or CONTRADICTION precisely: it predicts NEUTRAL in most cases for scenario interpretation, resulting in high micro-accuracy but poor macro-accuracy in entailment prediction. The decision accuracy is subsequently hurt by the entailment results.
ShARC (Evidence). Based on the above observation, we replace the user scenario in ShARC (Answerable) with its evidence and re-evaluate the overall performance on these answerable questions. As described in Section 3.1 Dataset, the evidence is the part of dialogs that should be entailed from the user scenario. Table 6 shows that the model improves by 11.38% in decision making micro-accuracy if no scenario interpretation is required, which validates our above observation.

Related Work
Entailment Reasoning in Reading Comprehension. Understanding entailments (or implications) of text is essential in dialog and question answering systems. ROPES (Lin et al., 2019a) requires reading descriptions of causes and effects and applying them to situated questions, while ShARC (Saeidi et al., 2018), the focus of DISCERN, requires understanding rules and applying them to questions asked by users in a conversational manner. Most existing methods simply use BERT to classify the answer without considering the structures of rule texts (Zhong and Zettlemoyer, 2019; Sharma et al., 2019; Lawrence et al., 2019). Gao et al. (2020) propose the Explicit Memory Tracker (EMT), which is the first to address entailment-oriented reasoning. At each dialog turn, EMT recurrently tracks whether conditions listed in the rule text have already been satisfied to make a decision.
In this paper, we also explicitly model entailment reasoning for decision making, but there are three key differences between our DISCERN and EMT: (1) we apply discourse segmentation to parse the rule text, which is extremely helpful because there are many in-line conditions in rules; (2) our stacked inter-sentence transformer layers extract better features for entailment prediction, which could be seen as a generalization of their recurrent explicit memory tracker; (3) different from their use of entailment prediction, which is treated as multi-task learning for decision making, we directly build the dependency between entailment prediction states and the predicted decisions.
Discourse Applications. Discourse analysis uncovers text-level linguistic structures (e.g., topic, coherence, co-reference), which can be useful for many downstream applications, such as coherent text generation (Bosselut et al., 2018) and text summarization (Joty et al., 2019; Cohan et al., 2018; Xu et al., 2020). Recently, discourse information has also been introduced in neural reading comprehension. Mihaylov and Frank (2019) design a discourse-aware semantic self-attention mechanism to supervise different heads of the transformer by discourse relations and coreferring mentions. Different from their use of discourse information, we use it as a parser to segment surface-level in-line conditions for entailment reasoning.

Conclusion
In this paper, we present DISCERN, a system that does discourse-aware entailment reasoning for conversational machine reading. DISCERN explicitly builds the connection between entailment states of conditions and the final decisions. Results on the ShARC benchmark show that DISCERN outperforms existing methods by a large margin. We also conduct comprehensive analyses to unveil the limitations of DISCERN and challenges for ShARC. In the future, we plan to explore how to incorporate discourse parsing into the current decision making model for end-to-end learning. One possibility would be to frame them as multi-task learning with a common (shared) encoder. Another direction is leveraging current methods in question generation (Li et al., 2019) to improve the follow-up question generation sub-task, since DISCERN is on par with the previous best model EMT.