Explicit Memory Tracker with Coarse-to-Fine Reasoning for Conversational Machine Reading

The goal of conversational machine reading is to answer user questions given a knowledge base text, which may require asking clarification questions. Existing approaches are limited in their decision making because they struggle to extract question-related rules and reason about them. In this paper, we present a new framework for conversational machine reading that comprises a novel Explicit Memory Tracker (EMT) to track whether conditions listed in the rule text have already been satisfied before making a decision. Moreover, our framework generates clarification questions by adopting a coarse-to-fine reasoning strategy, utilizing sentence-level entailment scores to weight token-level distributions. On the ShARC benchmark (blind, held-out) test set, EMT achieves new state-of-the-art results of 74.6% micro-averaged decision accuracy and 49.5 BLEU4. We also show that EMT is more interpretable by visualizing the entailment-oriented reasoning process as the conversation flows. Code and models are released at https://github.com/Yifan-Gao/explicit_memory_tracker.


Introduction
In conversational machine reading (CMR), machines can take the initiative to ask users questions that help solve their problems, instead of jumping to a conclusion hurriedly (Saeidi et al., 2018). In this case, machines need to understand the knowledge base (KB) text, evaluate and keep track of the user scenario, ask clarification questions, and then make a final decision. This interactive behavior between users and machines has gained more attention recently because, in practice, users are unaware of the KB text and thus cannot provide all the information needed in a single turn.

(* This work was mostly done when Yifan Gao was an intern at Salesforce Research Asia, Singapore.)

[Figure 1: An example dialog from the ShARC dataset (Saeidi et al., 2018). The rule texts shown are "Statutory Maternity Pay" ("To qualify for SMP you must: earn on average at least £113 a week; give the correct notice; give proof you're pregnant") and "Taking more leave than the entitlement" ("If a worker has taken more leave than they're entitled to, their employer must not take money from their final pay unless it's been agreed beforehand in writing. The rules in this situation should be outlined in the employment contract, company handbook or intranet site."). The user scenario is "I have questions regarding my employer …", the initial question is "Can my employer take money from my final pay?", and a follow-up question is "Did you take more leave than they're entitled to?". At each turn, given the rule text, a user scenario, an initial user question, and previous interactions, a machine can give a certain final answer such as Yes or No to the initial question. If the machine cannot give a certain answer because of missing information from the user, it asks a clarification question to fill in the information gap. Clarification questions and their corresponding rules are marked in the same colors.]
For instance, consider the example in Figure 1 taken from the ShARC dataset for CMR (Saeidi et al., 2018). A user posts her scenario and asks a question on whether her employer can take money from her final pay. Since she does not know the relevant rule text, the provided scenario and the initial question(s) from her are often too underspecified for a machine to make a certain decision. Therefore, a machine has to read the rule text and ask a series of clarification questions until it can conclude the conversation with a certain answer.
Most existing approaches (Zhong and Zettlemoyer, 2019; Sharma et al., 2019) formalize the CMR problem into two sub-tasks. The first is to make a decision among Yes, No, Irrelevant, and Inquire at each dialog turn, given a rule text, a user scenario, an initial question, and the current dialog history. Selecting Yes or No implies that a final decision can be made in response to the user's initial question, while Irrelevant means the initial question is unanswerable according to the rule text. If the decision at the current turn is Inquire, it triggers the second sub-task of follow-up question generation, which extracts an underspecified rule span from the rule text and generates a follow-up question accordingly.
However, there are two main drawbacks to the existing methods. First, with respect to reasoning over the rule text, existing methods do not explicitly track whether a condition listed in the rule has already been satisfied as the conversation flows, which would allow better decisions. Second, with respect to extracting question-related rules, current approaches struggle to extract the most relevant text span for generating the next question. For example, the state-of-the-art E3 model (Zhong and Zettlemoyer, 2019) achieves only 60.6% F1 for question-related span extraction.
To address these issues, we propose a new framework of conversational machine reading with a novel Explicit Memory Tracker (EMT), which explicitly tracks each rule sentence to make decisions and generate follow-up questions. Specifically, EMT first segments the rule text into rule sentences and allocates them into its memory. The initial question, user scenario, and dialog history are then fed into EMT sequentially to update each memory module separately. At each dialog turn, EMT predicts the entailment state (satisfied or not) of every rule sentence and makes a decision based on the current memory status. If the decision is Inquire, EMT extracts a rule span to generate a follow-up question by adopting a coarse-to-fine reasoning strategy (i.e., weighting token-level span distributions with sentence-level entailment scores). Compared to previous methods, which consider entailment-oriented reasoning only for decision making or only for follow-up question generation, EMT uses its updated memory modules for both tasks in a unified manner.
We compare EMT with the existing approaches on the ShARC dataset (Saeidi et al., 2018). Our results show that explicitly tracking rules with external memories boosts both the decision accuracy and the quality of generated follow-up questions. In particular, EMT outperforms the previous best model E3 by 1.3 points in macro-averaged decision accuracy and 10.8 points in BLEU4 for follow-up question generation. In addition to the performance improvement, EMT offers interpretability by explicitly tracking rules, which we visualize to show the entailment-oriented reasoning process of our model.

Method
As illustrated in Figure 2, our proposed method consists of the following four main modules.
(1) The Encoding module uses BERT (Devlin et al., 2019) to encode the concatenation of the rule text, initial question, scenario and dialog history into contextualized representations.
(2) The Explicit Memory Tracking module sequentially reads the initial question, user scenario, multi-turn dialog history, and updates the entailment state of each rule sentence.
(3) The Decision Making module performs entailment-oriented reasoning over the updated states of the rule sentences and makes a decision among Yes, No, Irrelevant, and Inquire. (4) If the decision is Inquire, the Question Generation module is activated. It reuses the updated states of the rule sentences to identify the underspecified rule sentence and extract the most informative span within it in a coarse-to-fine manner, then rephrases the extracted span into a well-formed follow-up question.

Encoding

We insert a [CLS] token before every sentence of the input (each rule sentence, the initial question, the user scenario, and each turn of the dialog history) and use BERT to encode the sequence into a sequence of vectors of the same length. We treat each [CLS] representation as the feature representation of the sentence that follows it. In this way, we obtain both a token-level and a sentence-level representation for each sentence. We denote the sentence-level representations of the rule sentences as k_1, ..., k_M and their token-level representations as [(u_{1,1}, ..., u_{1,n_1}), ..., (u_{M,1}, ..., u_{M,n_M})], where n_i is the number of tokens in rule sentence i. Similarly, we denote the sentence-level representations of the initial question, user scenario, and P turns of dialog history as s_Q, s_S, and s_1, ..., s_P, respectively. All these representations have d dimensions (768 for BERT-base).
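The input layout can be sketched as follows. This is an illustrative toy, not the actual BERT preprocessing code: the helper name build_input and the whitespace "tokenizer" are our own assumptions.

```python
# Illustrative sketch: a [CLS] token is prepended to every sentence so that
# the encoder's output at each [CLS] position can serve as that sentence's
# representation. Hypothetical helper; whitespace splitting stands in for
# BERT's real tokenizer.
CLS, SEP = "[CLS]", "[SEP]"

def build_input(rule_sentences, question, scenario, history):
    tokens, cls_positions = [], []
    for sent in rule_sentences + [question, scenario] + history:
        cls_positions.append(len(tokens))   # index of this sentence's [CLS]
        tokens.append(CLS)
        tokens.extend(sent.split())         # stand-in for BERT tokenization
    tokens.append(SEP)
    return tokens, cls_positions

toks, cls_pos = build_input(
    ["earn on average at least 113 a week", "give the correct notice"],
    "Can my employer take money from my final pay?",
    "I have questions regarding my employer",
    ["Did you take more leave than they're entitled to? Yes."],
)
# Sentence-level representations (k_1..k_M, s_Q, s_S, s_1..s_P) would be
# gathered at cls_pos from the encoder output; the remaining positions give
# the token-level representations u_{i,j}.
```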

Explicit Memory Tracking
Given the rule sentences k_1, ..., k_M and the user-provided information, including the initial question s_Q, scenario s_S, and P turns of dialog history s_1, ..., s_P, our goal is to find implications between the rule sentences and the user-provided information. Inspired by the Recurrent Entity Network (Henaff et al., 2017), which tracks the state of the world given a sequence of textual statements, we propose the Explicit Memory Tracker (EMT), a gated recurrent memory-augmented neural network that explicitly tracks the states of rule sentences by sequentially reading the user-provided information. As shown in Figure 2, EMT takes the rule sentences k_1, ..., k_M as keys and assigns a value state v_i to each key to store the most up-to-date entailment information (whether this rule has been entailed by the user-provided information). Each value state is initialized with its corresponding rule sentence: v_{i,0} = k_i. EMT then sequentially reads the user-provided information s_Q, s_S, s_1, ..., s_P. At time step t, the value state v_{i,t} for the i-th rule sentence is updated by incorporating the user-provided information s_t:

\tilde{v}_{i,t} = tanh(W_k k_i + W_v v_{i,t-1} + W_s s_t)    (1)
g_{i,t} = σ(s_t^T k_i + s_t^T v_{i,t-1})    (2)
v_{i,t} = v_{i,t-1} + g_{i,t} · \tilde{v}_{i,t}    (3)
v_{i,t} ← v_{i,t} / ||v_{i,t}||    (4)

where W_k, W_v, W_s ∈ R^{d×d}, σ is the sigmoid function, and · is the scalar product. As the user background input s_t may only be relevant to parts of the rule sentences, the gating function in Equation 2 matches s_t to the memory, and EMT updates each state v_{i,t} only to the extent permitted by the gate. Finally, the normalization in Equation 4 allows EMT to forget previous information when necessary. After EMT sequentially reads all the user-provided information (the initial question, scenario, and P turns of dialog history) and finishes entailment-oriented reasoning, the keys and final states of the rule sentences are denoted (k_1, v_1), ..., (k_M, v_M), which are used in the decision making module (Section 2.3) and the question generation module (Section 2.4).
The key difference between our Explicit Memory Tracker and the Recurrent Entity Network (REN) (Henaff et al., 2017) is that each key k_i in our case has an explicit meaning (the corresponding rule sentence) and thus changes with the rule text, whereas in REN the meaning of the keys is learned through training and fixed across all textual inputs. Moreover, the number of keys is dynamic in our case (it equals the number of sentences parsed from the rule text), while it is predefined in REN.
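The gated update of the value states can be sketched in a few lines of numpy. This is a minimal illustration under the assumption of a REN-style update (candidate, scalar gate, additive update, normalization), not the authors' released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emt_step(keys, values, s_t, W_k, W_v, W_s):
    """Read one user-provided vector s_t (shape (d,)) against M rule
    sentences with keys/values of shape (M, d); return updated values."""
    new_values = []
    for k_i, v_i in zip(keys, values):
        v_tilde = np.tanh(W_k @ k_i + W_v @ v_i + W_s @ s_t)  # candidate
        g = sigmoid(s_t @ k_i + s_t @ v_i)                    # scalar gate
        v_new = v_i + g * v_tilde                             # gated update
        v_new = v_new / np.linalg.norm(v_new)                 # normalize
        new_values.append(v_new)
    return np.stack(new_values)

d, M = 8, 3
rng = np.random.default_rng(0)
keys = rng.normal(size=(M, d))
values = keys.copy()                          # v_{i,0} = k_i
W_k, W_v, W_s = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
for s_t in rng.normal(size=(4, d)):           # e.g. s_Q, s_S, two history turns
    values = emt_step(keys, values, s_t, W_k, W_v, W_s)
```

Because the gate g is a scalar per rule sentence, an input s_t that is irrelevant to a given rule leaves that rule's memory slot almost untouched.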

Decision Making
Based on the most up-to-date key-value states (k_1, v_1), ..., (k_M, v_M) from the EMT, the decision making module predicts a decision among Yes, No, Irrelevant, and Inquire. First, we use self-attention to compute a summary vector c of the overall state:

α_i = softmax(w_α^T [k_i; v_i])    (5)
c = Σ_i α_i [k_i; v_i]    (6)

where [k_i; v_i] denotes the concatenation of the vectors k_i and v_i, w_α is a trainable vector, and α_i is the attention weight for rule sentence k_i, which determines the likelihood that k_i is entailed by the user-provided information.
The final decision is then made through a linear transformation of the summary vector c:

z = W_z c + b_z    (7)

where z ∈ R^4 contains the model's scores for the four possible classes. Let l indicate the correct decision; the decision making module is trained with the cross-entropy loss

L_dec = −log softmax(z)_l    (8)

In order to explicitly track whether a condition listed in the rule has already been satisfied, we add a subtask that predicts the entailment state of each rule sentence. The possible entailment labels are Entailment, Contradiction, and Unknown; details of acquiring these labels are described in Section 3.1. With this intermediate supervision, the model can make better decisions based on the correct entailment state of each rule sentence. The entailment prediction is made through a linear transformation of the most up-to-date key-value state [k_i; v_i] from the EMT module:

e_i = W_e [k_i; v_i] + b_e    (9)

where e_i ∈ R^3 contains the scores of the three entailment states [β_entailment,i, β_contradiction,i, β_unknown,i] for the i-th rule sentence. Let r_i indicate the correct entailment state; the entailment prediction subtask is trained with the cross-entropy loss, normalized by the number of rule sentences M:

L_entail = −(1/M) Σ_{i=1}^M log softmax(e_i)_{r_i}    (10)
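The decision and entailment heads can be sketched as below. This is an illustrative numpy version in which the parameter names (w_alpha, W_z, W_e) are our own and biases are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

DECISIONS = ["Yes", "No", "Irrelevant", "Inquire"]
ENTAIL = ["Entailment", "Contradiction", "Unknown"]

def decide(keys, values, w_alpha, W_z, W_e):
    kv = np.concatenate([keys, values], axis=1)   # [k_i; v_i], shape (M, 2d)
    alpha = softmax(kv @ w_alpha)                 # attention weights (M,)
    c = alpha @ kv                                # summary vector (2d,)
    z = W_z @ c                                   # decision scores (4,)
    e = kv @ W_e.T                                # entailment scores (M, 3)
    return z, e

d, M = 8, 3
rng = np.random.default_rng(1)
keys, values = rng.normal(size=(M, d)), rng.normal(size=(M, d))
z, e = decide(keys, values,
              rng.normal(size=2 * d),             # w_alpha
              rng.normal(size=(4, 2 * d)),        # W_z
              rng.normal(size=(3, 2 * d)))        # W_e
decision = DECISIONS[int(np.argmax(z))]
beta_unknown = e[:, ENTAIL.index("Unknown")]      # reused for span extraction
```

Note that the per-sentence Unknown scores are kept around: the question generation module below reuses them as its coarse sentence-selection signal.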

Follow-up Question Generation
When the decision making module predicts Inquire, a follow-up question is required to obtain further clarification from the user. In the same spirit as previous studies (Zhong and Zettlemoyer, 2019; Sharma et al., 2019), we decompose this problem into two stages. First, we extract a span inside the rule text that contains the underspecified user information (hereafter the underspecified span). Second, we rephrase the extracted underspecified span into a follow-up question. We propose a coarse-to-fine approach to extract the underspecified span in the first stage, and finetune the pretrained language model UniLM (Dong et al., 2019) for follow-up question rephrasing, as described below.
Coarse-to-Fine Reasoning for Underspecified Span Extraction. Zhong and Zettlemoyer (2019) find the underspecified span by extracting several candidate spans and retrieving the most likely one. The disadvantage of their approach is that extracting multiple rule spans is a challenging task, and errors propagate to the retrieval stage. Instead of extracting multiple spans from the rule text, we propose a coarse-to-fine reasoning approach to directly identify the underspecified span. For this, we reuse the Unknown scores β_unknown,i from the entailment prediction subtask (Eqn. 9) and normalize them over the rule sentences with a softmax to determine how likely the i-th rule sentence is to contain the underspecified span:

ζ_i = softmax(β_unknown)_i    (11)

Knowing how likely each rule sentence is to be underspecified greatly reduces the difficulty of extracting the underspecified span within it. We adopt a soft selection approach that modulates the span extraction scores (i.e., the scores for the start and end points of a span) by the rule sentence identification score ζ_i. Following the BERTQA approach (Devlin et al., 2019), we learn a start vector w_s ∈ R^d and an end vector w_e ∈ R^d to locate the start and end positions in the whole rule text. The score of the j-th word u_{i,j} in the i-th rule sentence being the start/end of the span is computed as a dot product between w_s/w_e and u_{i,j}, modulated by the rule sentence score ζ_i:

γ_{i,j} = (w_s^T u_{i,j}) ζ_i,    δ_{i,j} = (w_e^T u_{i,j}) ζ_i    (12)

We extract the span with the highest span score γ · δ under the restriction that the start and end positions must belong to the same rule sentence. Let s and e be the ground-truth start and end positions of the span. The underspecified span extraction loss is the pointing loss, computed only when the ground-truth decision is Inquire:

L_span,s = −1_{l=Inquire} log softmax(γ)_s    (13)
L_span,e = −1_{l=Inquire} log softmax(δ)_e    (14)

The overall loss is the weighted sum of the decision loss, entailment prediction loss, and span extraction loss:

L = L_dec + λ_1 L_entail + λ_2 (L_span,s + L_span,e)    (15)

where λ_1 and λ_2 are tunable hyperparameters.
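The coarse-to-fine extraction can be illustrated with a small numpy sketch. The brute-force enumeration of spans and the multiplicative span score are our simplifications for clarity, not necessarily what the released implementation does:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract_span(token_reps, beta_unknown, w_s, w_e):
    """token_reps: list of (n_i, d) arrays, one per rule sentence.
    beta_unknown: (M,) Unknown scores; w_s, w_e: (d,) start/end vectors.
    Returns (sentence index, start, end) of the best-scoring span."""
    zeta = softmax(beta_unknown)                 # coarse: sentence scores
    best_score, best_span = -np.inf, None
    for i, U in enumerate(token_reps):
        gamma = (U @ w_s) * zeta[i]              # fine: modulated start scores
        delta = (U @ w_e) * zeta[i]              # fine: modulated end scores
        for s in range(len(U)):                  # start/end in same sentence
            for t in range(s, len(U)):
                if gamma[s] * delta[t] > best_score:
                    best_score, best_span = gamma[s] * delta[t], (i, s, t)
    return best_span

d = 8
rng = np.random.default_rng(2)
reps = [rng.normal(size=(n, d)) for n in (5, 7, 4)]
span = extract_span(reps, rng.normal(size=3),
                    rng.normal(size=d), rng.normal(size=d))
```

The key design point is the soft modulation: sentences with low Unknown probability shrink their token scores, so the span search rarely leaves the most underspecified rule sentence, without any hard pre-selection step that could propagate errors.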
Question Rephrasing. The underspecified span extracted in the previous stage is fed into the question rephrasing model to generate a follow-up question. We finetune UniLM (Dong et al., 2019) for this purpose. UniLM is a pretrained language model that has demonstrated its effectiveness on both natural language understanding and generation tasks; in particular, it outperforms previous methods by a large margin on the SQuAD question generation task (Du and Cardie, 2018). As shown in Figure 2, UniLM takes the concatenation of the rule text and the extracted rule span as input, separated by the sentinel tokens [CLS] and [SEP], and generates the follow-up question.

Experimental Setup

Dataset. We conduct experiments on the ShARC dataset (Saeidi et al., 2018). It contains 948 dialog trees, which are flattened into 32,436 examples by considering all possible nodes in the trees. Each example is a quintuple of (rule text, initial question, user scenario, dialog history, decision), where the decision is either one of {Yes, No, Irrelevant} or a follow-up question. The train, development, and test set sizes are 21890, 2270, and 8276, respectively.

End-to-End Evaluation. The organizers of the ShARC competition evaluate model performance as an end-to-end task. They first evaluate micro- and macro-accuracy for the decision making task. If both the ground-truth decision and the predicted decision are Inquire, they evaluate the generated follow-up question using the BLEU score (Papineni et al., 2002). However, this way of evaluating follow-up questions has one issue: if two models make different Inquire predictions, the follow-up questions selected for evaluation differ, making the comparison unfair. For example, a model could classify only one example in the whole test set as Inquire and generate its follow-up question correctly, achieving a 100% BLEU score. Therefore, we also propose to evaluate follow-up question generation in an oracle evaluation setup, described below.
Oracle Question Generation Evaluation. In this evaluation, we ask the models to generate follow-up questions whenever the ground-truth decision is Inquire, and compute the BLEU score of the generated questions accordingly.

Data Augmentation. In the annotation process of the ShARC dataset, the scenario is manually constructed from a part of the dialog history, and that excerpt of the dialog is not shown as input to the model. This hidden excerpt (the evidence text) can therefore be restored as additional input, which we use to augment the training data.

Labeling Underspecified Spans. To supervise the process of coarse-to-fine reasoning, we follow Zhong and Zettlemoyer (2019) to label the rule spans. We first trim the follow-up questions in the conversation by removing the question words "do, does, did, is, was, are, have" and the question mark "?". For each trimmed question, we find the shortest span inside the rule text with the minimum edit distance to the trimmed question and treat it as the underspecified span.
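The span-labeling heuristic can be sketched as follows; token-level edit distance and tie-breaking by span length are our assumptions for illustration:

```python
# Hedged sketch of the labeling heuristic: trim question words, then find
# the shortest rule-text span with minimum edit distance to the trimmed
# question. Token-level Levenshtein distance is an assumption.
QUESTION_WORDS = {"do", "does", "did", "is", "was", "are", "have"}

def edit_distance(a, b):
    # standard Levenshtein distance over token lists, rolling 1-D table
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
            prev = cur
    return dp[-1]

def label_span(rule_tokens, question):
    """Return (start, end) token indices of the shortest rule-text span
    with minimum edit distance to the trimmed follow-up question."""
    q = [t for t in question.lower().rstrip("?").split()
         if t not in QUESTION_WORDS]
    best = (float("inf"), float("inf"), None)    # (distance, length, span)
    for s in range(len(rule_tokens)):
        for t in range(s, len(rule_tokens)):
            span = [w.lower() for w in rule_tokens[s:t + 1]]
            best = min(best, (edit_distance(span, q), t + 1 - s, (s, t)))
    return best[2]

rule = "you must give the correct notice".split()
span = label_span(rule, "Did you give the correct notice?")
```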
Acquiring Labels for Entailment. To supervise the subtask of entailment prediction, we use a heuristic to automatically label the entailment state of each rule sentence. For each rule sentence, we first check whether it contains the underspecified span of any question in the dialog history (and evidence text); if so, we use the corresponding Yes/No answer to label the rule sentence as Entailment/Contradiction. Rule sentences without any underspecified span are labeled Unknown.
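A minimal sketch of this heuristic; the span-to-answer mapping built from the dialog history is an assumed data structure for illustration:

```python
def entailment_labels(rule_spans, answered):
    """rule_spans: the labeled underspecified span (or None) per rule
    sentence; answered: maps a span to the Yes/No answer its follow-up
    question received in the dialog history (or evidence text)."""
    labels = []
    for span in rule_spans:
        ans = answered.get(span)
        if ans == "Yes":
            labels.append("Entailment")
        elif ans == "No":
            labels.append("Contradiction")
        else:
            labels.append("Unknown")   # span unanswered or sentence has no span
    return labels

labels = entailment_labels(
    ["taken more leave than they're entitled to", None],
    {"taken more leave than they're entitled to": "Yes"},
)
```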
Implementation Details. We tokenize all text inputs with spaCy (Honnibal and Montani, 2017). Models are finetuned with a batch size of 16 and a learning rate of 2e-5, and we use a beam size of 10 for inference.
To reduce the variance of our experimental results, all experiments reported on the development set are repeated 5 times with different random seeds. We report the average results along with their standard deviations.

Results
End-to-End Task. The end-to-end performance on the held-out test set is shown in Table 1. EMT outperforms the existing state-of-the-art model E3 on decision classification in both micro- and macro-accuracy. Although the BLEU scores are not directly comparable across models, EMT achieves competitive BLEU1 and BLEU4 scores on the examples for which it makes an Inquire decision. The results show that EMT is strong in both the decision making and follow-up question generation tasks.

Oracle Question Generation Task. To establish a concrete question generation evaluation, we conduct experiments on our proposed oracle question generation task. We compare EMT with E3 and an extension, E3+UniLM; implementations of other methods are not publicly available. E3+UniLM replaces the editor of E3 with our finetuned UniLM. The results on the development set and under 10-fold cross validation are shown in Table 3. First, E3+UniLM performs better than E3, validating the effectiveness of our follow-up question rephrasing module, the finetuned UniLM. More importantly, EMT consistently outperforms E3 and E3+UniLM on both the development set and the cross validation by a large margin. Although there is no ground-truth label for span extraction, we can infer from the question generation results that our coarse-to-fine reasoning approach extracts better spans than the extraction and retrieval modules of E3. This is because E3 propagates errors from its span extraction module to its span retrieval module, while our coarse-to-fine approach avoids this problem by weighting token-level span distributions with sentence-level entailment scores.

Ablation Study
We conduct an ablation study on the development set for both the end-to-end evaluation task and the oracle question generation task. We consider four ablations of our EMT model. The first, (1) EMT (w/o data aug.), trains the model on the original ShARC training set without any data augmented from the evidence; the remaining three are described in the bullets below. Results of the ablations are shown in Table 4, from which we make the following observations:
• With the help of data augmentation, EMT slightly boosts performance on the end-to-end task, especially for the question generation task, which originally has only 6804 training examples. The augmented training instances boost performance even though the augmentation method does not produce any new questions. This implies that the size of the ShARC dataset is a bottleneck for training effective end-to-end neural models.
• Without coarse-to-fine reasoning for span extraction, EMT (w/o c2f) drops by 1.53 BLEU4, which shows that coarse-to-fine reasoning is necessary for the question generation task. The reason is that, as a classification task, entailment state prediction can be trained reasonably well (80% macro accuracy) with a limited amount of data (6804 training examples). Therefore, the Unknown scores from entailment state prediction can guide span extraction via soft modulation (Equation 12). In contrast, a one-step span extraction method does not utilize the entailment states of the rule sentences from EMT, and thus does not learn to extract the underspecified part of the rule text.
• With the guidance of explicit entailment supervision, EMT outperforms EMT (w/o L entail ) by a large margin. Intuitively, knowing the entailment states of the rule sentences makes the decision making process easier for complex tasks that require logical reasoning on conjunctions of conditions or disjunctions of conditions. It also helps span extraction through the coarse-to-fine approach.
• Without the explicit memory tracker described in Section 2.2, EMT (w/o tracker) performs poorly on the decision making task. Although interactions between rule sentences and user information exist in the BERT-encoded representations through multi-head self-attention, they are not adequate for learning whether the conditions listed in the rule text have already been satisfied.

Interpretability
To get better insights into the underlying entailment-oriented reasoning process of EMT, we examine the entailment states of the rule sentences as the conversation flows. Two example cases are provided in Figure 3. Given a rule text containing several rule sentences (S1, S2, S3, ...), we show the transition of predicted entailment states [β entailment , β contradiction , β unknown ] over multiple turns in the dialogue.
Rules in Bullet Points. Figure 3(a) shows an example in which the rule text is expressed as a conjunction of four bullet-point conditions. At the first turn, EMT reads the scenario and the initial question, which only imply that the user's question is relevant to the rule text. Thus the entailment states of all rule sentences are Unknown, and EMT makes an Inquire decision and asks a question. Once a positive answer to the first follow-up question is received from the user, EMT transitions the entailment state of rule sentence S3 from Unknown to Entailment, but it still cannot conclude the dialog, so it asks a second follow-up question. The user's response to the second question is negative, which leads EMT to conclude with a final decision of No at the third turn.
Rules in Plain Text. Figure 3(b) presents a more challenging case where the rules are given in plain text. Here it is not possible to turn a whole sentence into a clarification question as EMT does in Figure 3(a). In this case, both the decision making module and the span extraction module contribute to helping the user: the span extraction module locates the correct span inside S2, and EMT concludes with the correct answer No after learning that the user does not fulfill the condition listed in S2.

Error Analysis
We analyze some errors of EMT predictions on the ShARC development set, as described below.
Decision Making Error. Out of 2270 examples in the development set, EMT produces incorrect decisions in 608 cases. We manually analyze 104 error cases. In 40 of these cases, EMT fails to derive the correct entailment states of the rule sentences, while in 23 cases the model predicts the correct entailment states but cannot make the correct decision based on them. These errors suggest that explicitly modeling the logical reasoning process is a promising direction. Another challenge is extracting useful information from the user scenarios: in 24 cases, the model fails to make the correct decision because it cannot infer the necessary user information from the scenario. Last but not least, parsing the rule text into rule sentences is also a challenge. As shown in Figure 3(b), plain rule text often contains complicated clauses for rule conditions, which are difficult to disentangle into separate conditions. In 17 cases, a single rule sentence contains multiple conditions, which makes the model fail to conduct entailment reasoning correctly.
Question Generation Error. Out of 562 question generation examples in the development set, EMT locates the underspecified span poorly in 115 cases (span extraction F1 score ≤ 0.5). We manually analyze 52 wrong question generation cases. In 29 of these cases, EMT fails to predict the correct entailment states of the rule sentences and thus does not locate the span within the ground-truth rule sentence, while in 9 cases it finds the correct rule sentence but extracts a different span. Another challenge comes from the one-to-many problem in sequence generation: when there are multiple underspecified rule sentences, the model may ask about one that differs from the ground-truth one. This suggests that new evaluation metrics could be designed to take this into account.

Related Work
ShARC Conversational Machine Reading (Saeidi et al., 2018) differs from conversational question answering (Choi et al., 2018; Reddy et al., 2019) and conversational question generation in that 1) machines are required to formulate follow-up questions to fill the information gap, and 2) machines have to interpret a set of complex decision rules and reach a question-related conclusion, instead of extracting the answer from the text. CMR can be viewed as a special type of task-oriented dialog system (Wen et al., 2017; Zhong et al., 2018; Wu et al., 2019) that helps users achieve their goals; however, it relies on natural language rules rather than predefined slot and ontology information. On the ShARC CMR challenge (Saeidi et al., 2018), Lawrence et al. (2019) propose an end-to-end bidirectional sequence generation approach with mixed decision making and question generation stages. Saeidi et al. (2018) split the task into sub-tasks and combine hand-designed sub-models for decision classification, entailment, and question generation. Zhong and Zettlemoyer (2019) propose to extract all possible rule text spans, assign each of them an entailment score, and edit the span with the highest score into a follow-up question; however, they do not use these entailment scores for decision making. Sharma et al. (2019) study patterns in the dataset and include additional embeddings from the dialog history and user scenario as rule markers to help decision making. Compared to these methods, our EMT has two key differences: (1) EMT makes decisions via explicit entailment-oriented reasoning, which, to our knowledge, is the first such approach; (2) instead of treating decision making and follow-up question generation (or span extraction) separately, EMT is a unified approach that exploits its memory states for both decision making and question generation.
Memory-Augmented Neural Networks. Our work is also related to memory-augmented neural networks (Graves et al., 2014, 2016), which have been applied to NLP tasks such as question answering (Henaff et al., 2017) and machine translation (Wang et al., 2016). For dialog applications, Zhang et al. (2019) propose a dialog management model that employs a memory controller and a slot-value memory, Bordes et al. (2016) learn a restaurant bot with end-to-end memory networks, and Madotto et al. (2018) incorporate external memory modules into dialog generation.

Conclusions
In this paper, we have proposed a new framework for conversational machine reading (CMR) that comprises a novel explicit memory tracker (EMT) to explicitly track the entailment states of rule sentences within its memory module. The updated states are utilized for decision making and coarse-to-fine follow-up question generation in a unified manner. EMT achieves a new state-of-the-art result on the ShARC CMR challenge and offers interpretability by exposing its entailment-oriented reasoning process as the conversation flows. While we conducted experiments on the ShARC dataset, we believe the proposed methodology can be extended to other kinds of CMR tasks.