Conversational Machine Comprehension: a Literature Review

Conversational Machine Comprehension (CMC), a research track in conversational AI, expects the machine to understand an open-domain natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. While most of the research in Machine Reading Comprehension (MRC) revolves around single-turn question answering (QA), multi-turn CMC has recently gained prominence, thanks to the advancement in natural language understanding via neural language models such as BERT and the introduction of large-scale conversational datasets such as CoQA and QuAC. The rise in interest has, however, led to a flurry of concurrent publications, each with a different yet structurally similar modeling approach and an inconsistent view of the surrounding literature. With the volume of model submissions to conversational datasets increasing every year, there is a need to consolidate the scattered knowledge in this domain to streamline future research. This literature review attempts to provide a holistic overview of CMC, with an emphasis on the common trends across recently published models, specifically in their approach to tackling conversational history. The review synthesizes a generic framework for CMC models while highlighting the differences in recent approaches, and is intended to serve as a compendium of CMC for future researchers.


Introduction
Developing open-domain, intelligent dialog systems that can satisfactorily interact like humans, perform complex tasks and/or answer questions on a range of topics has been one of the most ambitious and difficult goals in Artificial Intelligence (AI). The study of such systems, called Conversational AI (ConvAI), is at the confluence of Natural Language Processing (NLP), Information Retrieval (IR), and Machine Learning (ML), attracting significant research from both academia and industry. The recent developments in Deep Learning (DL) (Du and Black, 2019; Hatua et al., 2019) and Reinforcement Learning (RL) (Lipton et al., 2016; Peng et al., 2018) have further boosted research in the domain, making it one of the most sought-after research topics in AI.
A ConvAI system is expected to solve three major research problems. Question Answering (QA) involves providing answers to user queries through conversation, using knowledge drawn from various data sources such as a snippet of text, a collection of web documents, or an entire knowledge base. Task completion expects the conversational agent to accomplish tasks for the user, using the information acquired through conversation. Finally, Social Chat makes the agent emulate humans and converse seamlessly and appropriately with users, as in the Turing test (Saygin et al., 2000). Each of these fields has its own set of challenges to tackle.
Challenges in QA can vary depending on the source of knowledge, the answer extraction strategy employed, and the domain of the question. Machine Reading Comprehension (MRC) is one such challenge in QA that requires the conversational QA (ConvQA) agent to understand a given open-domain text and thereafter answer questions about it in conversation. These questions are often not paraphrased and may co-reference previous queries. The required answer may be a span of the given text or free-form. When the machine comprehension dialog involves multiple co-referenced questions, such that a later question may be a logical successor of an earlier one, the challenge is termed Conversational Machine Comprehension (CMC). A lot of research in MRC revolves around single-turn QA, but multi-turn CMC also holds major relevance because humans seek information conversationally, asking follow-up questions based on what they have already learned. Still, the inherent complexity of text comprehension and of reasoning over dialogs and context had kept CMC a far-fetched goal. However, the recent success in achieving human-level performance with single-turn MRC models (Rajpurkar et al., 2018), due to advances in natural language understanding and modeling (Devlin et al., 2019; Lan et al., 2019), and the introduction of the large-scale conversational datasets CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) have made information-seeking dialogs possible.
As a consequence, CMC has seen a significant surge in research in recent years. In less than two years since the introduction of these datasets, there have been 40 submissions to the CoQA leaderboard and 22 submissions to the QuAC leaderboard. Many of these models are unpublished, indicating active ongoing research on these datasets. Moreover, the current state of the art on QuAC lags behind the human F1 benchmark by a margin of 6.7, suggesting significant scope for improvement. Almost simultaneously, there have been breakthroughs in NLP (Devlin et al., 2019; Radford et al., 2018) which researchers have tried to leverage in their subsequent models (Qu et al., 2019b; Yeh and Chen, 2019). Since many models are being published concurrently, there have been inconsistencies and/or overlaps in their methodologies and research directions. This makes it difficult to compare different approaches against each other and weigh their pros and cons. The situation has blurred the bigger picture and made it difficult for researchers to identify genuinely novel research in this field. Moreover, there is no single summarized view of CMC models, apart from the literature sections of the individual publications, which can be highly localized and inconsistent with the global view. This motivates the need to organize the scattered knowledge across these publications into a consolidated overview, so that future research in this field can be streamlined.
This literature review, therefore, provides a bird's-eye overview of Conversational Machine Comprehension. We commence with an introduction to CMC, acquainting the reader with the challenges that make CMC unique and the large-scale conversational datasets that spurred research in this field. To develop a general understanding of the CMC approaches, we shift the focus from comprehending individual models to observing the common trends that mark these models, synthesizing a generic framework for a CMC model in the process. We end our review with a discussion of the current trends and suggestions for future advancements.

Related Work
There have been several published literature reviews on MRC in recent years. One provides an extensive review of Conversational AI, with a detailed account of the neural approaches employed in each type of dialog system (QA, task completion, and social chat). It briefly discusses the problem of CMC and its datasets but does not comment on the recent advancements and prevalent approaches in this domain. Another provides a summary of the recent single-turn MRC datasets and approaches; it briefly discusses CoQA but does not touch upon any approaches for CMC. A third summarizes the classic models of single-turn MRC, with a focus on deriving a common architecture and suggesting improvements based on the analysis. CMC is mentioned as an emerging research direction in that survey. The latest review, by Baradaran et al. (2020), provides an overview of MRC along with a statistical analysis of datasets and the various problems in this domain. It mentions CMC as an MRC challenge but does not provide any further details. This review differs from its predecessors in that it focuses primarily on Conversational (multi-turn) Machine Comprehension, which has not been detailed in the previous literature. CMC has its own set of challenges and an active research community around it. This calls for treating CMC as a research direction separate from single-turn MRC and reviewing its rapid developments in terms of their general trends.

What is Conversational Machine Comprehension?
The task of CMC is defined as follows: given a passage $P$, the conversation history in the form of question-answer pairs $\{Q_1, A_1, Q_2, A_2, \ldots, Q_{i-1}, A_{i-1}\}$, and a question $Q_i$, the model needs to predict the answer $A_i$. The answer $A_i$ can either be a text span $(s_i, e_i)$ (Choi et al., 2018) or a free-form text $\{a_{i,1}, a_{i,2}, \ldots, a_{i,j}\}$ with evidence $R_i$ (Reddy et al., 2019). Single-turn MRC models cannot directly cater to CMC, as the latter is much more challenging to address. The major challenges are:
• The encoding module needs to encode not only $P$ and $Q_i$ but also the conversational history.
• General observation of information-seeking dialog in humans suggests that the starting dialog turns tend to focus on the beginning chunks of the passage and shift focus to the later chunks as the conversation progresses (Choi et al., 2018). The model is thus expected to capture these focal shifts during a conversation and reason pragmatically, instead of only matching lexically or via paraphrasing.
• Multi-turn conversations are generally incremental and co-referential. These conversational dialogs are either drilling down (the current question is a request for more information about the topic), shifting topic (the current question is not immediately relevant to something previously discussed), returning topic (the current question is asking about a topic again after it had previously been shifted away from), clarification of topic, or definition of an entity (Yatskar, 2019). The model should, therefore, be able to take context from history which may or may not be immediate.
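The task definition above can be made concrete with a small data structure. The following is a minimal sketch in Python; the class and field names (`Turn`, `CMCExample`, `turn_index`) and the toy passage are illustrative assumptions and do not come from either dataset's release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One question-answer pair (Q_k, A_k) of the conversation history."""
    question: str
    answer: str
    # Optional character span (start, end) of the answer in the passage,
    # end exclusive; only meaningful for extractive answers.
    span: Optional[Tuple[int, int]] = None


@dataclass
class CMCExample:
    """Inputs available when predicting A_i: passage P, history, and Q_i."""
    passage: str
    question: str
    history: List[Turn] = field(default_factory=list)

    def turn_index(self) -> int:
        """1-based index i of the current question."""
        return len(self.history) + 1


# Toy example (hypothetical data): the second question co-references "Anna".
example = CMCExample(
    passage="Anna lives in Oslo. She works as a nurse.",
    history=[Turn("Where does Anna live?", "Oslo", span=(14, 18))],
    question="What does she do?",
)
```

Keeping the history as an ordered list of turns makes both static policies (take the last k turns) and learned policies (score every turn) straightforward to express on top of the same structure.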

Multi-Turn Conversational Datasets
The surge in CMC research is credited to the emergence of large-scale multi-turn conversational datasets: CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018).
CoQA
• Dataset preparation: Conversations are prepared over passages collected across 7 different domains, each with its own source dataset, such as news articles derived from CNN (Hermann et al., 2015). Of the 7 domains, two are used for out-of-domain evaluation (only for evaluation, not training), while the other five are used for in-domain evaluation (both training and evaluation). The dialog is prepared in a two-annotator setting, with one annotator questioning and the other answering, both referring to the entire context.
• Questions: Questions are factoid but require sufficient co-referencing and pragmatic reasoning (Bell, 1999).
• Answers: Answers are free-form, with their corresponding rationale highlighted in the passage. However, Yatskar (2019) identified that the answers are slightly modified versions of the rationale, and therefore optimizing an extractive model to predict the answer span with maximum F1 overlap to the gold answer can achieve up to 97.8 F1.
• Dialog features: The dialogs mostly involve drilling-down for details (about 60% of all questions) but lack other dialog features like topic-shift, clarification, or definition.
• Evaluation: Macro-average F1 score of word overlap is used as an evaluation metric and is computed separately for in-domain and out-of-domain.
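The word-overlap F1 used for evaluation can be illustrated with a short sketch. This is a simplified version under stated assumptions: answers are lowercased and split on whitespace, there is a single gold answer per question, and the function names `token_f1` and `macro_f1` are our own. The official evaluation scripts apply additional answer normalization that is not reproduced here.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, gold: str) -> float:
    """Word-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: List[str], golds: List[str]) -> float:
    """Average of the per-question F1 scores."""
    scores = [token_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)
```

For instance, predicting "in Oslo" against the gold answer "Oslo" gives a precision of 0.5 and a recall of 1.0, hence an F1 of 2/3.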
QuAC
• Dataset preparation: Dialogs are prepared over sections from Wikipedia articles about people from different genres such as culture and wildlife. The dataset is prepared using an asymmetric setting, with a student exposed only to the title of the article and a summary, while the teacher is exposed to the entire section of the article on which the dialog is to be based. The student, therefore, tries to seek information about the hidden text based on the limited information it gets from the dialog, and the teacher answers by providing short excerpts from the section (or 'No Answer' if not possible).
• Questions: Questions are descriptive, highly-contextual, and open-ended due to the asymmetric nature of the dataset that prevents paraphrasing. They require sufficient co-referencing and pragmatic reasoning.
• Dialog features: Besides drilling down, dialogs switch to new topics more frequently than in CoQA. The dataset, though, lacks definition or clarification dialogs.
• Answers: Answers are extractive spans and can also be Yes/No or 'No Answer'. Besides the extractive span, the response includes additional signals called dialog acts, such as continuation (follow up, maybe follow up, or don't follow up) and affirmation (yes, no, or neither), which provide additional useful dialog-flow information to train on, as used by Qu et al. (2019b) and Ju et al. (2019). Further, an analysis of the answer token lengths in Table 1 shows that QuAC answers are longer, which can be attributed to the dataset's asymmetric setting, which motivates the seeker to ask open-ended questions to gauge the hidden text.
• Evaluation: Besides the macro-averaged F1 score on the entire set, QuAC also evaluates Human Equivalence Score (HEQ) to judge system performance relative to an average human, by finding the percentage of instances for which the system's F1 matches or exceeds human F1. HEQ-Q and HEQ-D are thus HEQ scores with the instances as questions and dialogs respectively.
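The HEQ computation described above can be sketched as follows. The sketch assumes that per-question system and human F1 scores are already available, grouped by dialog, and that a dialog counts towards HEQ-D only if every question in it matches or exceeds human F1; that second condition is our reading of "instances as dialogs" and should be checked against the official scorer. The function name `heq_scores` is hypothetical.

```python
from typing import List, Tuple


def heq_scores(system_f1: List[List[float]],
               human_f1: List[List[float]]) -> Tuple[float, float]:
    """Return (HEQ-Q, HEQ-D) as percentages.

    system_f1[d][q] and human_f1[d][q] hold the F1 of question q in dialog d.
    """
    question_hits = 0
    question_total = 0
    dialog_hits = 0
    for sys_dialog, hum_dialog in zip(system_f1, human_f1):
        hits = [s >= h for s, h in zip(sys_dialog, hum_dialog)]
        question_hits += sum(hits)
        question_total += len(hits)
        # Assumption: a dialog counts only if every question in it counts.
        dialog_hits += all(hits)
    heq_q = 100.0 * question_hits / question_total
    heq_d = 100.0 * dialog_hits / len(system_f1)
    return heq_q, heq_d
```

With two dialogs of two questions each, where the system falls short of the human on exactly one question, three of four questions and one of two dialogs count, giving HEQ-Q = 75 and HEQ-D = 50.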
General dataset characteristics and an example from each of the datasets are provided in Appendix A.

Generic Framework of a CMC Model
Prior work defined the steps for performing reading comprehension in a typical neural MRC model as (1) encoding the questions and context into a set of embeddings in a neural space; (2) reasoning in the neural space to identify the answer vector; and (3) decoding the answer vector into a natural language output. Huang et al. (2018a) adapted these steps to CMC by adding conversational history modeling. Qu et al. (2019c) proposed a ConvQA model with separate modules for history selection and modeling. Based on these prior works, we synthesize a generic framework for a CMC model. A typical CMC model is provided with context $C$, current question $Q_i$, and the conversation history $H_i$, and needs to generate an output set $O_i$. The CMC framework is shown in Fig. 1. The framework has four major components, based on their contribution to the overall CMC flow.
1. History Selection module: With complicated dialog behaviors like topic shift or topic return (Yatskar, 2019), simply selecting immediate turns may not work well. A history selection module, therefore, uses a policy (dynamic or static) to choose a subset $H'_i$ of the history turns $H_i$ that is expected to be more helpful than the others. If the history selection module is based on a dynamic learned policy (e.g. Qu et al. (2019b)), then feedback from the other modules can guide its update.

2. Encoder: The lexical tokens of the context passage $C$, the selected conversational turns $H'_i$, and the current question $Q_i$ need to be transformed into input embeddings for the reasoning module. The encoder facilitates this transition. The encoder steps may vary with every approach and with the reasoning inputs, but at a high level, encoding involves the transformation and combination of context-independent word embeddings (lexical embeddings) such as GloVe (Pennington et al., 2014); intra-sequence contextual embeddings, e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) or an RNN; question-aware embeddings; and additional feature embeddings like POS tags, history embeddings (Qu et al., 2019c) or conversation count. The conversational history $H'_i$ is generally integrated by this module into any or all of the contextual input embeddings. This process is called history modeling and is the most significant aspect of a CMC encoder.
Figure 1: Generic framework of a CMC model. A typical CMC model consists of (1) a history selection module, which selects a subset $H'_i$ of the conversational history $H_i$ relevant to the current question $Q_i$; (2) an encoder, which encodes the lexical tokens of context $C$, $Q_i$ and $H'_i$ into input embeddings for the contextual integration layer; (3) a reasoning module, which performs contextual integration of the input embeddings into contextualized embeddings; and finally, (4) an output predictor, which predicts the output set $O_i$ based on the contextualized embeddings.
3. Contextual Integration layer: Contextual information accumulated in the passage, query, and/or history embeddings individually must be fused to generate query-aware and/or history-aware contextualized output embeddings. This process may involve a single layer (single-step reasoning) or repetition across multiple layers (multi-step reasoning). Input for this module generally consists of two (or more) sequence sets for every history turn, or aggregated across all turns, which are then fused in each layer and often inter-weaved (Huang et al., 2018b) with attention.

4. Output Predictor: The model output may be in the form of a text span, signals like dialog acts (Choi et al., 2018), or a free-form (abstractive) answer (Reddy et al., 2019). The contextual embeddings generated by the reasoning module hold all the latent information about the question, context passage, and conversational history. To get the token-level output, a fully-connected network followed by a softmax layer is generally used for per-token probability (abstractive) or start/end probability (extractive). Besides, a linear neural network may be used to find the aggregated result of the sequence.
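For the extractive case, the last step of the output predictor can be sketched as below. The sketch assumes that a network has already produced one start logit and one end logit per context token; it then applies a softmax to each and picks the span with the highest joint probability subject to start ≤ end. The `max_len` cap and the function names are assumptions of this illustration, reflecting common practice in extractive MRC rather than any specific CMC model.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits: List[float], end_logits: List[float],
              max_len: int = 30) -> Tuple[int, int]:
    """Return token indices (start, end), end inclusive, of the best span."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best, best_score = (0, 0), -1.0
    for s, ps in enumerate(p_start):
        # Only consider ends at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(p_end))):
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

Searching over valid (start, end) pairs, instead of taking the two argmax positions independently, avoids returning a span whose end precedes its start.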

Common Trends across CMC models
Instead of describing each CMC model separately, we categorize them by the approaches they employ in their components (section 5) or by other model characteristics. This helps in developing a high-level understanding of the CMC models. A model-wise summary of the CMC models is provided in Appendix C.

Trends in History Selection
Almost all of the current CMC models select conversational history using the heuristic of considering the $k$ immediate turns, with $k$ often decided by performance. For example, BiDAF++ (Choi et al., 2018; Yatskar, 2019), SDNet, and BiDAF++ w/ 2-ctx (Ohsugi et al., 2019) use the last two turns, as including the third turn degrades performance. The History Attention Mechanism (HAM) based model (Qu et al., 2019b) instead uses a dynamic history selection policy: it attends over contextualized representations of all the previous history turns at word level or sequence level and combines them with the current turn's representation, as shown in Fig. 3a.
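The two kinds of policy can be contrasted in a few lines. This is a minimal sketch, not the implementation of any cited model: `last_k_turns` is the static heuristic, while `attention_weights` and `aggregate` illustrate a soft, attention-based selection over one vector per history turn. The query vector, the scaled dot-product scoring, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def last_k_turns(history: Sequence, k: int = 2) -> list:
    """Static policy: keep only the k most recent history turns."""
    return list(history[-k:]) if k > 0 else []


def attention_weights(turn_vectors: List[List[float]],
                      query: List[float]) -> List[float]:
    """Dynamic policy: softmax over scaled dot products with a query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(t * q for t, q in zip(vec, query)) / scale
              for vec in turn_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(turn_vectors: List[List[float]],
              weights: List[float]) -> List[float]:
    """Weighted combination of the per-turn vectors."""
    dim = len(turn_vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, turn_vectors))
            for d in range(dim)]
```

The static policy makes a hard, position-based choice, whereas the attention weights give every turn a degree of relevance, so an older but pertinent turn can still contribute after a topic return.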

Trends in History Modeling
How conversational history is integrated or used in the encoding process of contextual input embeddings can be used to classify CMC models. Different trends observed in this respect are described below. Some models may use a combination of these approaches.
1. Appending selected history questions and/or answers (in raw form or as text span indices) to the current question before encoding. QA tokens across turns should be distinguishable or separated when appending. The models DrQA+PGNet (Reddy et al., 2019), SDNet and RoBERTa + AT + KD (Ju et al., 2019) append all history QA pairs, separated by special marker tokens. On the other hand, the QuAC baseline model BiDAF++ w/ 2-ctx (Ohsugi et al., 2019) and GraphFlow append only the history questions to the current question and encode the relative dialog-turn number within each question embedding to differentiate them. Choi et al. (2018) validate that this dialog-turn encoding strategy performs better in practice.
2. Encoding context tokens with history answer marker embeddings (HAE) before passing them on for reasoning. These embeddings indicate whether or not the context token is present in any conversational history answer, as in BiDAF++ w/ 2-ctx (Choi et al., 2018), GraphFlow, BERT+HAE (Qu et al., 2019a) and HAM (Qu et al., 2019b). HAM uses a dialog-turn encoded variant of HAE called Positional HAE. It maintains a lookup table of history embeddings for every relative position from the current conversation turn and applies the corresponding embedding if the token is found in that history answer. For example, for the current question $q_k$, if a token is found in history answer $a_{k-2}$, then the Positional HAE embedding at index 2 is used; otherwise the embedding at index 0 is used. This setting is illustrated in Fig. 3b.
3. Integrating intermediate representations generated in the reasoning modules of selected history conversation turns to grasp the deep latent semantics of the history, rather than acting on raw inputs. This approach is also called the FLOW-based approach. The models that follow this approach are FlowQA (Huang et al., 2018a), FlowDelta (Yeh and Chen, 2019), and GraphFlow. GraphFlow encodes conversational histories into context graphs, which are used by the reasoning module for contextual analysis.
For contextual encoding, most of the models utilize one of two types of encoders: (a) bidirectional sequential language models such as BiDAF (Seo et al., 2017) or ELMo (Peters et al., 2018), or (b) deep bidirectional transformer-based models such as BERT (Devlin et al., 2019) or RoBERTa.
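The first two trends in the list above operate on raw inputs and are easy to sketch. In the code below, `prepend_history` illustrates trend 1 and `positional_hae` illustrates the index computation behind Positional HAE from trend 2. The marker strings `<q>` and `<a>`, the function names, and the rule that the most recent answer wins when a token occurs in several history answers are all assumptions of this sketch, not details confirmed by the cited papers.

```python
from typing import List, Tuple


def prepend_history(question: str, history: List[Tuple[str, str]],
                    q_mark: str = "<q>", a_mark: str = "<a>") -> str:
    """Trend 1: prepend history QA pairs, separated by marker tokens."""
    parts: List[str] = []
    for hist_q, hist_a in history:
        parts += [q_mark, hist_q, a_mark, hist_a]
    parts += [q_mark, question]
    return " ".join(parts)


def positional_hae(context_len: int,
                   answer_spans: List[Tuple[int, int]]) -> List[int]:
    """Trend 2: per-token Positional HAE indices.

    answer_spans lists the (start, end) token spans, end inclusive, of the
    history answers from oldest to newest. Index 0 means "in no history
    answer"; index r means "in the answer given r turns ago".
    """
    ids = [0] * context_len
    n = len(answer_spans)
    for j, (start, end) in enumerate(answer_spans):
        relative_turn = n - j
        # Later (more recent) answers overwrite earlier ones.
        for t in range(start, end + 1):
            ids[t] = relative_turn
    return ids
```

With two history answers covering tokens 0-1 and 3-4 of a six-token context, the older answer is two turns back, so the resulting indices are `[2, 2, 0, 1, 1, 0]`, matching the convention described in item 2.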

Trends in Contextual Reasoning
While every CMC model has its unique flavor in integrating encoded representations of the query, history, and text contextually, some recurrent themes in reasoning can still be drawn. It is important to note that some of these themes reflect the state-of-the-art techniques at the time of their release, which may now be obsolete. However, knowing them would prevent the re-exploration of those ideas. The commonly observed themes are as follows.

A. Attention-based Reasoning with Sequence Models
This was a common theme across MRC models until transformers (Vaswani et al., 2017) were introduced and removed the need for recurrent sequence modeling. Consequently, the initial baseline models were based on this approach. The CoQA baseline (Reddy et al., 2019) first applies DrQA (Chen et al., 2017), which performs BiLSTM-based contextual integration over encoded tokens to obtain an extractive span, and then PGNet, which uses attention-based neural machine translation (Bahdanau et al., 2015) for abstractive answer reasoning. The QuAC baseline (Choi et al., 2018) combines self-attention with BiDAF (Seo et al., 2017), which performs reasoning via multi-layered bidirectional attention followed by a multi-layered BiLSTM (BiDAF++). SDNet applies both inter-attention and self-attention in multiple layers, interleaved with BiLSTMs, to comprehend the conversation context.

B. FLOW based approaches
Analogous to recurrent models, which propagate contextual information through the sequence, FLOW is a sequence of latent representations that propagates reasoning in the direction of the dialog progression by feeding intermediate latent representations, generated during reasoning in previous conversation turns, into contextual reasoning for the current question. This helps to leverage the reasoning effort of previous turns, as compared to using shallow history, such as directly appending history question-answers, where important contextual information in conversations may be lost due to the overwhelming input. There are two major flow-based approaches, distinguished by the propagated latent representation.
Figure 2: (a) Integration-Flow reasoning involves alternating computation between context integration (RNN over context) and FLOW (RNN over question turns). (b) FlowQA architecture: Integration-Flow layers are alternated with cross-attention between the context and the question. The answer is predicted on the final concatenated output.

Integration-Flow (IF):
FlowQA (Huang et al., 2018a), which also introduced the idea of FLOW, involves sequential processing along the context tokens, in parallel across the question turns, followed by sequential processing in the direction of the question turns (Flow), in parallel across the context tokens, as illustrated in Fig. 2a. FlowQA employs multiple IF layers interleaved with self- and cross-attention to reason over the encoded embeddings (Fig. 2b). The recently released FlowDelta (Yeh and Chen, 2019) is an improvement on the IF approach that uses a similar architecture to FlowQA and achieves better results. Instead of passing the latent representation directly, as in FlowQA, FlowDelta passes the information gain (the difference between the latent representation of previous 2 layers), with the intuition that the information gain allows the model to focus on more informative cues in the context.
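The alternation between integration and flow described above can be illustrated with a toy sketch. Heavy simplifications are assumed: every context token at every turn is represented by a single float, and `toy_cell` is a fixed stand-in for a trained GRU or LSTM cell. The helper `information_gain` only shows the element-wise difference of two latent representations and makes no claim about which two representations FlowDelta subtracts. All names are hypothetical.

```python
import math
from typing import List


def toy_cell(x: float, h: float) -> float:
    """Toy recurrent cell standing in for a trained GRU/LSTM cell."""
    return math.tanh(0.5 * x + 0.5 * h)


def run_rnn(inputs: List[float]) -> List[float]:
    """Unidirectional recurrence over a sequence, starting from h = 0."""
    h, outputs = 0.0, []
    for x in inputs:
        h = toy_cell(x, h)
        outputs.append(h)
    return outputs


def integration_flow_layer(states: List[List[float]]) -> List[List[float]]:
    """states[i][t] is the representation of context token t at turn i."""
    # Integration: recurrence along context tokens, independently per turn.
    integrated = [run_rnn(turn) for turn in states]
    # Flow: recurrence along turns, independently per context token.
    n_tokens = len(states[0])
    columns = [run_rnn([turn[t] for turn in integrated])
               for t in range(n_tokens)]
    # Transpose back to the [turn][token] layout.
    return [[columns[t][i] for t in range(n_tokens)]
            for i in range(len(states))]


def information_gain(latest: List[float], earlier: List[float]) -> List[float]:
    """Element-wise difference between two latent representations."""
    return [a - b for a, b in zip(latest, earlier)]
```

Because the flow recurrence runs forward along the turn axis, the output for an earlier turn does not depend on later turns, which mirrors how reasoning is propagated in the direction of the dialog.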

Integration-GraphFlow (IG):
GraphFlow claims that the IF mechanism does not mimic human reasoning, as it first performs reasoning in parallel for each question and then refines the reasoning results across different turns. Therefore, it uses dynamically constructed, question-aware context graphs for each turn as the propagated latent representation. Processing through this flow (called GraphFlow) is facilitated by applying GNNs to the current context graph and the previous context. To capture local interactions among consecutive words in the context before feeding them to a GNN, a BiLSTM is applied for contextual integration. The GraphFlow architecture alternates this mechanism with co-attention over the question and the GNN output. This is illustrated in the figure provided in Appendix B.

C. Contextual Integration using Pre-trained Language Models
Large-scale pre-trained LMs such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018) and RoBERTa have become the current state-of-the-art approaches for contextual reasoning in CMC models, with the leaderboards of both datasets dominated by these models or their variants. The approach is based on the fine-tuning procedure for BERT-based MRC outlined by Devlin et al. (2019), in which the question and context are packed together (with marker embeddings to distinguish them) in an input sequence to BERT, which outputs contextualized question-aware embeddings for each input token.
Using pre-trained models for reasoning is advantageous in two aspects: Firstly, it simplifies the architecture by fusing encoding and reasoning modules into a single module. Secondly, it provides a ready-to-tune architecture that abstracts out complex contextual interactions between query and context while providing sufficient flexibility to control interactivity via augmentation of input embeddings i.e. concatenation of special embeddings to input tokens that signal the model to incorporate a desirable characteristic in contextualization.
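The input packing and the augmentation of input embeddings can be sketched without any model code. The function below builds the token sequence, the two-valued segment ids, and a parallel list of history-answer ids for one question-context pair. In a BERT-style model, each id list would index an embedding table whose output is combined with the token embeddings; that step is omitted. The function name `pack_inputs` and the pre-tokenized inputs are assumptions of this sketch.

```python
from typing import List, Tuple


def pack_inputs(question_tokens: List[str], context_tokens: List[str],
                hae_ids: List[int]) -> Tuple[List[str], List[int], List[int]]:
    """Build tokens, segment ids and history-answer ids for one sequence."""
    assert len(hae_ids) == len(context_tokens)
    tokens = (["[CLS]"] + question_tokens + ["[SEP]"]
              + context_tokens + ["[SEP]"])
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    segment_ids = ([0] * (len(question_tokens) + 2)
                   + [1] * (len(context_tokens) + 1))
    # Only context tokens can carry a history-answer marker.
    history_ids = [0] * (len(question_tokens) + 2) + hae_ids + [0]
    return tokens, segment_ids, history_ids


tokens, segment_ids, history_ids = pack_inputs(
    ["who", "?"], ["anna", "lives"], [1, 0])
```

Since the history signal travels in an extra id list aligned with the tokens, the sequence layout and the two segment ids stay exactly as in single-turn MRC.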
Figure 3: Illustration of (a) the history selection module and (b) the encoder/reasoning module of the History Attention Mechanism (HAM) model (Qu et al., 2019b). (a) HAM uses a dynamic attention-based history selection policy. Contextualized representations are generated by the model's encoder (BERT+PosHAE) for every history turn at word and sequence levels. Sequence-level embeddings are used to compute attention weights via scaled dot product, and aggregate representations are generated by a weighted combination of the embeddings of each turn in proportion to their attention weights. Thus, the attention weights determine the degree of selection (relevance) of each history turn. (b) HAM's BERT-based encoder (reasoning architecture) for every conversation turn. The encoder is provided with an input sequence consisting of query tokens (yellow) and context tokens (green) separated by [SEP]. It outputs contextualized representations $T_i$ corresponding to the aligned question/passage tokens. The token embeddings are augmented with segment embeddings (to differentiate query and context), positional embeddings (for the distinct position in the sequence), and Positional HAE embeddings (for encoding the history answer and relative conversational turn).
However, incorporating history into these models is a key challenge in this approach, as most transformer models such as BERT accept only two segment IDs in the input sequence. Based on recent research in CMC, two main trends in solving the history integration issue are discussed below:
1. Modify the input embeddings of a single-turn MRC model to incorporate history. This is done either by appending the entire conversation to the question, as in Ju et al. (2019), which uses RoBERTa as the base model and truncates the query if it exceeds the limit, or by adding special embeddings that highlight the conversational history for the model, as in HAE (Qu et al., 2019a), which adds a history answer embedding to each context token that is present in any of the history answers (detailed in section 6.2). This approach does not effectively use the model to capture interactions between every dialog turn and the context.
2. Use a separate model for each conversational turn to capture one-to-one interaction between history and context, and merge the per-turn contextualized embeddings into aggregated history-aware embeddings. Two models follow this trend. Ohsugi et al. (2019) use BERT models to capture the contextual interaction for every question (history and current) and answer (2N+1 sequences for N turns) and concatenate all sequences together. Finally, a Bi-GRU (Cho et al., 2014) is run over the aggregated sequence to capture inter-turn interactions before sending it for prediction. On the other hand, HAM (Qu et al., 2019b) ignores the history questions and uses the current question as the query with Positional History Answer Embeddings (section 6.2), thus generating one output sequence per conversation turn. Fig. 3b illustrates the HAM encoder. The final sequence is generated using token-level soft-attention based aggregation across all per-turn contextualized sequences.

Trends in Training Methodology
Due to the multi-output nature of both CoQA and QuAC, multi-task training is quite common amongst CMC models. For example, HAM (Qu et al., 2019b) uses multi-task learning over QuAC to also predict the affirmation and continuation dialog acts, while GraphFlow uses multi-task learning over CoQA to also predict the question type. Besides, the recently published model of Ju et al. (2019) achieved state-of-the-art results using RoBERTa by applying multiple training techniques over CoQA. These consist of rationale-tagging multi-task learning (predicting whether a token exists in the CoQA evidence), Adversarial Training (Goodfellow et al., 2015), and Knowledge Distillation (Furlanello et al., 2018).
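A multi-task objective of this kind is commonly written as a weighted sum of per-task losses. The sketch below assumes that form, together with a mean binary cross-entropy for rationale tagging (label 1 if the token lies in the evidence span). The weighting scheme, the task names, and the function names are illustrative assumptions and are not taken from the cited models.

```python
import math
from typing import Dict, List


def rationale_tagging_loss(probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over context tokens."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def multitask_loss(losses: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses; missing weights default to 1.0."""
    return sum(weights.get(name, 1.0) * value
               for name, value in losses.items())
```

For example, an answer loss of 1.0 combined with a rationale loss of 2.0 weighted by 0.5 yields a total loss of 2.0.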

Discussion
How does the research progress in CMC, a constrained setup, benefit the more into-the-wild domain of Conversational Search? As stated by Qu et al. (2019a), Conversational QA (and CMC) is a simplified setting of Conversational Search (ConvSearch), an information-seeking, "System Ask, User Respond" paradigm (Zhang et al., 2018b), that does not focus on asking proactively. CMC, specifically, tries to address the challenges of NLU, via contextual encoding, reasoning, and handling conversational history, via history selection and modeling. In that aspect, CMC is a concrete enough setting for IR researchers to understand the change of information needs and interactivity between conversational cycles.
Could Commonsense Reasoning improve CMC? Commonsense Reasoning (CR) is based on the set of background information or world knowledge that an individual is expected to know or assume, and which may be missing from the context. On the other hand, pragmatic reasoning, which the current CMC models cater to, is based on the derivation of explicit and implicit meanings within the context. The current MRC systems are nearing human performance on most datasets; however, they still perform poorly on single-turn CR-based questions (Zhang et al., 2018a). While there has recently been increasing interest in CR in the single-turn MRC setting (Ostermann et al., 2018; Lin et al., 2017), CMC remains relatively untouched. This is probably due to the lack of foreknowledge-requiring unanswerable questions (e.g. in SQuAD 2.0 (Rajpurkar et al., 2018)) in current CMC datasets (Yatskar, 2019), suggesting a need for more complex CMC datasets that incorporate CR. However, human annotators may often apply commonsense reasoning involuntarily while answering questions or comprehending, thus leaving room for incorporating CR in models. There seems to be no recent work that experimentally invalidates the role of CR in CMC. QuAC, for example, is drawn from articles on personalities, and current models still lag behind the human benchmark. It may be worth testing whether adding domain knowledge or attributes about the context, like location and gender, helps in answering these questions.
Why did the paper focus on common trends across each component rather than a single overarching classification of CMC models? The study of common trends in modeling, rather than a single overarching classification, helped in providing a multi-faceted view of CMC that can generalize to future models and identify possible open-ended research questions, such as: (a) For history selection, HAM (Qu et al., 2019b) has proved to be both effective and intuitive in selecting relevant history turns. Applying this history selection approach to previous techniques that considered the immediate k turns could be experimented with. (b) As mentioned under training methodology (section 6.4), the RoBERTa-based CMC model (Ju et al., 2019) that used knowledge distillation and adversarial training achieved state-of-the-art results on CoQA (Reddy et al., 2019). This suggests that different training approaches, along with multi-task learning, improve the performance of base models. These procedures could be tried with more advanced models such as HAM (Qu et al., 2019b) and FlowDelta (Yeh and Chen, 2019).

Conclusion
In this paper, we provide a holistic overview of Conversational Machine Comprehension (CMC), which has seen a surge of research in recent years, owing to advancements in neural language modeling and the introduction of large-scale conversational datasets such as CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018). We discuss the challenges that make CMC different from machine reading comprehension (MRC) and compare the multi-turn conversational datasets CoQA and QuAC on different CMC characteristics. To develop a high-level understanding of the existing approaches to CMC, we synthesize a general model framework and analyze the common trends across the published models, loosely based on the components outlined in the framework. Finally, we discuss some open questions that emerged during our research and that, in our view, can be explored further. This review could serve as a compendium for researchers in this domain and help streamline research in CMC.

A General statistics for CoQA and QuAC
The dataset statistics are provided in Table 1. An example from both datasets is provided in Fig. 5.

Characteristic (average) | CoQA | QuAC
Dataset source | Passages collected from 7 diverse domains e.g. children stories from MCTest, news articles from CNN, Wikipedia articles, etc. | Sections from Wikipedia articles filtered in the "people" category associated with subcategories like culture, animal, geography, etc.
Conversation setting | Questioner-Answerer setting where both have access to the entire context. | Teacher-Student setting where the teacher has access to the full context for answering, while the student has only the title and summary of the article.

Table 1: CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018) based on different characteristics defined in their papers and the analysis paper by Yatskar (2019).

Figure 5: (a) A QA dialog example in the CoQA dataset. Every dialog is based on a context and each turn of the dialog contains a question (Qi), an answer (Ai) and a rationale (Ri) that supports the answer. There is sufficient co-referencing between dialog turns as seen in this example: 'Where' in Q2 follows on the candidature mentioned in Q1, 'his' in Q4 points to A3, 'he' in Q5 references A4, and 'them' in Q6 refers to people mentioned in both A3 and A4. Source: (Reddy et al., 2019).
(b) The questions are open-ended due to the asymmetric nature of the dataset. There is also sufficient co-referencing: 'she' in Q3 refers to the protagonist and is a succession of Q2, similarly Q7 is a follow-up on Q5, and 'it' in Q6 refers to the song mentioned in A5.

B GraphFlow
Figure 4: Architecture of the Reasoning Layer of GraphFlow. Context graph-based flow sequence is processed using GNNs and alternated with bi-LSTM and co-attention mechanisms. Source:

C A summary of the common CMC models
The following tables provide a summary of the CMC models published on the CoQA and QuAC leaderboards. The tables also provide links to the official code repositories for the models. Although most of the models are published on both leaderboards, some models are very specific to one of the datasets.
(Goodfellow et al., 2015) and Knowledge Distillation (Furlanello et al., 2018).

Model
No official code repo is given.
No official code repo is given.
HAM (Qu et al., 2019b). CoQA: N/A; QuAC: 65.4. Dynamic selection policy by attending over all the previous history turns and deriving weights based on their contextual correlation with the current turn. Concatenate context tokens with positional history answer marker embeddings corresponding to each dialog turn at a relative position from the current turn, e.g. marker embedding 2 for turn 5 when the current turn is 7. Ignores history questions. Aggregates contextual outputs from BERT-Base models for every answer (N sequences for N turns) using attention-based selection weights to obtain word-level and seq-level representations. Dialog act (continuation and affirmation) prediction multi-task. https://github.com/prdwb/attentive_history_selection

Bert-FlowDelta (Yeh and Chen, 2019). CoQA: 77.7; QuAC: 65.5. All history questions only. Integrate latent representations generated via contextual reasoning on the history turns (FLOW). Each history turn is first passed through BERT whose last and second last layer outputs are input to separate FlowQA models. Reasoning is done using the Integration-Flow approach with information gain as the propagated FLOW representation.

GraphFlow. QuAC: 64.9. All the history Q&A. Uses a mix of all modeling techniques: encodes relative dialog turn numbers within each question embedding and appends all to the current question, then concatenates answer marker embeddings to context tokens and finally encodes the turn into N context graphs which are used for reasoning. The encoded context graphs per turn are used as the propagated FLOW representation in the Integration-GraphFlow architecture.

FlowQA (Huang et al., 2018a). CoQA: 75.0; QuAC: 64.1. All history questions only. Integrate latent representations generated via contextual reasoning on the history turns (FLOW). Reasoning is done using the Integration-Flow approach with latent history turn representation directly as the propagated FLOW representation (instead of information gain).
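The positional history answer markers described for HAM in the summary above can be illustrated with a short sketch. It builds one marker sequence per history turn (matching the "N sequences for N turns" description), where each context token inside that turn's answer span receives the turn's distance from the current turn and every other token receives 0. The function name, the inclusive (start, end) span format and the use of 0 as the "not in a history answer" marker are assumptions made for illustration. In a model, each marker id would index a learned embedding table.

```python
def history_answer_marker_sequences(context_len, history_answer_spans, current_turn):
    """Build one marker sequence per history turn.

    history_answer_spans maps a turn number to the inclusive (start, end)
    token span of that turn's answer in the context (hypothetical format).
    Tokens inside the span get the relative distance current_turn - turn;
    all other tokens get 0.
    """
    sequences = {}
    for turn, (start, end) in sorted(history_answer_spans.items()):
        if turn >= current_turn:
            continue  # only earlier turns count as history
        relative = current_turn - turn
        sequences[turn] = [relative if start <= pos <= end else 0
                           for pos in range(context_len)]
    return sequences


# Current turn is 7; the answer to turn 5 covers tokens 1-2, turn 6 covers token 4.
markers = history_answer_marker_sequences(
    context_len=6,
    history_answer_spans={5: (1, 2), 6: (4, 4)},
    current_turn=7,
)
# markers[5] == [0, 2, 2, 0, 0, 0]  (marker 2 for turn 5, as in the summary)
# markers[6] == [0, 0, 0, 0, 1, 0]
```

Only answer spans are marked, which reflects the summary's note that this encoding ignores history questions.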