MC^2: Multi-perspective Convolutional Cube for Conversational Machine Reading Comprehension

Conversational machine reading comprehension (CMRC) extends traditional single-turn machine reading comprehension (MRC) with multi-turn interactions, which requires machines to consider the history of the conversation. Most models simply concatenate previous questions for conversation understanding and only employ recurrent neural networks (RNN) for reasoning. To comprehend context profoundly and efficiently from different perspectives, we propose a novel neural network model, the Multi-perspective Convolutional Cube (MC^2). We regard each conversation as a cube, and integrate 1D and 2D convolutions with RNN in our model. To prevent the model from previewing the next turn of the conversation, we also extend causal convolution partially to 2D. Experiments on the Conversational Question Answering (CoQA) dataset show that our model achieves state-of-the-art results.


Introduction
Conversation is one of the most important ways for humans to acquire information. Different from traditional machine reading comprehension (MRC), conversational machine reading comprehension (CMRC) requires machines to answer multiple follow-up questions according to a passage and the dialogue history. However, these questions usually involve complicated linguistic phenomena, such as co-reference and ellipsis. Only by considering the conversation context thoroughly can we answer the current question correctly.
Recently, many CMRC datasets, such as CoQA (Reddy et al., 2019) and QuAC (Choi et al., 2018), have been proposed to enable models to understand passages and answer questions in dialogue. Figure 1 shows an example from the CoQA dataset. We can observe that the second and third questions omit key information; it is impossible for both humans and machines to understand such questions without the dialogue history. Most existing methods consider conversation history by prepending previous questions and answers to the current question, such as BiDAF++ (Yatskar, 2019), DrQA+PGNet (Reddy et al., 2019), and SDNet. However, the latent semantic information of the dialogue history is neglected, and the model may confuse unrelated questions and answers once they are concatenated into one sentence. Although FlowQA (Huang et al., 2019) utilizes intermediate representations of the previous conversation, its flow mechanism cannot synthesize the information of different words in different turns of the conversation simultaneously. Moreover, previous models only use recurrent neural networks (RNN) as their main skeleton, which cannot be parallelized due to their recurrent nature, and RNN can only grasp information from two directions, forward or backward. In contrast, when conversing, humans usually consider the history from different perspectives and answer questions comprehensively.
To address these issues, we propose a novel model, the Multi-perspective Convolutional Cube (MC^2). Every conversation is represented as a cube of hidden states, which is read from multiple perspectives: 1D and 2D convolutions are integrated with RNN, and causal convolution is partially extended to 2D so that the model cannot preview future turns of the conversation.

Approaches
In this section, we propose our novel model, MC^2, for the task of conversational machine reading comprehension, which can be formulated as follows. For one conversation, given a passage with $n$ tokens $P = \{p_i\}_{i=1}^{n}$ and questions with $c$ turns $Q = \{Q_t\}_{t=1}^{c}$, machines need to give the corresponding answers $A = \{A_t\}_{t=1}^{c}$, where the $t$-th question with $m$ tokens is $Q_t = \{q_{t,j}\}_{j=1}^{m}$. The neural network is required to model the probability distribution $p(A_t \mid Q_{\leq t}, P)$ for the $t$-th QA turn of the conversation. As shown in Figure 2, there are three main layers in our model: the contextual encoding layer, the interaction reasoning layer and the answer prediction layer. Our proposed cube is used in the middle layer. For convenience, we illustrate our model from bottom to top.
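To make the cube concrete, the following minimal sketch (illustrative sizes and variable names only, not the authors' code) shows the tensor layout implied by this formulation: one axis for QA turns, one for passage words, and one for hidden units.

```python
import torch

# Hypothetical sizes: c QA turns, a passage of n words, hidden dimension d.
c, n, d = 12, 300, 500
cube = torch.zeros(c, n, d)   # passage hidden states across the whole conversation

one_turn = cube[0]            # (n, d): every passage word within a single QA turn
one_word = cube[:, 0]         # (c, d): a single passage word across all QA turns
```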

Contextual Encoding Layer
The purpose of this layer is to extract useful information for the upper layers. We embed questions and passages separately into sequences of vectors with the contextualized model BERT (Devlin et al., 2019). Instead of fine-tuning BERT with extra scoring layers, we fix the weights of BERT, as in SDNet, and aggregate the $L$ hidden layers generated by BERT as the contextualized embedding of all BPE (Sennrich et al., 2016) tokens.
To introduce other word-level linguistic features and facilitate answer selection, we choose the first BPE token of each word to represent that word. Generally, the first token is often the root of the word and carries the main meaning of the whole word; owing to the bidirectional structure of BERT, it also contains information about the remaining tokens of the word. Besides, when a sequence exceeds the maximum input length of pre-trained BERT, we split it into shorter windows and combine them again afterwards.
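A minimal sketch of the first-token selection described above, using a made-up helper name and toy BPE pieces rather than the actual preprocessing code:

```python
# Given each word's BPE pieces, keep the index of its first piece so that
# word-level features can be aligned with that token's BERT hidden states.
def first_piece_indices(bpe_pieces_per_word):
    indices, offset = [], 0
    for pieces in bpe_pieces_per_word:
        indices.append(offset)        # the first sub-token represents the word
        offset += len(pieces)
    return indices

# Toy example: a 3-piece word followed by a 1-piece word.
print(first_piece_indices([["conver", "##sat", "##ional"], ["reading"]]))  # [0, 3]
```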
In detail, suppose $h_i^l \in \mathbb{R}^{d}$ is the $l$-th hidden layer of the first BPE token of the $i$-th word. We collapse all hidden layers generated by BERT into a single vector for each word, following ELMo (Peters et al., 2018). The contextualized embedding of the $i$-th word is $e_i = \gamma \sum_{l=0}^{L} \alpha_l h_i^l$, where $\gamma$ scales the vector and $\alpha_l$ is the softmax-normalized weight of the $l$-th layer; both are trainable. To be consistent with the number of question turns, the passage representations are duplicated along the turn dimension. We then concatenate these features and embeddings into $r^{P}_{t,i}$ for passages and employ a bidirectional RNN to refine the questions into $r^{Q}_{t,j}$.
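The aggregation $e_i = \gamma \sum_{l=0}^{L} \alpha_l h_i^l$ can be sketched as follows; this is a minimal PyTorch illustration with assumed tensor shapes, not the released implementation.

```python
import torch
import torch.nn.functional as F


class LayerAggregator(torch.nn.Module):
    """ELMo-style trainable weighted sum over the L+1 BERT hidden layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.zeros(num_layers))  # one logit per layer
        self.gamma = torch.nn.Parameter(torch.ones(1))            # global scaling factor

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, seq_len, d), the per-word first-token vectors
        weights = F.softmax(self.alpha, dim=0)                     # softmax-normalized alpha_l
        mixed = torch.einsum("l,lsd->sd", weights, hidden_states)  # sum_l alpha_l * h^l
        return self.gamma * mixed                                  # e = gamma * weighted sum


# Usage sketch: 24 layers + the embedding layer, BERT-Large hidden size 1024.
agg = LayerAggregator(num_layers=25)
e = agg(torch.randn(25, 10, 1024))   # (10, 1024) word embeddings
```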

Interaction Reasoning Layer
This layer plays an important role in our model. It aims to further incorporate question information into the passage representation and to reason from different perspectives with our proposed convolutional cube. The cube represents the hidden states of the passage throughout a conversation. We describe these perspectives (illustrated in Figure 3) in the order ① to ⑥ of Figure 2. To consider the global context of each turn in addition to local information across different dimensions, Perspective I, equipped with RNN, is inserted before the other CNN perspectives. We first observe the cube from Perspective I and feed the hidden states of the cube $r^{P}_{t,i}$ to a bidirectional RNN for each turn of the conversation: $c^{P}_{t,i} = \mathrm{BiRNN}(c^{P}_{t,i-1}, r^{P}_{t,i})$. Then the cube is viewed from Perspective II along the QA turns for each word separately. Since the $(t{+}1)$-th turn of information cannot be used when processing the $t$-th turn, we apply 1D causal convolution (Oord et al., 2016) to the cube by moving the padding from the end to the beginning, updating the representation of the cube from $c^{P}_{t,i}$ into $\bar{c}^{P}_{t,i}$. After being viewed from these two perspectives (① ② in Figure 2), the hidden state of every passage word grasps information from two dimensions of the cube.
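The "padding moved to the beginning" trick can be sketched as below; this is an assumed PyTorch rendering of 1D causal convolution over the turn axis, not the authors' code (kernel size 5 follows the implementation details).

```python
import torch
import torch.nn.functional as F


class CausalConv1d(torch.nn.Module):
    """1D convolution along the QA-turn axis whose padding is placed entirely
    on the left (the past), so turn t never sees any turn after t."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch of words, channels = hidden size, length = QA turns)
        x = F.pad(x, (self.left_pad, 0))   # pad only before the first turn
        return self.conv(x)                # output length equals the number of turns


# Usage sketch: 300 passage words, hidden size 250, 12 QA turns.
y = CausalConv1d(channels=250)(torch.randn(300, 250, 12))   # (300, 250, 12)
```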
Next, we observe the cube from Perspective I again to fuse previous hidden states and generate the global context $\hat{c}^{P}_{t,i}$ for each turn of the conversation. To reason over more dimensions simultaneously, a 2D CNN is utilized from Perspective III-1 to generate the hidden states $h^{P}_{t,i}$ of the cube along both the QA-turn and passage-word dimensions. Different from other models, three kinds of information can thus be considered comprehensively: the same word in different QA turns, different words in the same QA turn, and different words in different QA turns. Similar to the 1D CNN above, the 2D CNN also has to be unidirectional along the QA-turn dimension to avoid information leakage, while it is more reasonable to capture bidirectional information along the passage-word dimension. We therefore extend traditional causal convolution partially to 2D CNN by moving the padding on only one dimension. These two perspectives (③ ④ in Figure 2) further strengthen the representation of our cube.
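The partially causal 2D convolution can be sketched in the same spirit: causal (left-only) padding on the turn axis and ordinary symmetric padding on the word axis. Again this is an assumed illustration, with kernel size 3 as in the implementation details.

```python
import torch
import torch.nn.functional as F


class TurnCausal2dConv(torch.nn.Module):
    """2D convolution over (QA turns, passage words): causal padding on the
    turn axis, symmetric padding on the word axis."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.turn_pad = kernel_size - 1            # all turn padding on the past side
        self.word_pad = (kernel_size - 1) // 2     # symmetric padding over words
        self.conv = torch.nn.Conv2d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, turns, words); F.pad order is (word_l, word_r, turn_l, turn_r)
        x = F.pad(x, (self.word_pad, self.word_pad, self.turn_pad, 0))
        return self.conv(x)                        # same (turns, words) size as the input
```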
For the questions in this layer, we pass them as input to another RNN for reasoning: $h^{Q}_{t,j} = \mathrm{BiRNN}(h^{Q}_{t,j-1}, r^{Q}_{t,j})$. Then we employ an attention score function to integrate the new question information into the passages.
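As a rough illustration of integrating question information into the passage, the sketch below uses a plain dot-product score; the paper's exact attention score function is not reproduced here.

```python
import torch
import torch.nn.functional as F


def integrate_question(passage: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
    """Attend from passage words to question tokens and append the summarized
    question information to each passage word (a hedged sketch)."""
    # passage: (n, d), question: (m, d)
    scores = passage @ question.t()                 # (n, m) word-to-token scores
    weights = F.softmax(scores, dim=-1)             # normalize over question tokens
    summary = weights @ question                    # question info per passage word
    return torch.cat([passage, summary], dim=-1)    # fused representation (n, 2d)
```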
As shown in Figure 2, we repeat the process of ③ ④ in ⑤ ⑥ for deeper understanding and reasoning, obtaining $\tilde{h}^{P}_{t,i}$ from the RNN of Perspective I and the 2D CNN of Perspective III-1. We use self-attention to enhance the current passage representation, yielding $h^{\mathrm{self}}_{t,i}$. At last, we view the cube from Perspective I again to synthesize the global information: $\hat{h}^{P}_{t,i} = \mathrm{BiRNN}(\hat{h}^{P}_{t,i-1}, [h^{P}_{t,i}; \tilde{h}^{P}_{t,i}; h^{\mathrm{self}}_{t,i}])$.
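A minimal sketch of one common self-attention form over the passage words of a turn; the concrete score function used in the paper is not shown in this excerpt, so this is only indicative.

```python
import torch
import torch.nn.functional as F


def self_attention(h: torch.Tensor) -> torch.Tensor:
    # h: (words, hidden) for one QA turn
    scores = h @ h.t()                   # pairwise word-to-word similarity
    weights = F.softmax(scores, dim=-1)  # normalize over the attended words
    return weights @ h                   # h_self: each word summarizes the others
```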

Answer Prediction Layer
This layer is the top one of our model. We use methods similar to previous work (Chen et al., 2017; Huang et al., 2019) to predict the position of the answer in the passage. We project the question representation into one vector for each turn of dialogue, $\hat{h}^{Q}_{t} = \sum_{j=1}^{m} a_{t,j} h^{Q}_{t,j}$, where $a_{t,j} = \exp(W h^{Q}_{t,j}) / \sum_{k=1}^{m} \exp(W h^{Q}_{t,k})$ and $W$ is trainable. Then two different bilinear attention functions are used to estimate the probabilities of the start and end positions from $\hat{h}^{P}_{t,i}$ and $\hat{h}^{Q}_{t}$, and we choose the position with the maximum product of these two probabilities as the best span. For other answer types, such as yes, no and unknown, we condense the passage representation $\hat{h}^{P}_{t,i}$ into $\hat{h}^{P}_{t}$ in the same way as for questions and classify the answer according to $[\hat{h}^{P}_{t}; \hat{h}^{Q}_{t}]$. To train the cube, we minimize the sum of the negative log probabilities of the ground-truth start position, end position and answer type under the predicted distributions.
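A compact sketch of this prediction head: trainable attention pooling of the question, then two bilinear scorers for the start and end positions. Names and shapes are illustrative assumptions, not the released code.

```python
import torch
import torch.nn.functional as F


class SpanPredictor(torch.nn.Module):
    """Pool the question into one vector and score start/end positions."""

    def __init__(self, d: int):
        super().__init__()
        self.w = torch.nn.Linear(d, 1, bias=False)            # a_{t,j} logits
        self.start = torch.nn.Bilinear(d, d, 1, bias=False)   # start-position scorer
        self.end = torch.nn.Bilinear(d, d, 1, bias=False)     # end-position scorer

    def forward(self, h_passage, h_question):
        # h_passage: (n_words, d), h_question: (m_tokens, d)
        a = F.softmax(self.w(h_question).squeeze(-1), dim=0)  # attention over question tokens
        q = (a.unsqueeze(-1) * h_question).sum(dim=0)         # pooled question vector
        q_rep = q.expand_as(h_passage).contiguous()           # broadcast to every passage word
        p_start = F.softmax(self.start(h_passage, q_rep).squeeze(-1), dim=0)
        p_end = F.softmax(self.end(h_passage, q_rep).squeeze(-1), dim=0)
        return p_start, p_end
```

The best span would then be the pair (i, j) with i ≤ j that maximizes p_start[i] * p_end[j].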

Data and Metric
We conduct our experiments on CoQA (Reddy et al., 2019), a large-scale CMRC dataset annotated by humans. It consists of 127k questions with answers, collected from 8k conversations over text passages. As shown in Table 1, it covers seven diverse domains (five in-domain and two out-of-domain), and the out-of-domain passages only appear in the test set. Following the official evaluation, the F1 score is used as the metric, which measures the word-level overlap between the prediction and the ground truth.
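For reference, a simplified word-overlap F1 in the spirit of the official metric; the real CoQA script additionally normalizes the text and averages over multiple reference answers.

```python
from collections import Counter


def word_f1(prediction: str, ground_truth: str) -> float:
    """Word-level overlap F1 between a predicted and a reference answer (simplified)."""
    pred, gold = prediction.split(), ground_truth.split()
    common = Counter(pred) & Counter(gold)      # per-word overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(round(word_f1("in a grassy meadow", "a big grassy meadow"), 3))  # 0.75
```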

Implementation Details
We use the pre-trained BERT-Large model for contextualized embeddings, whose dimension is 1024, and apply spaCy for tokenization, part-of-speech tagging and named entity recognition. The answer of the previous turn is added to the next turn as guidance, following the dataset setting. Each batch contains one cube, i.e., one conversation. We employ LSTM as the RNN structure, with a hidden size of 250 throughout our model. The kernel size is set to 5 and 3 for the 1D and 2D CNN, respectively, and the dropout rate is set to 0.4. Adamax (Kingma and Ba, 2015) is used as our optimizer with a learning rate of 0.1.
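The hyperparameters listed above, gathered into a plain dictionary for quick reference; this is not the authors' configuration file, and the field names are made up.

```python
CONFIG = {
    "bert_model": "bert-large",      # fixed weights, embeddings only (d = 1024)
    "rnn": "LSTM",
    "rnn_hidden_size": 250,
    "conv1d_kernel_size": 5,
    "conv2d_kernel_size": 3,
    "dropout": 0.4,
    "optimizer": "Adamax",
    "learning_rate": 0.1,
    "batch": "one conversation (one cube) per batch",
}
```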

Result
We compare our MC^2 with other published baseline models in Table 1: PGNet (See et al., 2017), DrQA (Chen et al., 2017), DrQA+PGNet (Reddy et al., 2019), Augmented DrQA (Reddy et al., 2019), BiDAF++ (Yatskar, 2019), FlowQA (Huang et al., 2019) and SDNet. We only consider published models on CoQA; although some models have recently performed better on the leaderboard, they usually focus on fine-tuning BERT. In Table 1, the SDNet results come from the original author's experiments, and SDNet* corresponds to Fig. 2 in the original paper. Our model achieves significant improvement over these published models. Compared with the previous state-of-the-art model, SDNet, which also takes pre-trained BERT as embeddings without fine-tuning, our model gains 3.2% on F1 score. In particular, our single model surpasses the ensemble models of both FlowQA and SDNet. Figure 4 shows the gap between in-domain and out-of-domain performance on the test set. Although all models perform worse on the out-of-domain data than on the in-domain data, our model drops only 3.8% on F1 score, the smallest gap among all models, which demonstrates its strong generalization ability. Besides, our model achieves the best performance on both the in-domain and out-of-domain sets.
The learning curve in Figure 5 shows the performance of the models on the development set over training epochs. We can observe that our model surpasses SDNet at every epoch; it outperforms all baseline models after only 5 epochs and achieves its best performance after 18 epochs. Notably, our model reaches 72.472% F1 after the first epoch alone, which is about 10% to 20% higher than SDNet. Thus our model performs well even with few training epochs.

Ablation Studies
To study how each perspective of our proposed cube contributes to the performance, we conduct an ablation analysis on the development set in Table 2. The results show that removing all CNN perspectives of the cube, i.e., ② ④ ⑥ in Figure 2, causes a substantial performance drop (3.90% on F1 score), while removing any single one of them only results in a marginal decrease. It is clear that the improvement from reading different perspectives simultaneously is larger than the sum of the improvements from reading each perspective separately. Besides, replacing the 2D CNN (Perspective III-1) with a 1D CNN (Perspective II) also causes a significant decline in performance (0.79% on F1 score). We also explore a 3D CNN (Perspective III-2), but it brings no further improvement.

Conclusion
In this paper, we introduce the Multi-perspective Convolutional Cube (MC^2), a novel model for conversational machine reading comprehension. The cube is viewed from different perspectives to fully understand the history of the conversation. By integrating CNN with RNN, fusing 1D and 2D convolutions, and partially extending causal convolution to 2D, our model achieves the best results among published models on the CoQA dataset without fine-tuning BERT. In future work, we will further study the capability of our approach on other datasets and tasks.