Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering

We introduce a novel approach to transformers that learns hierarchical representations in multiparty dialogue. First, three language modeling tasks, token-level language modeling, utterance-level language modeling, and utterance order prediction, are used to pre-train the transformers so that they learn both token and utterance embeddings for a better understanding of dialogue contexts. Then, multi-task learning between utterance prediction and token span prediction is applied to fine-tune the models for span-based question answering (QA). Our approach is evaluated on the FriendsQA dataset and shows improvements of 3.8% and 1.4% over two state-of-the-art transformer models, BERT and RoBERTa, respectively.

Several limitations can be expected when language models trained on general domains are used to process dialogue. First, most of these models are pre-trained on formal writing, which is notably different from the colloquial writing in dialogue; thus, fine-tuning for the end tasks is often not sufficient to build robust dialogue models. Second, unlike sentences in a wiki or news article written by one author with a coherent topic, utterances in a dialogue come from multiple speakers who may talk about different topics in distinct manners, so they should not be represented by simple concatenation but rather as sub-documents interconnected with one another.
This paper presents a novel approach to the latest transformers that learns hierarchical embeddings for tokens and utterances for a better understanding of dialogue contexts. While fine-tuning for span-based QA, every utterance as well as the question is separately encoded, and multi-head attentions and additional transformers are built on the token and utterance embeddings, respectively, to provide a more comprehensive view of the dialogue to the QA model. As a result, our model achieves a new state-of-the-art result on a span-based QA task where the evidence documents are multiparty dialogues. The contributions of this paper are:
• New pre-training tasks are introduced to improve the quality of both the token-level and utterance-level embeddings generated by the transformers, making them better suited to dialogue contexts (§2.1).
• A new multi-task learning approach is proposed to fine-tune the language model for span-based QA, taking full advantage of the hierarchical embeddings created during pre-training (§2.2).
• Our approach significantly outperforms the previous state-of-the-art models using BERT and RoBERTa on a span-based QA task using dialogues as evidence documents (§3).

Transformers for Learning Dialogue
This section introduces a novel approach for pre-training (Section 2.1) and fine-tuning (Section 2.2) transformers to effectively learn dialogue contexts. Our approach is evaluated with two kinds of transformers, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and shows significant improvement on a question answering (QA) task on multiparty dialogue (Section 3).

Pre-training Language Models
Pre-training involves three tasks in sequence, the token-level masked language modeling (MLM; §2.1.1), the utterance-level MLM (§2.1.2), and the utterance order prediction (§2.1.3), where the trained weights from each task are transferred to the next task. Note that the weights of publicly available transformer encoders are adapted to train the token-level MLM, which allows our QA model to handle the language in both the dialogues, used as evidence documents, and the questions, written in formal writing. Transformers from BERT and RoBERTa are trained with static and dynamic MLM, respectively.

Token-level Masked LM
Let $D = \{U_1, \ldots, U_m\}$ be a dialogue, where $U_i = \{s_i, w_{i1}, \ldots, w_{in}\}$ is the i'th utterance in $D$, $s_i$ is the speaker of $U_i$, and $w_{ij}$ is the j'th token in $U_i$.
All speakers and tokens in $D$ are appended in order with the special token CLS, representing the entire dialogue, which creates the input string sequence $I = \{\text{CLS}\} \oplus U_1 \oplus \cdots \oplus U_m$. For every token $w_{ij}$, $I^{\mu}_{ij}$ denotes the copy of $I$ in which the masked token $\mu_{ij}$ is substituted in place of $w_{ij}$. $I^{\mu}_{ij}$ is then fed into the transformer encoder (TE), which generates a sequence of embeddings $\{e^c\} \oplus E_1 \oplus \cdots \oplus E_m$, where $E_i = \{e^s_i, e^w_{i1}, \ldots, e^w_{in}\}$ is the embedding list for $U_i$, and $(e^c, e^s_i, e^w_{ij}, e^{\mu}_{ij})$ are the embeddings of $(\text{CLS}, s_i, w_{ij}, \mu_{ij})$, respectively. Finally, $e^{\mu}_{ij}$ is fed into a softmax layer that generates the output vector $o^{\mu}_{ij} \in \mathbb{R}^{|V|}$ to predict $\mu_{ij}$, where $V$ is the vocabulary of the dataset.
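The input construction for the token-level MLM can be sketched as follows. This is a minimal illustration, not the authors' implementation: the special-token strings, the `(speaker, tokens)` representation, and the function names are assumptions made for the example, and real models would operate on subword IDs rather than strings.

```python
from typing import List, Tuple

# Assumed special-token strings; the actual strings depend on the tokenizer.
CLS, MASK = "[CLS]", "[MASK]"

# An utterance U_i is modeled as (speaker s_i, tokens w_i1..w_in).
Utterance = Tuple[str, List[str]]


def build_dialogue_input(dialogue: List[Utterance]) -> List[str]:
    """Build I: CLS followed by every speaker and token of D, in order."""
    sequence = [CLS]
    for speaker, tokens in dialogue:
        sequence.append(speaker)
        sequence.extend(tokens)
    return sequence


def mask_token(dialogue: List[Utterance], i: int, j: int) -> Tuple[List[str], str]:
    """Return (masked sequence, target) with token w_ij replaced by MASK.

    Indices i and j are 0-based here, unlike the 1-based notation in the text.
    """
    sequence = [CLS]
    target = None
    for ui, (speaker, tokens) in enumerate(dialogue):
        sequence.append(speaker)
        for wj, token in enumerate(tokens):
            if ui == i and wj == j:
                sequence.append(MASK)
                target = token
            else:
                sequence.append(token)
    if target is None:
        raise IndexError("no token at the given utterance/token index")
    return sequence, target


dialogue = [("Ross", ["hi", "there"]), ("Rachel", ["hey"])]
full_input = build_dialogue_input(dialogue)
masked_input, target = mask_token(dialogue, 0, 1)
```

In this sketch a single sequence covers the whole dialogue, so the masked token can attend to tokens from any utterance.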

Utterance-level Masked LM
The token-level MLM (t-MLM) learns attention among all tokens in $D$ regardless of the utterance boundaries, allowing the model to compare every token to a broad context; however, it fails to capture unique aspects of individual utterances that can be important in dialogue. To learn an embedding for each utterance, the utterance-level MLM model is trained (Figure 1(b)). Utterance embeddings can be used independently and/or in sequence to match contexts in the question and the dialogue beyond the token level, showing an advantage in finding utterances with the correct answer spans (§2.2.1).
For every utterance $U_i$, the masked input sequence $I^{\mu}_{ij}$ is created from $\{\text{CLS}_i\} \oplus U_i$ by substituting $\mu_{ij}$ in place of $w_{ij}$. Note that $\text{CLS}_i$ now represents $U_i$ instead of $D$, and $I^{\mu}_{ij}$ is much shorter than the one used for t-MLM. $I^{\mu}_{ij}$ is fed into TE, already trained by t-MLM, and the embedding sequence $E_i = \{e^c_i, e^s_i, e^w_{i1}, \ldots, e^w_{in}\}$ is generated. Finally, $e^c_i$, instead of $e^{\mu}_{ij}$, is fed into a softmax layer that generates $o^{\mu}_{ij}$ to predict $\mu_{ij}$. The intuition behind the utterance-level MLM is that once $e^c_i$ learns enough content to accurately predict any token in $U_i$, it contains the most essential features of the utterance; thus, $e^c_i$ can be used as the embedding of $U_i$.
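The difference from the token-level setup can be made concrete with a small sketch of how utterance-level training examples might be assembled. The token strings and the `read_position` field are illustrative assumptions; the point is only that each input covers one utterance and that the prediction is read from the CLS position (index 0) rather than from the masked position.

```python
from typing import List, Tuple

# Assumed special-token strings; the actual strings depend on the tokenizer.
CLS, MASK = "[CLS]", "[MASK]"

# (input sequence, target token, position whose embedding feeds the softmax)
Example = Tuple[List[str], str, int]


def utterance_mlm_examples(speaker: str, tokens: List[str]) -> List[Example]:
    """Create one masked example per token of a single utterance U_i."""
    examples = []
    for j, token in enumerate(tokens):
        masked = tokens[:j] + [MASK] + tokens[j + 1:]
        # The CLS_i embedding at index 0 is used to predict the masked token,
        # which pushes utterance-level content into that single vector.
        examples.append(([CLS, speaker] + masked, token, 0))
    return examples


examples = utterance_mlm_examples("Joey", ["how", "you", "doin"])
```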

Utterance Order Prediction
The embedding $e^c_i$ from the utterance-level MLM (u-MLM) learns contents within $U_i$, but not across other utterances. In dialogue, it is often the case that a context is completed by multiple utterances; thus, learning attention among the utterances is necessary. To create embeddings that contain cross-utterance features, the utterance order prediction model is trained (Figure 1(c)). Let $D = D_1 \oplus D_2$, where $D_1$ and $D_2$ comprise the first and the second halves of the utterances in $D$, respectively. Also, let $D' = D_1 \oplus D'_2$, where $D'_2$ contains the same set of utterances as $D_2$ although the ordering may be different. The task is to predict whether or not $D'$ preserves the same order of utterances as $D$.
For each $U_i \in D'$, the input $I_i = \{\text{CLS}_i\} \oplus U_i$ is created and fed into TE, already trained by u-MLM, to create the embeddings $E_i = \{e^c_i, e^s_i, e^w_{i1}, \ldots, e^w_{in}\}$. The sequence $E^c = \{e^c_1, \ldots, e^c_m\}$ is fed into two transformer layers, TL1 and TL2, that generate the new utterance embedding list $T^c = \{t^c_1, \ldots, t^c_m\}$. Finally, $T^c$ is fed into a softmax layer that generates $o^{\nu} \in \mathbb{R}^2$ to predict whether or not $D'$ is in order.
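A training example for utterance order prediction could be generated along the following lines. This is a sketch under stated assumptions: the text does not say how odd-length dialogues are halved or how the second half is permuted, so the floor split, the rotation fallback, and the label convention (1 = in order) are choices made for the example.

```python
import random
from typing import List, Sequence, Tuple


def make_uop_example(
    utterances: Sequence[str], shuffle: bool, rng: random.Random
) -> Tuple[List[str], int]:
    """Return (D', label), where label is 1 if D' keeps the order of D, else 0."""
    half = len(utterances) // 2  # assumption: floor split for odd lengths
    first, second = list(utterances[:half]), list(utterances[half:])
    if shuffle and len(second) > 1:
        permuted = second[:]
        rng.shuffle(permuted)
        if permuted == second:
            # Shuffling can return the original order; rotate by one instead.
            permuted = second[1:] + second[:1]
        second = permuted
    candidate = first + second
    label = 1 if candidate == list(utterances) else 0
    return candidate, label


rng = random.Random(0)
dialogue = ["u1", "u2", "u3", "u4"]
in_order, in_order_label = make_uop_example(dialogue, shuffle=False, rng=rng)
reordered, reordered_label = make_uop_example(dialogue, shuffle=True, rng=rng)
```

Only the second half is permuted, so the first half always provides an intact context against which the ordering of the remaining utterances is judged.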

Fine-tuning for QA on Dialogue
Fine-tuning exploits multi-task learning between the utterance ID prediction (§2.2.1) and the token span prediction (§2.2.2), which allows the model to train both the utterance- and token-level attentions. The transformer encoder (TE) trained by the utterance order prediction (UOP) is used for both tasks. Given the question $Q = \{q_1, \ldots, q_n\}$ ($q_i$ is the i'th token in $Q$) and the dialogue $D = \{U_1, \ldots, U_m\}$, $Q$ and every $U_i$ are fed into TE, which generates $E_q = \{e^c_q, e^q_1, \ldots, e^q_n\}$ and $E_i = \{e^c_i, e^s_i, e^w_{i1}, \ldots, e^w_{in}\}$ for $Q$ and every $U_i$, respectively.

Utterance ID Prediction
The utterance embedding list $E^c = \{e^c_q, e^c_1, \ldots, e^c_m\}$ is fed into TL1 and TL2 from UOP, which generate $T^c = \{t^c_q, t^c_1, \ldots, t^c_m\}$. $T^c$ is then fed into a softmax layer that generates $o^u \in \mathbb{R}^{m+1}$ to predict the ID of the utterance containing the answer span if it exists; otherwise, the 0'th label is predicted, implying that the answer span for $Q$ does not exist in $D$.

Token Span Prediction
For every utterance $U_i$, the token embeddings of $Q$ and $U_i$ are fed into a multi-head attention layer (MHA; Vaswani et al., 2017), which then generates the attended embedding sequences, $T^a_1, \ldots, T^a_m$, where $T^a_i = \{t^s_i, t^w_{i1}, \ldots, t^w_{in}\}$. Finally, each $T^a_i$ is fed into two softmax layers, SL and SR, that generate $o^{\ell}_i \in \mathbb{R}^{n+1}$ and $o^r_i \in \mathbb{R}^{n+1}$ to predict the leftmost and the rightmost tokens in $U_i$, respectively, that yield the answer span for $Q$. It is possible that answer spans are predicted in multiple utterances, in which case the span from the utterance that has the highest score for the utterance ID prediction is selected, which is more efficient than the typical dynamic programming approach.
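The final selection step, picking the span from the utterance with the highest utterance ID score, can be sketched as below. The function names and the plain-list score format are hypothetical, every utterance is assumed to have left and right scores, and restricting the right boundary to positions at or after the left boundary is an added assumption that the text does not state.

```python
from typing import List, Optional, Sequence, Tuple


def argmax(scores: Sequence[float], start: int = 0) -> int:
    """Index of the highest score at or after `start` (first one on ties)."""
    return max(range(start, len(scores)), key=lambda k: scores[k])


def select_span(
    utterance_scores: Sequence[float],
    left_scores: List[Sequence[float]],
    right_scores: List[Sequence[float]],
) -> Optional[Tuple[int, int, int]]:
    """Return (utterance ID, left position, right position), or None.

    utterance_scores has m + 1 entries; index 0 is the "no answer" label.
    left_scores[i] and right_scores[i] hold the scores for utterance i + 1.
    """
    best = argmax(utterance_scores)
    if best == 0:
        return None  # the model predicts that D contains no answer span
    left = argmax(left_scores[best - 1])
    # Assumption: the right boundary may not precede the left boundary.
    right = argmax(right_scores[best - 1], start=left)
    return best, left, right


answer = select_span(
    [0.1, 0.2, 0.7],
    [[0.9, 0.1], [0.1, 0.6, 0.3]],
    [[0.2, 0.8], [0.5, 0.1, 0.4]],
)
no_answer = select_span([0.8, 0.1, 0.1], [[0.5, 0.5]] * 2, [[0.5, 0.5]] * 2)
```

Because the utterance is chosen first, only one pair of boundary distributions has to be searched, rather than comparing candidate spans across all utterances.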

Corpus
Despite all the great work in QA, only two datasets are publicly available for machine comprehension that take dialogues as evidence documents. One is DREAM, comprising dialogues for language exams with multiple-choice questions (Sun et al., 2019). The other is FRIENDSQA, containing transcripts from the TV show Friends with annotation for span-based question answering (Yang and Choi, 2019). Since DREAM is for a reading comprehension task that does not need to find the answer contents from the evidence documents, it is not suitable for our approach; thus, FRIENDSQA is chosen.
Each scene is treated as an independent dialogue in FRIENDSQA. Yang and Choi (2019) randomly split the corpus to generate training, development, and evaluation sets such that scenes from the same episode can be distributed across those three sets, causing inflated accuracy scores. Thus, we re-split them by episodes to prevent such inflation. For fine-tuning (§2.2), episodes from the first four seasons are used as described in Table 1. For pre-training (§2.1), all transcripts from Seasons 5-10 are used as an additional training set.
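Splitting by episode rather than by scene can be illustrated with a short sketch. The scene records, the field names, and the choice of which episodes go to which set are all hypothetical here; the actual assignment is the one given in Table 1.

```python
from typing import Dict, List, Set, Tuple

Scene = Dict[str, int]           # hypothetical record: season, episode, scene
EpisodeKey = Tuple[int, int]     # (season, episode)


def split_by_episode(
    scenes: List[Scene], dev_episodes: Set[EpisodeKey], test_episodes: Set[EpisodeKey]
) -> Dict[str, List[Scene]]:
    """Assign every scene to the split of its episode, never scene by scene."""
    splits: Dict[str, List[Scene]] = {"train": [], "dev": [], "test": []}
    for scene in scenes:
        key = (scene["season"], scene["episode"])
        if key in test_episodes:
            splits["test"].append(scene)
        elif key in dev_episodes:
            splits["dev"].append(scene)
        else:
            splits["train"].append(scene)
    return splits


def episodes_in(split: List[Scene]) -> Set[EpisodeKey]:
    """Set of episodes that contribute at least one scene to a split."""
    return {(scene["season"], scene["episode"]) for scene in split}


scenes = [
    {"season": 1, "episode": e, "scene": s} for e in (1, 2, 3) for s in (1, 2)
]
splits = split_by_episode(scenes, dev_episodes={(1, 2)}, test_episodes={(1, 3)})
```

With this scheme no episode contributes scenes to more than one set, which is the property the re-split is meant to guarantee.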

Models
The weights from the BERT_base and RoBERTa_base models (Devlin et al., 2019; Liu et al., 2019) are transferred to all models in our experiments. Four baseline models, BERT, BERT_pre, RoBERTa, and RoBERTa_pre, are built, where all models are fine-tuned on the datasets in Table 1 and the *_pre models are pre-trained on the same datasets with the additional training set from Seasons 5-10 (§3.1). The baseline models are compared to BERT_our and RoBERTa_our, which are trained by our approach.
Table 2 shows the results achieved by all the models. Following Yang and Choi (2019), exact matching (EM), span matching (SM), and utterance matching (UM) are used as the evaluation metrics. Each model is trained three times, and the average score as well as the standard deviation are reported. The RoBERTa* models generally perform better than the BERT* models; however, RoBERTa_base is pre-trained on larger datasets than BERT_base, including CC-NEWS (Nagel, 2016), OPENWEBTEXT (Gokaslan and Cohen, 2019), and STORIES (Trinh and Le, 2018), so results from those two types of transformers cannot be directly compared.

Model          EM            SM            UM
RoBERTa        52.6(±0.7)    68.2(±0.3)    80.9(±0.8)
RoBERTa_pre    52.6(±0.7)    68.6(±0.6)    81.7(±0.7)
RoBERTa_our    53.5(±0.7)    69.6(±0.8)    82.7(±0.5)
Table 2: Accuracies (± standard deviations) achieved by the BERT and RoBERTa models.
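As a quick arithmetic check, the gains implied by the RoBERTa rows of Table 2 can be computed directly. The sketch below copies the mean accuracies from the table and assumes the column order EM, SM, UM; the dictionary and function names are illustrative only.

```python
# Mean accuracies copied from the RoBERTa rows of Table 2, as (EM, SM, UM).
TABLE_2 = {
    "RoBERTa": (52.6, 68.2, 80.9),
    "RoBERTa_pre": (52.6, 68.6, 81.7),
    "RoBERTa_our": (53.5, 69.6, 82.7),
}
METRICS = ("EM", "SM", "UM")


def gain(model: str, baseline: str) -> dict:
    """Absolute difference in percentage points for each metric."""
    return {
        metric: round(ours - base, 1)
        for metric, ours, base in zip(METRICS, TABLE_2[model], TABLE_2[baseline])
    }


over_base = gain("RoBERTa_our", "RoBERTa")      # SM gain: 1.4
over_pre = gain("RoBERTa_our", "RoBERTa_pre")   # UM gain: 1.0
```

The SM gain over RoBERTa matches the 1.4% improvement reported in the text, and the UM gain over RoBERTa_pre matches the reported 1% boost.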

Results
The *_pre models show marginal improvement over their base models, implying that pre-training the language models on FRIENDSQA with the original transformers does not make much impact on this QA task. The models using our approach perform noticeably better than the baseline models, showing 3.8% and 1.4% improvements on SM over BERT and RoBERTa, respectively.
Why questions often span out to longer sequences and also require deeper inference to answer correctly than the other question types. Compared to the baseline models, our models show more well-rounded performance regardless of the question type.
The two dialogue-specific LM approaches, ULM and UOP, give very marginal improvement over the baseline models, which is rather surprising. However, they show good improvement when combined with UID, implying that pre-training language models may not be enough to enhance the performance by itself but can be effective when it is coupled with an appropriate fine-tuning approach. Since both ULM and UOP are designed to improve the quality of utterance embeddings, they are expected to improve the accuracy of UID as well. The improvement on UM is indeed encouraging, giving 2% and 1% boosts to BERT_pre and RoBERTa_pre, respectively, and consequently improving the other two metrics.

Error Analysis
As shown in Table 3, the major errors come from three types of questions, who, how, and why; thus, we select 100 dialogues associated with those question types for which our best model, RoBERTa_our, incorrectly predicts the answer spans. Specific examples are provided in Tables 12, 13, and 14 (§A.3). Following prior work, errors are grouped into six categories: entity resolution, paraphrase and partial match, cross-utterance reasoning, question bias, noise in annotation, and miscellaneous. Table 5 shows the error types and their ratios with respect to the question types. The two main error types are entity resolution and cross-utterance reasoning. The entity resolution error happens when many of the same entities are mentioned in multiple utterances. This error also occurs when the QA system is asked about a specific person but predicts the wrong person because many people appear across multiple utterances. The cross-utterance reasoning error often happens with the why and how questions, where the model relies mostly on pattern matching and predicts a span from the utterance following the matched pattern.

Conclusion
This paper introduces a novel transformer approach that effectively interprets hierarchical contexts in multiparty dialogue by learning utterance embeddings. Two language modeling approaches are proposed, utterance-level masked LM and utterance order prediction. Coupled with the joint inference between token span prediction and utterance ID prediction, these two language models significantly outperform two state-of-the-art transformer approaches, BERT and RoBERTa, on a span-based QA task called FriendsQA. We will evaluate our approach on other machine comprehension tasks using dialogues as evidence documents to further verify the generalizability of this work.