Fine-grained Post-training for Improving Retrieval-based Dialogue Systems

Retrieval-based dialogue systems achieve outstanding performance when they use pre-trained language models such as bidirectional encoder representations from transformers (BERT). In multi-turn response selection, BERT focuses on learning the relationship between a context with multiple utterances and the response. However, this training scheme is insufficient for capturing the relations between the individual utterances in the context, so the model does not fully understand the context flow that is required to select a response. To address this issue, we propose a new fine-grained post-training method that reflects the characteristics of multi-turn dialogue. Specifically, the model learns utterance-level interactions by training on every short context-response pair in a dialogue session. Furthermore, through a new training objective, utterance relevance classification, the model learns the semantic relevance and coherence between the dialogue utterances. Experimental results show that our model achieves a new state of the art by significant margins on three benchmark datasets. This suggests that the fine-grained post-training method is highly effective for the response selection task.


Introduction
Constructing a dialogue system that can naturally and consistently interact with humans is currently a popular research topic. There are two approaches to implementing a dialogue system: generation-based and retrieval-based methods. The latter aims to select the correct response from a set of response candidates. For multi-turn response selection, Lowe et al. (2015) first proposed leveraging an RNN to match the dialogue context with a response. Later, with the advent of the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017), multi-turn response selection models that use the attention mechanism were proposed (Zhou et al., 2018). Recently, pre-trained language models such as bidirectional encoder representations from transformers (BERT) have been applied to a variety of response selection models (Vig and Ramea, 2019; Lu et al., 2020; Gu et al., 2020) and have shown excellent performance.
Footnote 1: https://github.com/hanjanghoon/BERT_FP
Pre-trained language models are now widely used in several natural language processing areas, such as question answering and dialogue systems. One of the best pre-trained language models, BERT (Devlin et al., 2019), is first pre-trained on a large general-domain corpus and then fine-tuned to adapt to specific tasks. Since BERT is pre-trained on general data, its performance can be improved by post-training on domain-specific data. Some previous studies (Whang et al., 2020) proposed a post-training method that learns the domain data before fine-tuning for a task. In these studies, the models were post-trained on domain-specific task data with the same pre-training objectives as BERT: masked language model (MLM) and next sentence prediction (NSP).
To develop a new post-training method that is suitable for dialogue, we propose a simple but powerful fine-grained post-training method. The new post-training method has two learning strategies. The first is to train the model by dividing the entire dialogue into multiple short context-response pairs. The second is to train the model with a new objective called utterance relevance classification, which classifies the relation between given utterances and the target utterance into more fine-grained labels.
A dialogue consists of a context that includes multiple utterances and a response with one utterance. There are two advantages to learning the dialogue by dividing it into multiple short context-response pairs during post-training, rather than learning with the entire context-response pair. First, the model can learn the interactions between internal utterances, which are overlooked in previous training methods. Previous multi-turn response selection models focus on identifying the associated information between a context with multiple utterances and the response. To capture this information, BERT takes the whole context as input to represent the relationship between the context and the response, instead of gradually expanding and learning the relationships between the utterances inside the context. The relationship between the entire context and the response can be learned through self-attention. However, the relationships between the utterances in the dialogue are easily overlooked. To address this issue, we divide the entire dialogue into multiple short context-response pairs. Since each pair consists of internal utterances, the model can learn the utterance-level interactions. The second advantage is that the model can capture the relationships between the utterances more accurately. In general, the utterances that are related to the response are located close to the response. As short context-response pairs consist only of utterances that are close to the response, more fine-grained training is possible.
Another strategy of fine-grained post-training is that it involves using a new training objective that is called the utterance relevance classification (URC). In the case of the NSP used in BERT, the model distinguishes whether the target utterance is random or the next. As mentioned by Lan et al. (2020), the model trained with the NSP can easily learn the topic prediction that distinguishes the semantic meaning of the utterances. However, it lacks coherence prediction that distinguishes whether the selected utterance is consecutive. In the case of sentence ordering prediction (SOP) used in Lan et al. (2020), the coherence between the utterances is well learned because the order of the two sequences is trained. However, topic prediction is relatively insufficient because the two sequences are semantically similar. As it is important to distinguish between semantically similar utterances in the multi-turn dialogue and determine whether the selected utterances are consecutive, we propose URC, which classifies the target utterance into three categories (random, semantically similar, next) to learn the topics and coherence.
The contributions of our study are summarized as follows:
1. Through short context-response pair training during fine-grained post-training, the model effectively learns the interactions between internal utterances, which are easily overlooked in the existing methods. This significantly improves the performance of response selection.
2. By devising the new training objective, URC, we enhance the model's capability to measure both the semantic relevance and the coherence between utterances, which improves its ability to select the appropriate response.
We achieved state-of-the-art performance with a significant improvement on three benchmarks (Ubuntu, Douban, E-commerce). Specifically, our model achieved an absolute improvement in $R_{10}@1$ of 2.7%p, 0.6%p, and 9.4%p on Ubuntu Corpus V1, Douban Corpus, and E-commerce Corpus, respectively, in comparison to previous state-of-the-art methods. The results indicate the effectiveness and generality of the proposed method.

Related Work
The existing methods for building dialogue systems can be categorized into two groups: those with a retrieval-based approach (Chaudhuri et al., 2018; Tao et al., 2019; Yuan et al., 2019) and those with a generation-based approach (Zhou et al., 2018; Hosseini-Asl et al., 2020; Ham et al., 2020). Recent studies have focused on the multi-turn retrieval dialogue system, where the system selects the most appropriate response when a multi-turn dialogue context is provided. Lowe et al. (2015) proposed a new benchmark dataset called the Ubuntu internet relay chat (IRC) Corpus V1 and an RNN-based baseline model. Kadlec et al. (2015) suggested a dual encoder-based model that attempts to effectively encode the context and response by using LSTM and CNN as encoders.
With the advent of the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017), models such as the deep attention matching network (Zhou et al., 2018), which applied the attention mechanism to the response selection dialogue system, have been proposed. Chen and Wang (2019) adapted the natural language inference model to the response selection task. Tao et al. (2019) performed a deep interaction between the context and the response through multiple interaction blocks. Yuan et al. (2019) improved the performance by controlling the dialogue context information with a multi-hop selector.
Pre-trained language models have shown impressive performance in response selection (Lu et al., 2020; Gu et al., 2020; Whang et al., 2021; Xu et al., 2021). One of them, BERT, is a bidirectional transformer-based encoder that has multiple layers. We use the publicly available BERT base model, in which the number of layers, the number of attention heads, and the size of the hidden state are 12, 12, and 768, respectively.
There are a variety of training objectives for pre-trained language models. BERT uses two training objectives: MLM and NSP. In the former, 15% of the tokens are randomly masked and the model is trained to predict them. This objective aims for the model to learn the overall contextual representation of a given text. In the latter, the model is given two sequences of text, A and B, and is trained to determine whether sequence B is the sequence that follows sequence A. The model takes sequences A and B as input, separated by the special token [SEP], and uses segment embeddings of 0 for sequence A and 1 for sequence B. Then, by using the [CLS] token, the model predicts the relationship between sequences A and B. ALBERT (Lan et al., 2020) uses sentence ordering prediction (SOP) instead of NSP as a training objective. SOP distinguishes whether the order of sequences A and B is correct or whether they have been swapped.
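The difference between the NSP and SOP objectives can be made concrete with a small sketch of how training pairs might be built. This is an illustration only: the function names, the 50/50 sampling ratio, and the toy document structure (a list of documents, each a list of tokenized sentences) are assumptions for the example, not details taken from this paper.

```python
import random


def make_nsp_pair(docs, doc_idx, sent_idx, rng):
    """NSP sample: label 1 if seq_b is the true next sentence, 0 if it is random."""
    seq_a = docs[doc_idx][sent_idx]
    if rng.random() < 0.5:  # assumed 50/50 split between positives and negatives
        return seq_a, docs[doc_idx][sent_idx + 1], 1
    # Negative: a sentence drawn from a different document.
    other = rng.choice([i for i in range(len(docs)) if i != doc_idx])
    return seq_a, rng.choice(docs[other]), 0


def make_sop_pair(docs, doc_idx, sent_idx, rng):
    """SOP sample: label 1 if the two sentences are in order, 0 if swapped."""
    first = docs[doc_idx][sent_idx]
    second = docs[doc_idx][sent_idx + 1]
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


def pack(seq_a, seq_b):
    """Pack two token lists into a BERT-style input with segment ids."""
    tokens = ["[CLS]"] + seq_a + ["[SEP]"] + seq_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A, and the first [SEP]; segment 1 the rest.
    segment_ids = [0] * (len(seq_a) + 2) + [1] * (len(seq_b) + 1)
    return tokens, segment_ids
```

An NSP negative differs from its positive in topic, whereas an SOP negative differs only in order. This is the gap that the three-way objective proposed in this paper is meant to close.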
The post-training method, which helps the model understand a certain domain, was introduced in the response selection task (Whang et al., 2020; Gu et al., 2020; Humeau et al., 2020; Whang et al., 2021; Xu et al., 2021). In addition to domain adaptation, the post-training method has the advantage of data augmentation because it learns the relationship between two sequences in the dialogue session with the NSP. However, the method does not reflect conversational characteristics because it merely follows BERT's pre-training method. To address this issue, we propose a novel post-training method that is suitable for multi-turn dialogue. The proposed method achieves better performance than the previous post-training methods.

Problem Formalization
Suppose that a dataset $D = \{(c_i, r_i, y_i)\}_{i=1}^{N}$ is a set of $N$ triples that consist of the context $c_i$, response $r_i$, and ground truth label $y_i$. The context is a sequence of utterances, $c_i = \{u_1, u_2, \ldots, u_M\}$, where $M$ is the maximum context length. The $j$-th utterance $u_j = \{w_{j,1}, w_{j,2}, \ldots, w_{j,L}\}$ contains $L$ tokens, where $L$ is the maximum sequence length. Each response $r_i$ is a single utterance. $y_i \in \{0, 1\}$ denotes the truth label of a given triple, where $y_i = 1$ indicates that $r_i$ is the correct response for the context $c_i$; otherwise, $y_i = 0$. The task is to learn a matching model $g(\cdot, \cdot)$ from $D$. For a given context-response pair $(c_i, r_i)$, the matching degree of $c_i$ and $r_i$ is obtained through $g(c_i, r_i)$.

Fine-tuning BERT for Response Selection
This study fine-tunes BERT for the response selection task as a binary classification problem that models the relationship between the context and the response. The input format $x$ of the existing BERT model is ([CLS], sequence A, [SEP], sequence B, [SEP]), where [CLS] and [SEP] are the CLS and SEP tokens, respectively. To measure the matching degree of a context-response pair, we construct the input by using the context as sequence A and the response as sequence B. In addition, the end-of-utterance token ([EOU]) is placed at the end of each utterance to distinguish the utterances in the context. The input format of BERT for response selection is as follows:
$$x = \text{[CLS]}\ u_1\ \text{[EOU]}\ \ldots\ u_M\ \text{[EOU]}\ \text{[SEP]}\ r\ \text{[SEP]} \quad (1)$$
$x$ subsequently becomes input representation vectors through the sum of the position, segment, and token embeddings. The transformer block in BERT calculates the cross attention between the input representations of the context and the response through the self-attention mechanism. Then, the final hidden vector of the first input token in BERT, $T_{[CLS]}$, is used as the aggregate representation of the context-response pair. The final score $g(c, r)$, which is the matching degree between the context and the response, is obtained by passing $T_{[CLS]}$ through a single-layer neural network:
$$g(c, r) = \sigma(W_{fine} T_{[CLS]} + b) \quad (2)$$
where $W_{fine}$ is a task-specific trainable parameter for fine-tuning, $b$ is a bias term, and $\sigma$ is the sigmoid function. Finally, the weights of the model are updated by using the cross-entropy loss function:
$$Loss = -\sum_{(c_i, r_i, y_i) \in D} \left[ y_i \log g(c_i, r_i) + (1 - y_i) \log\left(1 - g(c_i, r_i)\right) \right] \quad (3)$$
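As a rough illustration of the fine-tuning setup described above, the following sketch builds the context-response input with [EOU] separators and computes a matching score and its binary cross-entropy loss from a toy $T_{[CLS]}$ vector. The function names, the sigmoid output, the bias argument, and the toy vectors are assumptions made for the example; a real system would obtain $T_{[CLS]}$ from the BERT encoder.

```python
import math


def build_input(context, response):
    """Build tokens and segment ids: context utterances as A, response as B."""
    tokens = ["[CLS]"]
    for utterance in context:
        tokens += utterance + ["[EOU]"]  # mark the end of every context utterance
    tokens.append("[SEP]")
    segment_ids = [0] * len(tokens)
    tokens += response + ["[SEP]"]
    segment_ids += [1] * (len(response) + 1)
    return tokens, segment_ids


def matching_score(t_cls, w_fine, b=0.0):
    """Single-layer scorer on the [CLS] vector (sigmoid output is an assumption)."""
    z = sum(w * t for w, t in zip(w_fine, t_cls)) + b
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(score, label):
    """Binary cross-entropy for one context-response pair."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(score + eps)
             + (1 - label) * math.log(1 - score + eps))
```

For example, `build_input([["hi"], ["how", "are", "you"]], ["fine"])` places both context utterances in segment 0, each followed by [EOU], and the response in segment 1.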

Fine-grained Post-training
To improve the capability of selecting an appropriate response by effectively grasping multi-turn dialogue information, we propose a simple but powerful fine-grained post-training method, shown in Figure 1. The fine-grained post-training method has two learning strategies: the entire dialogue session is divided into multiple short context-response pairs, and URC is used as one of the training objectives. Through the former strategy, the model learns the interactions of the related internal utterances of the dialogue. Through URC, it learns the semantic relevance and coherence between the utterances.

Short Context-response Pair Training
We post-train the model by constructing multiple short context-response pairs using all utterances of the dialogue session to learn the utterance level interaction. We regard every utterance as a response and its previous k utterances as a short context. The short context contains fewer utterances than the average number of utterances in the dialogue sessions. Each short context-response pair is trained to learn the internal utterance interactions, eventually allowing the model to understand the relationship between all the utterances in a dialogue session. It also allows the model to learn the interaction of the utterances closely related to the response because the context is appropriately configured with a short length.
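The construction of short context-response pairs can be sketched as a sliding window over a dialogue session. The function name is hypothetical, and the text does not say how utterances with fewer than k predecessors are handled, so this sketch simply skips them.

```python
def short_context_response_pairs(session, k):
    """Split a session into (short_context, response) pairs with context length k.

    Each utterance that has at least k predecessors is treated as a response,
    and its k preceding utterances form the short context.
    """
    if k < 1:
        raise ValueError("context length k must be at least 1")
    return [(session[j:j + k], session[j + k])
            for j in range(len(session) - k)]


session = ["u1", "u2", "u3", "u4", "u5"]
pairs = short_context_response_pairs(session, 3)
# pairs holds (["u1", "u2", "u3"], "u4") and (["u2", "u3", "u4"], "u5")
```

Because every window yields its own training pair, a single session contributes several pairs, which is why the post-training set is much larger than the number of sessions.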

Utterance Relevance Classification
The NSP objective (Devlin et al., 2019) is inadequate for capturing the coherence between utterances because it mainly learns topical semantic relevance by distinguishing a random utterance from the next utterance. When SOP (Lan et al., 2020) is used as the objective, the ability to distinguish semantic relevance decreases because the model only learns the coherence of two utterances with a similar topic. To learn both the semantic relevance and the coherence in a dialogue, we propose a new training objective called utterance relevance classification (URC), shown in Figure 2. URC classifies the target utterance for a given short context into one of three labels. The first label is a random utterance. The second label is an utterance that is not the response but is randomly sampled from the same dialogue session. Although utterances from the same dialogue session have a similar topic to the correct response, they are inappropriate in terms of coherence. The third label is the correct response. The model learns topic prediction by distinguishing random utterances from correct responses, and it learns coherence prediction by distinguishing random utterances of the same dialogue session from correct responses. By classifying the relationship between the short context and the target utterance into three cases, the model can learn both the semantic relevance information and the coherence information of the dialogue session.
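A minimal sketch of how a single URC training sample could be drawn is given below. The label ids, the uniform choice among the three labels, and the function name are assumptions for illustration; the paper only specifies the three categories themselves.

```python
import random

# Hypothetical label ids for the three URC categories.
RANDOM, SAME_SESSION, NEXT = 0, 1, 2


def make_urc_sample(sessions, sess_idx, j, k, rng):
    """Return (short_context, target_utterance, label) for one URC sample."""
    session = sessions[sess_idx]
    short_context = session[j:j + k]
    response_pos = j + k
    label = rng.choice([RANDOM, SAME_SESSION, NEXT])  # assumed uniform sampling
    if label == NEXT:
        target = session[response_pos]
    elif label == SAME_SESSION:
        # Any utterance of the same session except the true response.
        candidates = [i for i in range(len(session)) if i != response_pos]
        target = session[rng.choice(candidates)]
    else:
        # An utterance drawn from a different session.
        other = rng.choice([i for i in range(len(sessions)) if i != sess_idx])
        target = rng.choice(sessions[other])
    return short_context, target, label
```

The same-session negatives are topically close to the response, so the classifier has to rely on coherence rather than topic alone to separate them from the true next utterance.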

Training Setup
An overview of the fine-grained post-training (FP) method is shown in Figure 1. First, given the conversation session $U_i = \{u_1, u_2, \ldots, u_M, u_{M+1} = r_i\}$, we select consecutive utterances and form a short context-response pair $S_j = \{u_j, u_{j+1}, \ldots, u_{j+k-1}, u_{j+k}\}$ with a context length of $k$. The model classifies the relationship between the short context $sc = \{u_j, u_{j+1}, \ldots, u_{j+k-1}\}$ and the given target utterance $u_t$. The target utterance can be one of three options: a random utterance $u_r$, a random utterance from the same dialogue session $u_s$, or the response $u_{j+k}$, where $1 \le s \le M + 1$ and $s \ne j + k$. We denote the input sequence $x$ for the fine-grained post-training as follows:
$$x = \text{[CLS]}\ u_j\ \text{[EOU]}\ \ldots\ u_{j+k-1}\ \text{[EOU]}\ \text{[SEP]}\ u_t\ \text{[SEP]} \quad (4)$$
As an aggregate representation, $T_{[CLS]}$ is used. The final score $g_{urc}(sc, u_t)$ is obtained by feeding $T_{[CLS]}$ through a single-layer perceptron, and the degree of relevance between the short context and the target utterance is obtained through the score. To calculate the URC loss, we use the cross-entropy loss, which is formulated as follows:
$$L_{URC} = -\sum \sum_{i=1}^{3} y_i \log\left(g_{urc}(sc, u_t)_i\right) \quad (5)$$
where $y_i$ is 1 if the target utterance belongs to the $i$-th of the three classes and 0 otherwise, and the outer sum runs over the training pairs. To train the proposed model, we use the MLM and URC together. For the MLM, unlike BERT, we apply the dynamic masking technique proposed in RoBERTa. The model can learn richer contextual representations because a random token is masked each time instead of a predetermined token. To optimize the model, we use the sum of the cross-entropy losses of the MLM and URC, which is formulated as follows:
$$L_{FP} = L_{MLM} + L_{URC} \quad (6)$$

Experiments

Datasets
We tested our model on widely used benchmarks that include Ubuntu Corpus V1, Douban Corpus, and the E-commerce Corpus. The statistics for the three datasets are presented in Table 1.

• Ubuntu Corpus: The Ubuntu IRC Corpus V1 (Lowe et al., 2015) is a publicly available domain-specific dialogue dataset that consists of chat logs on Ubuntu-related topics. In our study, the data proposed by Xu et al. (2017) are used. The data are preprocessed by replacing numbers, URLs, and system paths with special placeholders.
• Douban Corpus: The Douban Corpus (Wu et al., 2017) is a Chinese open-domain dataset from the Douban group, which is a popular social networking service. It consists of dyadic dialogues (i.e., conversations between two people) that are longer than two turns.
• E-commerce Corpus: The E-commerce Corpus (Zhang et al., 2018) is a Chinese multi-turn dialogue dataset collected from Taobao, which is the largest e-commerce platform in China. It contains real-world conversations between customers and customer service staff. The corpus consists of diverse conversations such as consultations and recommendations.

Post-training Data
For the fine-grained post-training, we reconstructed the three benchmark datasets. Specifically, out of the one million triples in each benchmark's training set, we used the 500K positive triples as dialogue sessions. Since multiple short context-response pairs can be created from one dialogue session, we eventually constructed 12M, 9M, and 6M short context-response pairs for the Ubuntu Corpus, Douban Corpus, and E-commerce Corpus, respectively. These short context-response pairs were used for the post-training.

Evaluation Metric
Following previous works (Tao et al., 2019; Yuan et al., 2019; Gu et al., 2020), we used recall as an evaluation metric. Recall is denoted as $R_{10}@k$, which measures whether the correct answer is among the top $k$ of the ten candidate responses. Specifically, $R_{10}@1$, $R_{10}@2$, and $R_{10}@5$ were used in the experiments. Apart from $R_{10}@k$, we also employed MAP (mean average precision), MRR (mean reciprocal rank), and P@1 (precision at one) for the Douban Corpus because a context in this dataset may have more than one positive response among the candidates.
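The ranking metrics above can be sketched for a single context as follows. The function names are hypothetical, and the recall here is computed as the fraction of positive candidates found in the top k, which is an assumption that reduces to the definition above when there is exactly one positive. MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all test contexts.

```python
def rank_labels(scores, labels):
    """Return the 0/1 labels ordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def recall_at_k(scores, labels, k):
    """Fraction of the positive candidates that appear in the top k."""
    ranked = rank_labels(scores, labels)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def average_precision(scores, labels):
    """Mean of the precision values at each positive candidate's rank."""
    hits, precisions = 0, []
    for pos, lab in enumerate(rank_labels(scores, labels), start=1):
        if lab:
            hits += 1
            precisions.append(hits / pos)
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank(scores, labels):
    """Inverse rank of the first positive candidate (0 if none)."""
    for pos, lab in enumerate(rank_labels(scores, labels), start=1):
        if lab:
            return 1.0 / pos
    return 0.0


def precision_at_1(scores, labels):
    """1.0 if the top-ranked candidate is positive, else 0.0."""
    return float(rank_labels(scores, labels)[0])
```

With a single positive per context, as in the Ubuntu and E-commerce corpora, `recall_at_k` with k = 1 coincides with `precision_at_1`; the two differ only when a context has several positives, as can happen in the Douban Corpus.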

Baseline Methods
We compared our fine-grained post-trained model, BERT-FP, with the following previous models. For the initial checkpoint, we adopted BERT base (110M) from Devlin et al. (2019).
• SMN: Wu et al. (2017) decomposed the context-response pair into several utterance-response pairs. After every utterance is matched with the response, the matching vectors are accumulated into the final matching score.
• DUA: Zhang et al. (2018) formulated the previous utterances into the context by using deep utterance aggregation.
• DAM: Zhou et al. (2018) proposed a transformer encoder-based model and calculated the matching score between the context and response through self-attention and cross-attention.
• IoI: Through multiple interaction block chains, Tao et al. (2019) performed a deep interaction between the context and the response.
• BERT: A vanilla model fine-tuned to the response selection task on the pre-trained BERT base without post-training.
• RoBERTa-SS-DA: Lu et al. (2020) proposed the speaker segmentation approach, which discriminates between the different speakers, and also applied dialogue augmentation.
• SA-BERT: Gu et al. (2020) incorporated speaker-aware embeddings into the model so that it is aware of speaker change information.
• BERT-SL: Xu et al. (2021) introduced four self-supervised tasks and trained the response selection model with these auxiliary tasks in a multi-task manner.

[Table 2: Results of BERT-FP and the baseline models. Columns: $R_{10}@1$, $R_{10}@2$, $R_{10}@5$ on Ubuntu; MAP, MRR, P@1, $R_{10}@1$, $R_{10}@2$, $R_{10}@5$ on Douban; $R_{10}@1$, $R_{10}@2$, $R_{10}@5$ on E-commerce.]

Table 2 shows the performance of the proposed BERT-FP evaluated on the three benchmarks. As the results show, the proposed model outperformed all of the baseline models. In comparison to the vanilla BERT model, our model achieved an absolute improvement in $R_{10}@1$ of 10.3%p, 4.4%p, and 26%p on Ubuntu Corpus V1, Douban Corpus, and E-commerce Corpus, respectively. Compared to BERT-DPT, our model achieved an absolute improvement of 6%p in $R_{10}@1$ on the Ubuntu Corpus. These results indicate that fine-grained post-training, which reflects the characteristics of dialogue, is superior to the previous post-training. In comparison to the previous state-of-the-art models, UMS_BERT+ and BERT-SL, our model improved performance by a large margin in terms of all the metrics on the three benchmarks. These results demonstrate that our method effectively learns the semantic relevance and coherence between the internal utterances, which significantly enhances selection performance.

Figure 3 shows the performance variations of BERT-FP depending on the length of the short context. To make a large number of experiments feasible, we trained the models in this experiment with 10% of the training set and evaluated them on the entire test set; the scores are therefore lower than those of the fully trained model. For the Ubuntu Corpus and E-commerce Corpus, the best performance in $R_{10}@1$ is achieved when the context length is three. For the Douban Corpus, we evaluated performance with MAP rather than $R_{10}@1$ because there may be multiple correct responses among the candidates. The best performance in MAP on the Douban Corpus is achieved when the context length is set to two.

Performance according to Training Objective
We compared the proposed training objective (URC) with the previous training objectives (NSP, SOP). Table 3 demonstrates that our training objective outperforms the other training objectives. This indicates that learning both topics and the coherence between the internal utterances is important.

Ablation Study
We investigated the impact of each part of the fine-grained post-training method through a series of ablation experiments on the Ubuntu Corpus, reported in Table 4. The model without post-training (BERT) is used as the baseline. Then, we gradually applied our methods for post-training. +MLM indicates that the model is post-trained only with the MLM. The "_SCR" suffix denotes a model that is post-trained with the short context-response pairs. The comparison between +MLM and +MLM+NSP shows that the NSP used in the existing post-training has little effect on the performance. However, as shown in the comparison between +MLM and +MLM+NSP_SCR, the NSP trained with short context-response pairs significantly improved the model performance. The experimental results also showed that using URC instead of NSP enhances performance.

Comparison with the Data Augmentation
The post-training method has the effect of data augmentation. However, it differs from the usual data augmentation method, which directly augments the data in the fine-tuning step. Therefore, we compared the fine-grained post-training (BERT-FP) method with typical data augmentation (BERT-DA) on the Ubuntu Corpus. The data augmentation strategy is similar to the method used in Chen and Wang (2019): we considered each utterance as a response and its previous utterances as its context. The experimental results are shown in Table 5. BERT-FP outperforms the data augmentation model (BERT-DA) by 3.1%p in $R_{10}@1$. This significant improvement demonstrates the effectiveness of the proposed method in comparison to data augmentation. Our method, including the post-training and fine-tuning steps, is about 2.5 times faster than BERT-DA. In particular, the post-trained model takes much less time to fine-tune than BERT-DA, making it easy to adapt to various applications.

The Effectiveness of Fine-grained Post-Training for Response Selection Task
To demonstrate the effectiveness of the fine-grained post-training method for the response selection task, we compared three different models: BERT, BERT-FP, and BERT-FP-NF (no fine-tuning). BERT-FP-NF is a model that was post-trained and evaluated without fine-tuning. As shown in Table 6, the performance of BERT-FP-NF is close to that of the fine-tuned BERT-FP. These results show that even before fine-tuning on the response selection task, our fine-grained post-training alone enables the model to measure the matching degree between the context and the response.

Conclusion
In this paper, we have proposed a new fine-grained post-training method that is suitable for multi-turn dialogue. The proposed method allows the matching model to learn the semantic relevance and the coherence of the utterances in the dialogue, and it improves the model's capability to select the appropriate response. The experimental results on the three benchmark datasets demonstrate the superiority of our post-training method for response selection: our model achieved new state-of-the-art performance on all three benchmarks.
In the future, we plan to research new post-training methods that are suitable for a variety of tasks, such as question answering and dialogue generation.