Response Selection for Multi-Party Conversations with Dynamic Topic Tracking

While participants in a multi-party multi-turn conversation simultaneously engage in multiple conversation topics, existing response selection methods are developed mainly for a two-party single-conversation scenario. Hence, the prolongation and transition of conversation topics are ignored by current methods. In this work, we frame response selection as a dynamic topic tracking task that matches the topic between the response and the relevant conversation context. With this new formulation, we propose a novel multi-task learning framework that supports efficient encoding through large pretrained models, processing only two utterances at once, to perform dynamic topic disentanglement and response selection. We also propose Topic-BERT, an essential pretraining step that embeds topic information into BERT with self-supervised learning. Experimental results on the DSTC-8 Ubuntu IRC dataset show state-of-the-art performance on both the response selection and topic disentanglement tasks, outperforming existing methods by a large margin.


Introduction
In recent years, with the influx of deep learning methods in natural language processing (NLP), there has been a lot of interest in building effective task-oriented dialogue systems that can assist people in real-world business tasks such as booking tickets, ordering food, and solving technical issues (Bui, 2006). Retrieval-based response generation, which selects a suitable response from a pool of candidates (pre-existing human responses), has become a popular approach to framing dialog. Compared to generation-based systems that generate novel utterances (Serban et al., 2016), retrieval-based systems produce fluent, grammatical and informative responses (Weston et al., 2018; Henderson et al., 2019). Also, compared to the traditional modular approach, it does not rely on dedicated modules for language understanding, dialog management, and generation, thus simplifying the system design.

[Figure 1: A (truncated) multi-party conversation from the Ubuntu IRC log. Curved arrows show the 'reply-to' links between utterances. Different colors represent different conversation topic clusters.]
Due to these reasons, retrieval-based systems have been widely adopted in commercial dialogue systems .
Initially, researchers considered response selection in single-turn conversations, where only the last input utterance is considered as the context query. More recent work deals with multi-turn context, which shows improvements over the single-turn context (Lowe et al., 2015, 2017; Zhou et al., 2016; Chen and Wang, 2019; Gu et al., 2019; Zhou et al., 2018). These methods typically aim to encode the context and the candidate responses in a joint semantic space by capturing short- and long-range dependencies, and then retrieve the most relevant response by matching the query representation against each candidate's representation through attention.
However, most of these works are limited to two-party conversations. As dialogue research progresses, it is necessary to study the more generic multi-party multi-turn scenario, which has become very common (e.g., Slack, WhatsApp) with the advent of the Internet and mobile devices, and poses a unique set of challenges for dialog models (Kim et al., 2019). Multiple ongoing conversations occur naturally in multi-party conversations. For example, consider the conversation excerpt in Figure 1 among three participants, taken from the Ubuntu IRC corpus. There are three ongoing conversation topics, highlighted by different colors, and the participants contribute to multiple topics simultaneously (e.g., Nafallo participates in three and danhunt participates in two). An effective response selection method should model such complex conversational topic dynamics in the context, for which existing methods are deficient. In particular, a proper response should match its context in terms of the same conversation topic, while ignoring other, non-relevant topics.
To address the aforementioned challenges in multi-party multi-turn dialog, we frame response selection as a dynamic topic tracking task with the intuition that the topic should remain the same as we go from the context to the response. Our formulation is also supported by the Segmented Discourse Representation Theory (SDRT) of conversations (Asher and Lascarides, 2003). Based on this new formulation, we propose a novel architecture that can incorporate other related dialog tasks such as conversation disentanglement, enabling multi-task learning in a unified framework.
Crucially, our formulation of the task needs to encode only two utterances at a time, thus allowing efficient encoding via large pretrained models like BERT (Devlin et al., 2018). Furthermore, it facilitates pretraining of BERT-like models on topic-related sentence pairs to incorporate topic relevance in pretraining, which can be done on large dialog corpora with self-supervised objectives, requiring no manual topic annotations, and can benefit not only response selection but also other dialog tasks. In summary, our contributions are:
• A new formulation of the response selection task with an efficient multi-task learning framework for dynamic topic tracking, which supports efficient encoding with only two utterances at once.
• Incorporate topic prediction and topic disentanglement as auxiliary tasks within the framework. Given the similarity of the three tasks, the objective is to match the topic between a context utterance and the response (topic prediction) and to track the response's topic across the context (topic disentanglement) in order to select an appropriate response.
• Propose Topic-BERT as a pretraining step to embed topic information into BERT, using a self-supervised approach to generate topic sentence pairs from existing dialogue datasets. The incorporated topic information is shown to be a key component of our topic tracking framework.
• Apply topic attention, using the topic embedding as query, to obtain utterance-level embeddings for topic prediction. Self-attention is then applied to capture the contextual topic vectors for response selection and topic disentanglement.
• Evaluate the proposed models on the DSTC-8 Ubuntu IRC dataset (Kim et al., 2019), showing state-of-the-art results in both response selection and topic disentanglement, outperforming existing methods by a large margin.
Related Work

Response Selection
A dual encoder framework was proposed to match the context and response (Lowe et al., 2015), where a long short-term memory (LSTM) network is utilized to learn the long- and short-term dependencies among tokens. Beyond tokens, sentence-view matching was introduced by applying a hierarchical recurrent neural network to model sentence-level relationships (Zhou et al., 2016). However, the context utterances and response are encoded separately without interaction; thus the semantics extracted from the context are not conditioned on the response. Recent approaches such as the Sequential Matching Network (SMN) leverage the contextual information by matching each contextual utterance with the response, and a multi-channel Convolutional Neural Network (CNN) was proposed to generate multiple levels of granularity of matched segments. These hierarchy-based methods use LSTMs to encode the text, which is not cost-effective for capturing multi-grained segment representations (Lowe et al., 2015; Zhou et al., 2016). One sequence-based method stood out in DSTC-7: the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017) achieved state-of-the-art performance by taking advantage of inter-sentence matching (Chen et al., 2016; Chen and Wang, 2019), converting the multi-turn dialogue setting into a natural language inference setting. In addition, the transformer-based Deep Attention Matching (DAM) approach solves the response selection problem with attention mechanisms (Zhou et al., 2018). It utilizes utterance self-attention and context-to-response cross-attention to leverage hidden representations at multiple levels of granularity. Similar to DAM, the Multi-hop Selector Network (MSN) was proposed to fuse and select relevant context utterances and match them with the response utterance (Yuan et al., 2019). In addition, Tao et al. (2019) studied the relationship between context utterances and the response, indicating that the depth of interaction affects the effectiveness of the model.
Compared to LSTM-based approaches, methods based on transformers (Vaswani et al., 2017) show promising performance in both accuracy and efficiency (Yang et al., 2020). Devlin et al. (2018) proposed BERT, a transformer-based large-scale pretrained language model, which achieves state-of-the-art performance on many NLP tasks. BERT is also a good match for the response selection problem, as shown by Vig and Ramea (2019). Our Topic-BERT is initialised with BERT base and post-trained with topic-related sentence pairs.

Hard Context Retrieval
A crucial side effect of the multi-speaker multi-turn setting is that a lot of noise is introduced into the context utterances. Speaker and addressee information is essential for deciding the structure of a conversation, and can thus also benefit conversational response selection (Zhang et al., 2017; Le et al., 2019). A hard context retrieval method was proposed by Wu et al. (2020b) to minimize the context size by keeping only the utterances whose speaker is the same as, or is referred to by, the response candidates. However, it cannot guarantee a clean context with a single conversation topic. Indeed, topic tracking is necessary along with hard context retrieval.

Conversation Disentanglement
Traditional statistical learning approaches and linguistic features have been shown to be effective for conversation disentanglement (Mayfield et al., 2012; Du et al., 2016). Recent methods demonstrate that neural networks can learn better linguistic representations of the utterances to retrieve the relevant conversation. Hand-crafted features and pretrained word embeddings have been utilized to predict the link-to relationship between utterances. Recently, BERT has been adapted to the disentanglement task to capture the semantics across utterances (Gu et al., 2020). A masked transformer has also been applied to learn a graphical representation of utterances based on the reply-to links.

Task Formulation
Our Topic-BERT framework combines the response selection task with two auxiliary tasks: topic prediction and topic disentanglement.

Response Selection
Our primary task is response selection in multi-party multi-turn conversations. Each context c_i = {u_1, u_2, ..., u_n} is a sequence of utterances, where each utterance u_i = {s_i, w_{i,1}, ..., w_{i,m}} starts with its speaker s_i and is composed of m words. Similarly, a response r_{i,j} has a speaker s_{i,j} and is composed of n words. y_{i,j} ∈ {0, 1} represents the relevance label. Our goal is to learn the relevance ranking score f_{θ_r}(c_i, r_{i,j}) with model parameters θ_r.
Topic Prediction For this (auxiliary) task, we assume a multi-party conversation with a single conversation topic.
Formally, D_t = {(c_i, r_i^+, r_{i,j}^-)} is a topic prediction dataset, where r_i^+ is a positive (same conversation) response and r_{i,j}^- is a negative (different conversation) response for context c_i. For our training purposes, each utterance pair from the same context constitutes (c_i, r_i^+), whereas an utterance pair from different contexts constitutes (c_i, r_{i,j}^-). Our goal is to train a binary classifier g_{θ_t}(c_i, r_i) ∈ {0, 1} with model parameters θ_t.
Topic Disentanglement In this (auxiliary) task, our goal is to disentangle single conversations from a multi-party conversation based on topics. For a given conversation context c_i = {u_1, u_2, ..., u_n}, a set of pairwise "reply-to" annotations R = {(u_c, u_p)_1, ..., (u_c, u_p)_{|R|}} is given, where u_p is a parent of child u_c. Our task is to compute a reply-to score h_{θ_d}(u_i, u_j) for j ≤ i that indicates the score for u_j being the parent of u_i, with model parameters θ_d. The individual conversations can then be constructed by following the reply-to links. Note that an utterance u_i can point to itself, which we call a self-link. Self-links are either the start of a conversation or a system message, and they play a crucial role in identifying the conversation clusters.
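To make the clustering step concrete, the following sketch (our own illustration, not code from the paper) follows predicted reply-to links, including self-links, to recover conversation clusters:

```python
def cluster_conversations(parents):
    """Group utterance indices into conversations by following reply-to links.

    parents[i] = j means utterance i replies to utterance j (with j <= i);
    parents[i] = i marks a self-link, i.e. the start of a new conversation.
    """
    clusters = {}
    root = {}
    for i, p in enumerate(parents):
        # Since p <= i, the parent's root is already known (or i starts a cluster).
        r = i if p == i else root[p]
        root[i] = r
        clusters.setdefault(r, []).append(i)
    return list(clusters.values())

# Example: utterances 0 and 2 start conversations; 1 and 3 reply to 0; 4 replies to 2.
print(cluster_conversations([0, 0, 2, 0, 2]))  # → [[0, 1, 3], [2, 4]]
```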

Our Topic-BERT Framework
Our framework for response selection aims to track how the conversation topics change from one utterance to another and uses this for ranking the candidate responses. As shown in Fig. 2, we encode an utterance u_k from the context c_i = {u_1, u_2, ..., u_n} along with a candidate response r_{i,j} using our pretrained Topic-BERT encoder (§4.1). The contextual token representations in Topic-BERT encode topic relevance between the tokens of u_k and the tokens of r_{i,j}, while the [CLS] representation captures utterance-level topic relevance. We use the [CLS] representation as query to attend over the token representations to further enforce topic relevance in the attended topic vector t_k.
We repeat this encoding process for the n utterances in the context c_i to get n different topic vectors T_j = {t_1, ..., t_n} that model r_{i,j}'s topic relevance to each of the context utterances. These topic representations are then used for the prediction tasks: topic prediction, disentanglement, and response selection. Response selection is our main task, while the other two tasks are auxiliary and optional. Since our Topic-BERT encodes two utterances at a time, the encoding process is efficient and can be used to encode larger contexts. The core component of our framework is the Topic-BERT pretraining, as we describe next.

Topic-BERT Pretraining
One crucial advantage of our topic-based task formulation is that it allows us to pretrain BERT directly on a very relevant task in a self-supervised way, without requiring any human annotation. In other words, our goal is to pretrain BERT such that it can be used to encode relevant topic information for our task(s). For this, we assume that a single-threaded conversation between two or more participants covers a single topic, and that the utterance pairs in that thread can be used to pretrain our Topic-BERT with relevant self-supervised objectives.
To collect such single-threaded conversational data in an opportunistic way, we can simply adopt the (unsupervised) heuristics used by Lowe et al. (2015) to collect the popular Ubuntu Dialogue Corpus from multi-threaded chat logs. Alternatively, we can extract two-party conversations from other sources as done in previous work (Henderson et al., 2019; Wu et al., 2020a). In our experiments, we use the data from DSTC-8 Task 1 (Kim et al., 2019), which was automatically collected from Ubuntu chat logs. This dataset contains detached speaker-visible conversations between two or more participants from the Ubuntu IRC channel.
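To illustrate how such self-supervised topic pairs can be assembled, the sketch below builds positive and negative pairs at a 1:4 ratio. The helper name and the strategy of sampling negatives from other conversations (rather than from the DSTC-8 candidate pool the paper uses) are our assumptions:

```python
import random

def make_topic_pairs(conversations, neg_per_pos=4, seed=0):
    """Build (utterance, response, label) pairs for topic-pair pretraining.

    Adjacent utterances from the same conversation form positives (label 1);
    for each positive we sample `neg_per_pos` utterances from the other
    conversations as negatives (label 0), giving a 1:4 positive/negative ratio.
    """
    rng = random.Random(seed)
    pairs = []
    for ci, conv in enumerate(conversations):
        # Pool of candidate negatives: utterances from all other conversations.
        neg_pool = [u for cj, other in enumerate(conversations) if cj != ci
                    for u in other]
        for a, b in zip(conv, conv[1:]):            # adjacent same-topic pairs
            pairs.append((a, b, 1))
            for neg in rng.sample(neg_pool, neg_per_pos):
                pairs.append((a, neg, 0))
    return pairs
```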
To pretrain Topic-BERT, we first initialise it with the pretrained uncased BERT base (Devlin et al., 2018). We treat the training setting similarly to our topic prediction task in §3. Formally, the pretraining dataset is D_pr, where each utterance pair from the same conversation (including the true response) constitutes a positive pair (u_i, r_i^+), and for each such positive pair we randomly sample 4 negative responses r_{i,j}^- from the 100-candidate pool to balance the positive and negative ratio. We (re)train Topic-BERT on D_pr with two self-supervised objectives as follows. Similar to the original BERT's Next Sentence Prediction (NSP) task, the position embedding, segment embedding and token embedding are added together to get the input-layer token representations. The token representations are then passed through multiple transformer (Vaswani et al., 2017) encoder layers, where each layer comprises a self-attention and a feed-forward sublayer. Different from the original BERT, Topic-BERT uses the [CLS] representation to predict whether the training instance is a positive (same topic) pair or a negative (different topic) pair. Thus, the [CLS] representation encodes the topic relationship between the two utterances and is used as the topic-aware contextual embedding to determine whether the two utterances match in topic.

Topic-BERT Multi-Task Framework
As shown in Fig. 2(b), the encoded representations from our Topic-BERT are passed through a topic attention layer ( §4.2.1) to get the corresponding topic vectors, which are then used for the end tasks.
Figure 2: Overview of the Topic-BERT architecture. (a) Topic-BERT pretraining with topic sentence pairs to incorporate the utterance-utterance topic relationship. (b) Our multi-task framework, which uses the pretrained Topic-BERT to enhance topic information in the encoded representations to support three downstream tasks: response selection as the main task, with topic prediction and disentanglement as two auxiliary (optional) tasks.

Topic Attention Layer
We apply an attention layer to enhance topic information in the encoded vector. We use Topic-BERT's [CLS] representation T_CLS as query to attend to the remaining K token representations {T_j}_{j=1}^{K}:

e_j = v_a^T tanh(W_a T_CLS + U_a T_j),
α = softmax(e),
T_topic = Σ_{j=1}^{K} α_j T_j,

where v_a, W_a and U_a are trainable parameters. The concatenation of T_topic and T_CLS constitutes the final topic vector, i.e., t = [T_CLS ; T_topic]. We repeat this encoding process for the n utterances in the context c_i = {u_1, u_2, ..., u_n} by pairing each with the candidate response r_{i,j} to get n different topic vectors T_j = {t_1, ..., t_n}. T_j represents r_{i,j}'s topic relevance to the context utterances, which will be fed to the task-specific layers.
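A minimal numerical sketch of this topic attention, assuming the standard additive form consistent with the trainable parameters v_a, W_a, U_a named above (bias terms omitted; function and variable names are ours):

```python
import numpy as np

def topic_attention(T_cls, T_tokens, W_a, U_a, v_a):
    """Additive attention with the [CLS] vector as query over token vectors.

    T_cls: (d,), T_tokens: (K, d); W_a, U_a: (d_a, d); v_a: (d_a,).
    Returns the topic vector t = [T_cls ; T_topic] of shape (2d,).
    """
    # Unnormalised alignment scores e_j = v_a^T tanh(W_a T_cls + U_a T_j)
    e = np.tanh(W_a @ T_cls + T_tokens @ U_a.T) @ v_a      # (K,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                    # softmax weights
    T_topic = alpha @ T_tokens                              # weighted sum, (d,)
    return np.concatenate([T_cls, T_topic])
```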

Topic Prediction
Topic prediction is done for each utterance-response pair (u_k, r_{i,j}) for all u_k ∈ c_i to decide whether u_k and r_{i,j} should be in the same topic (§3). The Topic-BERT encoded topic vector corresponding to the (u_k, r_{i,j}) pair is t_k ∈ T_j. We define the binary topic classification model as:

g_{θ_t}(u_k, r_{i,j}) = sigmoid(w_p^T t_k),

where w_p is the task-specific parameter. We use a binary cross-entropy loss computed as:

L_tp = −[y log g_{θ_t} + (1 − y) log(1 − g_{θ_t})],

where y ∈ {0, 1} is the ground truth indicating same or different topic. Note that topic prediction is an auxiliary task intended to help our main task of response selection, as we describe next.
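The classifier and its loss can be sketched as follows (a minimal illustration; the bias term is omitted and the function name is ours):

```python
import numpy as np

def topic_prediction_step(t_k, w_p, y):
    """Binary same-topic classifier with its cross-entropy loss.

    t_k: topic vector for an (u_k, r) pair; w_p: task-specific weight vector;
    y: 1 if same topic, else 0. Returns the predicted probability and BCE loss.
    """
    p = 1.0 / (1.0 + np.exp(-(w_p @ t_k)))    # sigmoid(w_p^T t_k)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return p, loss
```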

Response Selection
In response selection, our goal is to measure the relevance of a candidate response r_{i,j} with respect to the context c_i. For this, we first apply the hard context retrieval method proposed by Wu et al. (2020b) to filter out irrelevant utterances and to reduce the context size. Then, we feed each context utterance paired with the response r_{i,j} into Topic-BERT to compute the corresponding topic vectors T_j through the topic attention layer. We pass the topic vectors T_j ∈ R^{n×d} through a scaled dot-product self-attention layer (Vaswani et al., 2017) to learn all-pair topic relevance at the utterance level. Formally,

T'_j = softmax((T_j W_q)(T_j W_k)^T / √d)(T_j W_v),

where W_q, W_k and W_v ∈ R^{d×d} are the query, key and value parameters, respectively, and d denotes the hidden dimension of 768.
We add a max-pooling layer to select the most important information, followed by a linear layer and a softmax to compute the relevance score of the response r_{i,j} with the context c_i. Formally,

f_{θ_r}(c_i, r_{i,j}) = softmax(maxpool(T'_j) W_r),

where W_r is the task-specific parameter. We use the standard cross-entropy loss defined as:

L_rs = − Σ_j 1(y_{i,j}) log f_{θ_r}(c_i, r_{i,j}),

where 1(y_{i,j}) is the one-hot encoding of the ground-truth label.
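The self-attention, max-pooling, and scoring steps can be sketched numerically as below (our illustration; bias terms omitted, names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def response_score(T, W_q, W_k, W_v, W_r):
    """Scaled dot-product self-attention over n topic vectors, then
    max-pooling and a linear head, as described for response selection.

    T: (n, d) topic vectors for one candidate response; W_q/W_k/W_v: (d, d);
    W_r: (d, 2) task head. Returns the 2-way relevance distribution.
    """
    Q, K, V = T @ W_q, T @ W_k, T @ W_v
    d = T.shape[1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # all-pair topic relevance
    T_prime = A @ V                              # (n, d) contextual topic vectors
    pooled = T_prime.max(axis=0)                 # max-pool over utterances
    return softmax(pooled @ W_r)                 # relevance score
```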

Topic Disentanglement
For topic disentanglement ( §3), our goal is to find the "reply-to" links between the utterances (including the candidate response) to track which utterance is replying to which previous utterance.
For training on topic disentanglement, we simulate a sliding window over the entire (entangled) conversation. Each window constitutes a context c i = {u 1 , u 2 , . . . , u n } and the model is trained to find the parent of u n in c i , in other words, we try to find the reply-to link (u n , u np ) for 1 ≤ n p ≤ n.
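The sliding-window sample construction can be sketched as follows (the function name and the handling of parents that fall outside the window are our assumptions; the paper does not specify the latter case):

```python
def sliding_window_samples(n_utts, parents, window):
    """Build disentanglement training samples with a sliding window.

    parents[k] is the index of utterance k's annotated parent
    (parents[k] == k marks a self-link). Each sample is
    (window_start, k, target), where target is the parent's position
    inside the window; windows whose parent falls outside are skipped.
    """
    samples = []
    for k in range(n_utts):
        start = max(0, k - window + 1)
        p = parents[k]
        if p < start:
            continue                  # parent outside the window
        samples.append((start, k, p - start))
    return samples
```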
For the input to our Topic-BERT (Fig. 2b), we treat u n as the response, thus allowing also response-response (u n , u n ) interactions through Topic-BERT's encoding layers to facilitate self-link predictions (the fact that u n can point to itself).
In the task-specific layer for disentanglement, we take the self-attended topic vectors T'_j = {t'_1, ..., t'_n} as input and separate them into two parts: context topic vectors T'_c = {t'_1, ..., t'_{n−1}} ∈ R^{(n−1)×d} and the response topic vector t'_n ∈ R^d. To model high-order interactions between the response and the context utterances, we compute the differences and element-wise products between them (Chen and Wang, 2019). We duplicate the response vector t'_n to obtain T'_r ∈ R^{(n−1)×d} and concatenate:

T'' = [T'_c ; T'_r ; T'_c − T'_r ; T'_c ⊙ T'_r].

Then, we compute the reply-to distribution as h_{θ_d}(u_n, c_i) = softmax(T'' w_d) ∈ R^{n×1}, and optimize with a cross-entropy loss against the one-hot encoding of the true parent u_{n_p}. For inference, we compute arg max_j h_{θ_d}(u_n, c_i).
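A numerical sketch of the reply-to head follows. The extracted text is ambiguous about how the self-link row enters the n-way softmax, so this sketch scores all n rows (including the response row itself) so that the last entry is the self-link score; that choice, the bias omission, and the names are our assumptions:

```python
import numpy as np

def reply_to_distribution(T_prime, w_d):
    """Score every utterance in the window as a candidate parent of u_n.

    T_prime: (n, d) self-attended topic vectors, last row = response u_n.
    w_d: (4d,) task-specific weights. Features concatenate
    [candidate; response; difference; element-wise product] per Eq. 9.
    """
    t_r = T_prime[-1]                               # response topic vector
    T_r = np.tile(t_r, (T_prime.shape[0], 1))       # duplicated response
    feats = np.concatenate(
        [T_prime, T_r, T_prime - T_r, T_prime * T_r], axis=1)  # (n, 4d)
    logits = feats @ w_d
    e = np.exp(logits - logits.max())
    return e / e.sum()                              # softmax over parents
```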

Multi-task Learning
We jointly train the three tasks (response selection, topic prediction and topic disentanglement), which share the same topic attention weights so as to benefit each other. Response selection should benefit from dynamic topic prediction and disentanglement; similarly, topic prediction and disentanglement should benefit from response selection. The overall loss is a combination of the three task losses from Equations 5, 8, and 10:

L = α L_rs + β L_tp + γ L_td,

where α, β, and γ are weights chosen from [0, 0.1, 0.2, ..., 1] by optimizing our model's response selection accuracy on the dev set.
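The weight selection above amounts to a grid search; a sketch under the assumption that some routine trains and evaluates the model for a given weight triple:

```python
import itertools

def grid_search_weights(eval_dev_accuracy, step=0.1):
    """Select (alpha, beta, gamma) from {0, 0.1, ..., 1}^3.

    eval_dev_accuracy(a, b, g) -> float is assumed to train/evaluate the
    multi-task model under loss a*L_rs + b*L_tp + g*L_td and return the
    dev-set response selection accuracy.
    """
    grid = [round(i * step, 1) for i in range(int(round(1 / step)) + 1)]
    return max(itertools.product(grid, repeat=3),
               key=lambda w: eval_dev_accuracy(*w))
```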

Experiments
In this section, we present our experiments, including the datasets, experimental setup, evaluation metrics, and the results with analysis.

Datasets and Setup
Considering multi-party conversations, we adopt the publicly available Ubuntu dataset from DSTC-8 Track 2, "NOESIS II: Predicting Responses" (Kim et al., 2019). This dataset consists of four tasks, and we use the datasets from three of them: Task 1, single-topic multi-party dialogues for response selection; Task 2, a long Ubuntu chat log with multi-party conversations covering multiple ongoing topics simultaneously, which is ideal for our main response selection evaluation; and Task 4, multi-party chat with link annotations (used for the disentanglement task). Table 1 shows the dataset statistics. More details about the datasets, experimental setup, and training can be found in the Appendix.
Evaluation Metrics DSTC-8 Track 2 considered a range of metrics for comparing models. We follow their evaluation metrics; details can be found in the Appendix.

Experiment I: Response Selection
Baseline Models. We compare the proposed Topic-BERT approach with several existing and state-of-the-art approaches for response selection: • BERT. We adopt the vanilla pretrained uncased BERT base as the base model and follow Gu et al. (2020) to post-train BERT base for 10 epochs on DSTC Task 1 (response selection in a single-topic dialog). We take the whole context with the response as one input sequence. We then fine-tune it on Task 2's response selection for 10 more epochs. More details can be found in the Appendix.
• ToD-BERT. This is a domain-specific pretrained BERT from Wu et al. (2020a), which is pretrained on a combination of 9 Task-oriented Dialogue datasets and surpasses BERT in several downstream response selection tasks.
• Adapt-BERT. This is based on the BERT model with task-related pretraining and context modeling through hard and soft context modeling, and it ranks top-1 in the DSTC-8 response selection challenge (Wu et al., 2020b).

Results. From Table 2, we can see that our Topic-BERT model outperforms the baselines by a large margin. Examining our model in detail, we found that context filtering, self-supervised topic training, and topic attention all contribute positively, boosting Recall@1 from 0.287 (BERT base) to 0.696 (Topic-BERT with the standalone response selection task). This shows that our topic pretraining with task-related data improves BERT for the response selection task. Furthermore, the performance continues to increase from 0.696 to 0.710 when we jointly train response selection and topic prediction (second-to-last row), validating an effective utilization of topic information in selecting responses. Replacing topic prediction with disentanglement further improves the result from 0.710 to 0.720, showing that response selection can exploit topic tracking by sharing the connections between utterances. Finally, our Topic-BERT with multi-task learning achieves the best result (0.726) and significantly outperforms the prior state-of-the-art Adapt-BERT on the DSTC-8 response selection task (Kim et al., 2019).
We further compute BLEU-4 with SacreBLEU (Post, 2018) for the responses incorrectly selected by Topic-BERT and ToD-BERT. From Table 3, we see that the responses retrieved by Topic-BERT are more relevant even when they are not ranked first.

Experiment II: Topic Prediction
This experiment examines how much our Topic-BERT improves over the baselines on the topic prediction task, which is important for both response selection and topic disentanglement.
Baseline Models.
• BERT. We use our post-trained BERT base from §5.1 and fine-tune it on Task 1 topic sentence pairs as our BERT baseline for topic prediction.
• ToD-BERT. We adopt our post-trained ToD-BERT and fine-tune it with our obtained topic sentence pairs as the ToD-BERT baseline.
Results. Table 4 gives the topic prediction results on DSTC-8 Task 1. From the results, we can see that our Topic-BERT outperforms the BERT and ToD-BERT baselines significantly on the topic prediction task. Compared with our pretrained Topic-BERT without fine-tuning (last row), the proposed topic attention further enhances the topic matching of two utterances, improving the F-score by 1.5% (from 0.813 to 0.828). Joint training with the response selection or disentanglement tasks shows a similar effect on topic prediction, and the contextual topic information shared by the Topic-BERT multi-task model adds a marginal improvement. Compared with vanilla BERT, ToD-BERT (Wu et al., 2020a) makes a substantial improvement on the topic prediction task, but not as significant as ours. This further confirms the importance and efficacy of our learning scheme. Meanwhile, comparing our pretrained Topic-BERT without fine-tuning (last row) with the BERT model that does not use STP (first row), the significant gap indicates how much our Topic-BERT benefits from the STP loss.

Experiment III: Disentanglement
This experiment examines how well Topic-BERT can tackle the topic disentanglement task.
Baseline Models.
• BERT & ToD-BERT. We use our fine-tuned BERT and ToD-BERT models from §5.2 as baselines, taking the history of utterances (u_1, ..., u_{n−1}, u_n) and pairing each with the current utterance u_n as input. Following Gu et al. (2020), a single-layer BiLSTM is applied to extract cross-message semantics from the [CLS] outputs. Then we take the differences and element-wise products (Eq. 9) between the history and the current utterance. Finally, a feed-forward layer is used for link prediction.
• Feed-Forward. This is the baseline model from the DSTC-8 task organizers with the best result on Task 4, trained by employing a two-layer feed-forward neural network on a set of 77 hand-engineered features combined with averaged word embeddings from pretrained GloVe embeddings.
• Masked Hierarchical (MH) BERT. This is a two-stage BERT proposed to model the conversation structure, in which the low-level BERT captures the utterance-level contextual representation between utterances, and the high-level BERT models the conversation structure with an ancestor masking approach to avoid irrelevant connections.
Results. From the results in Table 5, we can see that our Topic-BERT achieves the best result and outperforms all the BERT-based baselines significantly. This shows that our multi-task learning can enrich the link relationships, improving disentanglement together with topic prediction and response selection. The improvement of Topic-BERT over the baseline model using a feed-forward network and hand-crafted features is smaller, but our approach avoids manual feature engineering. Many of these features are dataset- or domain-specific and do not generalize across datasets or domains.

Experiment IV: Evaluation on New Task
Finally, we examine Topic-BERT's transferability on a new task based on the Ubuntu Corpus v1 dataset, comparing with various state-of-the-art response selection methods in Table 6. Ubuntu Corpus v1 contains a 1M train set and 500K validation and test sets (Lowe et al., 2015).
Baseline Models. Here we mainly introduce the state-of-the-art baseline: BERT-DPT (Whang et al., 2019), which fine-tunes BERT by optimizing the domain post-training (DPT) loss comprising both NSP and MLM objectives for response selection. Details of other baselines can be found in Appendix.
Results. Our Topic-BERT with the standalone response selection task, fine-tuned on Ubuntu Corpus v1, outperforms the state-of-the-art BERT-DPT, improving Recall_10@1 by about 1%. This result shows that the topic relevance learned by Topic-BERT can transfer to a novel task, that the topic information influences response selection positively, and that our utterance-level topic tracking is effective for response selection.

Conclusion
This paper presented a new formulation of response selection in multi-party conversations from a novel dynamic topic tracking perspective. Based on this formulation, we proposed Topic-BERT for response selection in multi-party conversations, which consists of two steps: (1) topic-based pretraining to embed topic information into BERT with self-supervised learning, and (2) multi-task learning on the pretrained model by jointly training response selection with dynamic topic prediction and disentanglement. Empirically, the proposed Topic-BERT achieved state-of-the-art results on the DSTC-8 Ubuntu IRC datasets.