Transfer Learning for Context-Aware Question Matching in Information-seeking Conversations in E-commerce

Building multi-turn information-seeking conversation systems is an important and challenging research topic. Although several advanced neural text matching models have been proposed for this task, they are generally not efficient for industrial applications. Furthermore, they rely on a large amount of labeled data, which may not be available in real-world applications. To alleviate these problems, we study transfer learning for multi-turn information-seeking conversations in this paper. We first propose an efficient and effective multi-turn conversation model based on convolutional neural networks. After that, we extend our model to adapt the knowledge learned from a resource-rich domain to enhance the performance. Finally, we deployed our model in an industrial chatbot called AliMe Assist (https://consumerservice.taobao.com/online-help) and observed a significant improvement over the existing online model.


Introduction
With the popularity of online shopping, an increasing number of customers seek information about the items they are interested in. To handle customer questions efficiently, a common approach is to build a conversational customer service system (Yang et al., 2018).
In the e-commerce environment, an information-seeking conversation system can serve millions of customer questions per day. According to the statistics from a real e-commerce website, the majority of customer questions (nearly 90%) are business-related or seek information about logistics, coupons, etc. Among these conversation sessions, 75% contain more than one turn, according to a statistic from AliMe Assist in Alibaba Group. Hence it is important to handle multi-turn conversations, or context information, in these conversation systems.
Recent research in this area has focused on deep learning and reinforcement learning (Shang et al., 2015; Li et al., 2016a,b; Sordoni et al., 2015; Wu et al., 2017). One such method is the Sequential Matching Network (SMN) (Wu et al., 2017), which matches a response with each utterance in the context at multiple levels of granularity and leads to state-of-the-art performance on two multi-turn conversation corpora. However, such methods suffer from at least two problems: they may not be efficient enough for industrial applications, and they rely on a large amount of labeled data, which may not be available in reality.
To address the problem of efficiency, we make three major modifications to SMN that boost its efficiency while preserving its effectiveness. First, we remove the RNN layers over the inputs. Second, SMN uses a Sentence Interaction based (SI-based) Pyramid model to model each utterance and response pair. In practice, a Sentence Encoding based (SE-based) model like BCNN (Yin and Schütze, 2015) is complementary to the SI-based model. Therefore, we extend this component to incorporate an SE-based BCNN model, resulting in a hybrid CNN (hCNN) (Yu et al., 2017). Third, instead of using an RNN to model the output representations, we use a CNN followed by a fully-connected layer to further boost the efficiency of our model. As shown in our experiments, our final model yields comparable results but with higher efficiency than SMN.
To address the second problem of insufficient labeled data, we study transfer learning (TL) (Pan and Yang, 2010) to utilize a source domain with adequate labeling to help the target domain. A typical TL approach is to use a shared NN (Mou et al., 2016; Yang et al., 2017) and domain-specific NNs to derive shared and domain-specific features respectively. Recent studies (Ganin et al., 2016; Taigman et al., 2017; Liu et al., 2017) consider adversarial networks to learn more robust shared features across domains. Inspired by these studies, we extend our method with a transfer learning module to leverage information from a resource-rich domain. Similarly, our TL module consists of a shared NN and two domain-specific NNs for the source and target domains. The output of the shared NN is further linked to an adversarial network, as used in (Liu et al., 2017), to help learn domain-invariant features. Meanwhile, we also use domain discriminators on both the source and target features derived by the domain-specific NNs to help learn domain-specific features. Experiments show that our TL method can further improve the model performance on a target domain with limited data.
To the best of our knowledge, our work is the first to study transfer learning for context-aware question matching in conversations. Experiments on both benchmark and commercial data sets show that our proposed model outperforms several baselines, including the state-of-the-art SMN model. We have also deployed our model in an industrial bot called AliMe Assist and observed a significant improvement over the existing online model.

Model
Our model is designed to address the following general problem. Given an input sequence of utterances $\{u_1, u_2, \ldots, u_n\}$ and a candidate question $r$, our task is to identify the matching degree between the utterances and the question. When the number of utterances is one, our problem is identical to paraphrase identification (PI) (Yin and Schütze, 2015) or natural language inference (NLI) (Bowman et al., 2015). Furthermore, we consider a transfer learning setting to transfer knowledge from a source domain to help a target domain.

Multi-Turn hCNN (MT-hCNN)
We present an overview of our model in Fig. 1. In a nutshell, our model first obtains a representation for each utterance and candidate question pair using a hybrid CNN (hCNN), then concatenates all the representations and feeds them into a CNN and a fully-connected layer to obtain the final output.
The hybrid CNN (hCNN) model (Yu et al., 2017) is based on two models: a modified SE-based BCNN model (Yin et al., 2016) and an SI-based Pyramid model. The former encodes the two input sentences $X_1$ and $X_2$ separately with a CNN and then combines the resulting sentence embeddings as follows:
$$h_1 = \mathrm{CNN}_1(X_1), \qquad h_2 = \mathrm{CNN}_1(X_2),$$
$$H_b = h_1 \oplus h_2 \oplus (h_1 - h_2) \oplus (h_1 \cdot h_2),$$
where '−' and '·' refer to element-wise subtraction and multiplication, and '⊕' refers to concatenation.
Furthermore, we add an SI-based Pyramid component to the model. We first produce an interaction matrix $M \in \mathbb{R}^{m \times m}$, where $M_{i,j}$ denotes the dot-product score between the $i$-th word in $X_1$ and the $j$-th word in $X_2$. Next, we stack two 2-D convolutional layers and two 2-D max-pooling layers on it to obtain the hidden representation $H_p$. Finally, we concatenate the hidden representations as the output for each input sentence pair:
$$h = \mathrm{hCNN}(X_1, X_2) = H_b \oplus H_p.$$
We now extend hCNN to handle multi-turn conversations, resulting in the MT-hCNN model. Let $\{u_1, u_2, u_3, \ldots, u_n\}$ be the utterances and $r$ the candidate question. Then
$$h_{u_i} = \mathrm{hCNN}(u_i, r), \quad i \in [1, n],$$
$$H = [h_{u_1}; h_{u_2}; \ldots; h_{u_n}],$$
$$P = \mathrm{CNN}_3(H),$$
$$O = \text{Fully-Connected}(P).$$
Note that $H$ is obtained by stacking all the $h_{u_i}$; $\mathrm{CNN}_3$ is another CNN with a 2-D convolutional layer and a 2-D max-pooling layer; and the output of $\mathrm{CNN}_3$ is fed into a fully-connected layer to obtain the final representation $O$.
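To make the data flow of MT-hCNN concrete, the following is a minimal NumPy sketch of the forward pass. It rests on several simplifying assumptions that are not taken from the paper: single-channel 2-D convolutions, randomly initialized weights, all sentences padded to the same length m, and toy sizes (m = 12 words, 8-dimensional embeddings, 6 filters). The names `encode_sentence`, `hcnn`, `mt_hcnn`, and `params` are hypothetical; the models evaluated in this paper were implemented in TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_relu(x, kernel):
    """Valid 2-D cross-correlation of a single-channel map, followed by ReLU."""
    kh, kw = kernel.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def encode_sentence(X, filters):
    """SE-based encoder: 1-D convolution over word windows, ReLU, max over time."""
    k = filters.shape[0]  # window size in words
    windows = np.stack([X[i:i + k] for i in range(X.shape[0] - k + 1)])
    feats = np.einsum("wkd,kdf->wf", windows, filters)
    return np.maximum(feats, 0.0).max(axis=0)

def hcnn(X1, X2, p):
    """Hybrid representation of one sentence pair: SE-based part plus SI-based part."""
    h1, h2 = encode_sentence(X1, p["cnn1"]), encode_sentence(X2, p["cnn1"])
    H_b = np.concatenate([h1, h2, h1 - h2, h1 * h2])
    H_p = X1 @ X2.T  # interaction matrix M of dot-product scores
    for kernel in p["cnn2"]:  # two conv + max-pooling layers
        H_p = max_pool2d(conv2d_relu(H_p, kernel))
    return np.concatenate([H_b, H_p.ravel()])

def mt_hcnn(utterances, r, p):
    """Stack the per-utterance pair representations, then apply CNN_3 and an FC layer."""
    H = np.stack([hcnn(u, r, p) for u in utterances])
    P = max_pool2d(conv2d_relu(H, p["cnn3"]), size=2)
    return np.maximum(P.ravel() @ p["fc_w"] + p["fc_b"], 0.0)

# Toy sizes (assumptions): m words per sentence, d-dim embeddings, f filters.
m, d, f, n_turns = 12, 8, 6, 3
params = {
    "cnn1": rng.normal(scale=0.1, size=(3, d, f)),
    "cnn2": [rng.normal(scale=0.1, size=(2, 2)) for _ in range(2)],
    "cnn3": rng.normal(scale=0.1, size=(2, 2)),  # window size 2, as in the settings
}
utterances = [rng.normal(size=(m, d)) for _ in range(n_turns)]
r = rng.normal(size=(m, d))

pair_dim = hcnn(utterances[0], r, params).shape[0]
fc_in = ((n_turns - 1) // 2) * ((pair_dim - 1) // 2)  # size after CNN_3 and pooling
params["fc_w"] = rng.normal(scale=0.1, size=(fc_in, 128))
params["fc_b"] = np.zeros(128)

O = mt_hcnn(utterances, r, params)
print(O.shape)  # (128,)
```

Because no recurrent layer appears anywhere in this pipeline, every utterance-question pair can be encoded independently, which is the property the efficiency argument relies on.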

Transfer with Domain Discriminators
We further study transfer learning (TL) to transfer knowledge from a resource-rich source domain to our target domain, in order to reduce the dependency on large-scale labeled training data. Similar to (Liu et al., 2017), we use a shared MT-hCNN and two domain-specific MT-hCNNs to derive shared features $O^c$ and domain-specific features $O^s$ and $O^t$. Each domain-specific output layer combines the shared features with the features of its own domain: $W^{sc}$, $W^{tc}$, $W^s$, and $W^t$ are the weights for the shared-source, shared-target, source, and target domains respectively, while $b^s$ and $b^t$ are the biases for the source and target domains respectively.
Following (Liu et al., 2017), we use an adversarial loss $L_a$ to encourage the learned shared features to be indiscriminate across the two domains. It is computed from the domain label $d_i$ and the domain probability $p(d_i \mid \cdot)$ given by a domain discriminator.
In contrast, to encourage the specific feature spaces to be discriminable between domains, we apply domain discrimination losses on the two specific feature spaces. We add two negative cross-entropy losses, $L_s$ for the source domain and $L_t$ for the target domain, defined with an indicator function $I^{d_i = d}$ that is set to 1 when the statement $(d_i = d)$ holds and 0 otherwise. Finally, we combine these losses into a single objective, in which $\Theta$ denotes the model parameters.
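As an illustration of how the losses in this section could fit together, here is a minimal NumPy sketch. The exact forms are assumptions rather than the paper's definitions: $L_a$ is taken to be the negative entropy of the domain distribution predicted from the shared features (so minimizing it pushes the discriminator toward a uniform guess), $L_s$ and $L_t$ are taken to be standard cross-entropies on the domain-specific features, and $\lambda_4$ is assumed to weight an L2 penalty on $\Theta$. The function names are hypothetical; the default weights are the values reported in the experimental settings.

```python
import numpy as np

def adversarial_loss(shared_probs, eps=1e-12):
    """Assumed form of L_a: negative entropy of p(d | shared features).

    shared_probs has shape (n, 2), one row of domain probabilities per example.
    The value is lowest (-log 2) when the discriminator cannot tell domains apart.
    """
    return np.mean(np.sum(shared_probs * np.log(shared_probs + eps), axis=1))

def domain_loss(specific_probs, domain_labels, eps=1e-12):
    """Assumed form of L_s / L_t: cross-entropy of the true domain label."""
    picked = specific_probs[np.arange(len(domain_labels)), domain_labels]
    return -np.mean(np.log(picked + eps))

def combined_loss(task_loss, L_a, L_s, L_t, theta,
                  lambdas=(0.05, 0.05, 0.05, 0.005)):
    """Weighted sum of the task loss, the three domain losses, and an L2 penalty.

    theta is a list of weight arrays; treating lambda_4 as an L2 weight is an
    assumption of this sketch.
    """
    l1, l2, l3, l4 = lambdas
    l2_penalty = sum(np.sum(w ** 2) for w in theta)
    return task_loss + l1 * L_a + l2 * L_s + l3 * L_t + l4 * l2_penalty

# A discriminator that cannot separate the domains vs. one that can.
uniform = np.full((4, 2), 0.5)
confident = np.array([[1.0, 0.0], [0.0, 1.0]])
print(adversarial_loss(uniform) < adversarial_loss(confident))  # True
```

Under these assumptions the two kinds of loss pull in opposite directions by design: the shared features are rewarded for hiding the domain, while the domain-specific features are rewarded for revealing it.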

Experiments
We evaluate the efficiency and effectiveness of our base model, the transferability of the model, and its online performance in an industrial chatbot.
Datasets: We evaluate our methods on two multi-turn conversation corpora, namely the Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015) and AliMe data.
Ubuntu Dialog Corpus: The Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015) contains multi-turn technical support conversation data collected from the chat logs of the Freenode Internet Relay Chat (IRC) network. We used the copy of the data shared by Xu et al., in which numbers, URLs, and paths are replaced by special placeholders. It is also used in several previous related works (Wu et al., 2017; Yang et al., 2018) and can be downloaded from https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu%20data.zip?dl=0 . It consists of 1 million context-response pairs for training, 0.5 million pairs for validation, and 0.5 million pairs for testing.
AliMe Data: We collected the chat logs between customers and a chatbot called AliMe from "2017-10-01" to "2017-10-20" in Alibaba; the textual contents related to user information are filtered. The chatbot is built on a question-to-question matching system, where for each query it finds the most similar candidate question in a QA database and returns its answer as the reply. It indexes all the questions in our QA database using Lucene. For each given query, it uses the TF-IDF ranking algorithm to retrieve candidates. To form our data set, we concatenated utterances within three turns to form a query, and used the chatbot system to retrieve the top 15 most similar candidate questions as candidate "responses". We then asked a business analyst to annotate the candidate responses, where a "response" is labeled as positive if it matches the query and negative otherwise. In all, we annotated 63,000 context-response pairs. This dataset is used as our Target data. Furthermore, we build our Source data as follows. In the AliMe chatbot, if the confidence score of answering a given user query is low, i.e., the matching score is below a given threshold, we prompt the top three related questions for users to choose from. We collected the user click logs as our source data, where we treat the clicked question as positive and the others as negative. In total, we collected 510,000 query-question pairs from the click logs as the source. For the source and target datasets, we use 80% for training, 10% for validation, and 10% for testing.
Compared Methods: We compared our multi-turn model (MT-hCNN) with two CNN-based models, ARC-I and ARC-II (Hu et al., 2014); several advanced neural matching models, namely MV-LSTM, Pyramid, Duet (Mitra et al., 2017), and SMN (Wu et al., 2017); and a degenerated version of our model that removes $\mathrm{CNN}_3$ from MT-hCNN (MT-hCNN-d).
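The candidate generation step for the target data (concatenate the last three turns into a query, then retrieve the top 15 questions by TF-IDF) can be sketched as follows. This is a toy illustration in plain Python, not the Lucene-based production system: whitespace tokenization, the smoothed IDF formula, cosine scoring, and the function names `tfidf_vectors`, `cosine`, and `retrieve` are all assumptions, and Lucene's actual scoring function differs.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(context_turns, questions, top_k=15, max_turns=3):
    """Concatenate the last max_turns utterances and rank the QA-database questions."""
    query = " ".join(context_turns[-max_turns:]).lower().split()
    docs = [q.lower().split() for q in questions]
    vecs, idf = tfidf_vectors(docs)
    # Query terms unseen in the database get zero weight.
    qvec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    ranked = sorted(range(len(questions)),
                    key=lambda i: cosine(qvec, vecs[i]), reverse=True)
    return [questions[i] for i in ranked[:top_k]]

# Hypothetical QA-database questions and conversation context.
questions = ["how to track my package", "how to use a coupon", "how to request a refund"]
context = ["hi", "i bought a phone yesterday", "where is my package"]
candidates = retrieve(context, questions)
print(candidates[0])  # how to track my package
```

Each retrieved candidate would then be paired with the concatenated context and labeled by an annotator, which is how the context-response pairs of the target data are formed.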
All the methods in this paper are implemented with TensorFlow and trained on NVIDIA Tesla K40M GPUs.
Settings: We use the same parameter settings for hCNN as in (Yu et al., 2017). For $\mathrm{CNN}_3$ in our model, we set the window size of the convolutional layer to 2, use ReLU as the activation function, and set the stride of the max-pooling layer to 2. The hidden size of the fully-connected layer is set to 128. AdaDelta is used to train our model with an initial learning rate of 0.08. We use MAP, Recall@5, Recall@2, and Recall@1 as evaluation metrics. We set $\lambda_1 = \lambda_2 = \lambda_3 = 0.05$ and $\lambda_4 = 0.005$.
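For reference, the evaluation metrics can be computed from per-query ranked candidate lists as in the sketch below. It uses the standard definitions of average precision and Recall@k; the function names, the input format (a list of (score, label) pairs per query), and the choice to score a query with no positive candidate as 0 are assumptions of this example rather than details stated in the paper.

```python
def average_precision(labels_sorted):
    """Mean of precision values at the rank of each positive candidate."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels_sorted, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def recall_at_k(labels_sorted, k):
    """Fraction of a query's positive candidates that appear in the top k."""
    positives = sum(labels_sorted)
    return sum(labels_sorted[:k]) / positives if positives else 0.0

def evaluate(queries, ks=(1, 2, 5)):
    """queries: one list of (score, label) pairs per query; label is 0 or 1."""
    ranked = [[label for _, label in sorted(c, key=lambda x: x[0], reverse=True)]
              for c in queries]
    result = {"MAP": sum(average_precision(r) for r in ranked) / len(ranked)}
    for k in ks:
        result[f"Recall@{k}"] = sum(recall_at_k(r, k) for r in ranked) / len(ranked)
    return result

# Two hypothetical queries with three scored candidates each.
queries = [
    [(0.9, 1), (0.8, 0), (0.1, 0)],
    [(0.7, 0), (0.6, 1), (0.2, 1)],
]
res = evaluate(queries)
print(res["Recall@1"], res["Recall@2"], res["Recall@5"])  # 0.5 0.75 1.0
```

When every query has exactly one positive candidate, this Recall@k reduces to the fraction of queries whose positive candidate is ranked within the top k.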

Comparison on Base Models
The comparisons on base models are shown in Table 1. First, RNN-based methods like MV-LSTM and SMN have clear advantages over the two CNN-based approaches ARC-I and ARC-II, and are better than or comparable with state-of-the-art CNN-based models like Pyramid and Duet. Second, our MT-hCNN outperforms MT-hCNN-d, which shows the benefit of adding a convolutional layer over the output representations of all the utterances. Third, we find that SMN performs less well on the AliMe data than on UDC. One potential reason is that UDC has a significantly larger data size than the AliMe data (1000k vs. 51k), which can help to train a complex model like SMN. Last but not least, our proposed MT-hCNN shows the best results in terms of all the metrics on the AliMe data, and the best results in terms of R@2 and R@1 on UDC, which shows the effectiveness of MT-hCNN.
We further evaluate the inference time of these models. As shown in Table 1, MT-hCNN has comparable or better results when compared with SMN (the state-of-the-art multi-turn conversation model), but is much more efficient than SMN (∼60% time reduction). MT-hCNN is also similarly efficient to the CNN-based methods while performing better. As a result, our MT-hCNN module is able to support a peak QPS (queries per second) of 40 on a cluster of 2 service instances, where each instance reserves 2 cores and 4G memory on an Intel Xeon E5-2430 machine. This shows that the model is applicable to industrial bots. In all, our proposed MT-hCNN is shown to be both efficient and effective for question matching in multi-turn conversations.

Transferability of our model
To evaluate the effectiveness of our transfer learning setting, we compare our full model with three baselines: Src-only that uses only source data, Tgt-only that uses only target data, and TL-S that uses both source and target data with the adversarial training as in (Liu et al., 2017).
As shown in Table 2, Src-only performs worse than Tgt-only. This shows that the source and target domains are related but different. Despite the domain shift, TL-S is able to leverage knowledge from the source domain and boost performance. Finally, our model performs better than TL-S, which shows the benefit of adding domain discriminators on both the source and target domains.

Related Work
Recent research in multi-turn conversations has focused on deep learning and reinforcement learning (Shang et al., 2015; Li et al., 2016a,b; Sordoni et al., 2015; Wu et al., 2017; Yang et al., 2018). The recently proposed Sequential Matching Network (SMN) (Wu et al., 2017) matches a response with each utterance in the context at multiple levels of granularity, leading to state-of-the-art performance on two multi-turn conversation corpora. Different from SMN, our model is built on CNN-based modules and achieves comparable results with better efficiency. We study transfer learning (TL) (Pan and Yang, 2010) to help domains with limited data. TL has been extensively studied in the last decade. With the popularity of deep learning, many neural network (NN) based methods have been proposed (Yosinski et al., 2014). A typical framework uses a shared NN to learn shared features for both source and target domains (Mou et al., 2016; Yang et al., 2017). Another approach is to use both a shared NN and domain-specific NNs to derive shared and domain-specific features (Liu et al., 2017). This is improved by some studies (Ganin et al., 2016; Taigman et al., 2017; Liu et al., 2017) that consider adversarial networks to learn more robust shared features across domains. Our TL model is based on (Liu et al., 2017), with enhanced source-specific and target-specific domain discrimination losses.

Conclusion
In this paper, we proposed a conversation model based on Multi-Turn hybrid CNN (MT-hCNN). We extended our model to adapt knowledge learned from a resource-rich domain. Extensive experiments and an online deployment in the AliMe e-commerce chatbot showed the efficiency, effectiveness, and transferability of our proposed model.