Multi-Turn Dialogue Generation in E-Commerce Platform with the Context of Historical Dialogue

As an important research topic, customer service dialogue generation tends to produce generic seller responses when it leverages only the current dialogue information. In this study, we propose a novel and extensible dialogue generation method that leverages sellers’ historical dialogue information, which is both accessible and informative. By utilizing innovative historical dialogue representation learning and a historical dialogue selection mechanism, the proposed model is capable of detecting the most related responses from sellers’ historical dialogues, which can further enhance the quality of current dialogue generation. Unlike prior dialogue generation efforts, we treat each seller’s historical dialogues as a list of Customer-Seller utterance pairs, allow the model to measure their different importance, and copy words directly from the most relevant pairs. Extensive experimental results show that the proposed approach can generate high-quality responses that cater to specific sellers’ characteristics and exhibits consistent superiority over baselines on a real-world multi-turn customer service dialogue dataset.


Introduction
Over the past years, online shopping has experienced incredible growth. On e-commerce platforms, e.g., Amazon and Taobao, intelligent customer service is becoming increasingly important because it can significantly reduce the workload of shop sellers. Ideally, sellers should provide high-quality responses to address the personal needs of the customers. However, such cost can be prohibitive for most small businesses, which motivates us to study the multi-turn dialogue generation task, which is critical in many natural language processing applications, such as customer services, intelligent assistants, and chatbots.

* Both authors contributed equally to this research.
† Corresponding Author: Zhongqing Wang.

Current Dialogue
Customer: Hello.
Seller: I'm glad to serve you, dear.
Seller: What can I do for you?
Customer: 1.65 meters tall and weigh 48 kg, which size should I buy?
Seller: In my experience, you may fit the M size.
Customer: Alright, when could you send it off?
Seller: As soon as we can.
Customer: Will you give me some discount?

Seller's Historical Dialogues
C1: Hello.
C2: I'm looking for some help.
S1: Welcome to our store.
S2: What can I do for you?
C3: I see and is there any coupons?
S3: You can find it on our main page.
S4: Click the link to get it.
C4: OK, I find it, thank you.
S5: I'm glad that I can help you.

Seller's Response
HRED: I'm sorry not.
Ground Truth: You can open the main page of our store and draw a coupon.

Table 1: An example of a customer service dialogue between the seller (S) and the customer (C), together with the generated results. The top block is the current dialogue context, the middle block is the seller's historical dialogue, and the bottom block is the generated response.
Although most existing research focuses on single-turn dialogue generation, multi-turn dialogue generation has gained increasing attention from both academia and industry. One reason is that it is closer to real application scenarios, such as chatbots and customer services. More importantly, the generation process is more difficult since there is more context information and there are more constraints to consider. Serban et al. (2016) proposed HRED, which uses a hierarchical encoder-decoder framework to model all the context sentences. Since then, HRED-based models have been widely used in different multi-turn dialogue generation tasks, and many variants have been proposed. However, the standard HRED cannot adapt easily to our customer service scenario for two reasons: (1) treating all contexts indiscriminately is not proper, since the response is usually related to only a few previous contexts; (2) ignoring dialogue background knowledge is problematic, since the response also has a close relationship with specific products, service modes and even seller characteristics. Table 1 illustrates an example in which the standard HRED trained on massive data tends to generate generic responses and cannot reproduce such unique seller-specific responses without using any external knowledge (e.g., S3).
Recent studies have noticed the problem and focused on generating appropriate seller responses by integrating external information, e.g., product attributes and titles, into single-turn dialogue generation (Gao et al., 2019). However, they are difficult to generalize in reality because the materials on hand are limited and the scenarios differ. Intuitively, sellers' historical dialogues contain richer reply clues, e.g., similar topics or even identical responses that occurred previously. Ideally, incorporating historical dialogues into our task should further improve response quality. However, such dialogues may be filled with noise or irrelevant content, which poses a huge challenge to the automatic selection of helpful context. The sellers' historical dialogues mentioned above are multi-turn dialogues pre-selected from the same sellers in our study. In this paper, we propose a novel and extensible Conditional Historical Generation model to generate high-quality seller responses. The main contributions are summarized as below:
• We propose an extensible model that is the first to study the effectiveness of incorporating historical dialogue contexts into generation.
• We propose a novel dialogue selection mechanism to locate the most relevant historical customer utterances and seller utterances, and then produce their context representations.
• We use a gated strategy to generate the final response by comprehensively considering the different importance of current dialogue and historical dialogues under a hybrid network.
• Empirical results show that our proposed approach outperforms state-of-the-art competitors significantly on a real-world multi-turn customer service dialogue dataset with both automatic and manual evaluation.

Related Work
Previous research on multi-turn dialogue generation (Chaudhuri et al., 2018; Zhou et al., 2018; Olabiyi et al., 2018) has drawn a huge amount of attention from academia and industry, since it has broader usage scenarios than single-turn dialogue generation (Zhang et al., 2018; Li et al., 2017). Recent studies have noticed that generated responses tend to be generic and try to alleviate this problem by incorporating helpful external information into response generation, e.g., speakers' emotional information (Zhang et al., 2019a,b; Wang et al., 2020). One such work proposed a review response generation model for E-commerce platforms, which used reinforcement learning and a copy mechanism to fuse external product information, thereby generating informative and diverse responses. Zheng et al. (2019) proposed a dialogue generation model considering personality traits such as age, name, and gender. Meng et al. (2019) proposed RefNet, which used background descriptions about the target dialogue and a copy mechanism to copy tokens or semantic units. However, all these models are difficult to generalize in reality because they rely on different materials, which are not always accessible.
Different from previous studies, which either simply ignore or selectively consider limited external information, we propose a novel and extensible model which integrates sellers' historical dialogues into the multi-turn dialogue generation process and avoids interference from background noise. To the best of our knowledge, this is the first attempt to incorporate helpful historical dialogues into multi-turn customer service dialogue generation.

Conditional Historical Generation
Given the current dialogue $D$ and its $R$ relevant historical dialogues in which the same seller participated, i.e., $H = \{D_1, D_2, ..., D_R\}$, our task aims to generate a high-quality response $Y$ based on the current dialogue $D$ and its historical dialogues $H$. In this section, we propose a novel Conditional Historical Generation (CHG) model and display its architecture in Figure 1. It consists of four main modules: Current Dialogue Encoder, Historical Dialogue Encoder, Response Representation Encoder and Context-Response Attention Decoder.

Current Dialogue Encoder
Let a dialogue $D$ containing $L$ utterances be denoted as $D = [u_1, u_2, ..., u_L]$, where $u_i = [w_{i1}, w_{i2}, ..., w_{il}]$ is the $i$-th utterance posted by a customer or a seller. The encoder represents the hierarchical information in the dialogue $D$ and consists of two layers: Utterance Layer and Dialogue Layer.
Utterance Layer transforms an utterance $u_i$ into a sequence of low-dimensional dense vectors through a word embedding matrix of size $V \times K$, where $V$ is the vocabulary size and $K$ is the dimension of word embeddings. Each word embedding $e_{ij}$ is then fed into a bidirectional GRU, which produces a hidden state $h_{ij} \in \mathbb{R}^Z$ for each word. There are various ways to produce the utterance representation, and the simplest one is to use the last hidden state $h_{il}$ as the final utterance representation $u_i$.
Dialogue Layer represents the global context in the dialogue via an $N$-layer Transformer-Block. One critical advantage of the block is its ability to capture long-distance dependencies among utterances. Specifically, we first parameterize position embeddings $\{c_i | i \in [1, L]\}$ for all the utterances. The position embeddings are then added to the utterance representations $\{u_i | i \in [1, L]\}$. Finally, we obtain a sequence of utterance representations $U = [u_1, u_2, ..., u_L]$, where each $u_i$ now denotes $u_i \oplus c_i$ and "$\oplus$" denotes the element-wise summation operation.
After that, given a matrix of $n$ queries $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{n \times d}$ and values $V \in \mathbb{R}^{n \times d}$, the Transformer-Block produces the output representation $O \in \mathbb{R}^{n \times d}$ (Equation 2). To obtain the context representation of dialogue $D$, the Transformer-Block takes $U$ as queries, keys, and values in Equation 2, and finally outputs the dialogue context representation $O^D$.
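The Dialogue Layer can be illustrated with a minimal sketch. It assumes that the core of the Transformer-Block is standard single-head scaled dot-product attention, $\mathrm{softmax}(QK^\top / \sqrt{d})V$; the actual block may contain further sublayers (multiple heads, feed-forward layers, normalization) that are not modeled here. The vectors, position embeddings and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_positions(utterances, positions):
    """Element-wise sum of utterance vectors and position embeddings."""
    return [[u + c for u, c in zip(vec, pos)]
            for vec, pos in zip(utterances, positions)]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention (assumed form); returns (output, weights)."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    width = len(values[0])
    output = [[sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
              for row in weights]
    return output, weights


# Hypothetical 2-dimensional utterance vectors and position embeddings.
utterance_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
position_embeddings = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
U = add_positions(utterance_vectors, position_embeddings)

# Self-attention: U serves as queries, keys and values.
O_D, attention = scaled_dot_product_attention(U, U, U)
```

Each row of `attention` is a distribution over the utterances of the dialogue, so every output vector in `O_D` mixes information from all utterances regardless of their distance.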

Historical Dialogue Encoder
For the same question initiated by a customer, different sellers may respond differently, depending on various scenarios. It is observed that historical dialogues contain lots of unique seller-specific words which can not be generated easily. This encoder can represent relevant customer questions and seller responses, respectively. It includes two layers: Utterance Layer and Dialogue Selection.
Utterance Layer: In a historical dialogue, each customer utterance (i.e., question) usually matches one or more seller utterances (i.e., responses). For example, in Figure 2, $u^C_1$ is answered by the closely following $u^S_1$ and $u^S_2$, and $u^C_2$ is answered by the closely following $u^S_3$ and $u^S_4$. With the same utterance encoder, each customer utterance is represented as $u^C_i$ and each seller utterance as $u^S_j$. Note that multiple historical dialogues are processed in the same way via simple concatenation.
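The pairing of customer and seller utterances described above can be sketched as a simple grouping step. This is only an illustration under the assumption that every seller utterance belongs to the closest preceding customer utterance; the function name and the placeholder strings are hypothetical.

```python
def group_by_customer(turns):
    """Attach each seller utterance to the closest preceding customer utterance.

    turns: list of (speaker, text) pairs, where speaker is "C" or "S".
    Returns a list of (customer_text, [seller_texts]) pairs.
    """
    groups = []
    for speaker, text in turns:
        if speaker == "C":
            groups.append((text, []))
        elif groups:
            # Seller utterances before any customer utterance are skipped.
            groups[-1][1].append(text)
    return groups


# Mirrors the example described for Figure 2.
history = [("C", "u_C1"), ("S", "u_S1"), ("S", "u_S2"),
           ("C", "u_C2"), ("S", "u_S3"), ("S", "u_S4")]
groups = group_by_customer(history)
```

With this layout, the seller attention for a customer utterance only needs to range over the seller utterances stored in its own group.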
Dialogue Selection: Different historical utterances contribute differently to the target response generation. On the one hand, only a few historical customer utterances are semantically similar to the latest customer question. On the other hand, not all the historical seller utterances respond to the historical customer utterances nearby. As shown in Figure 2, we employ a dialogue selection strategy which contains two layers: the customer attention layer selects relevant customer utterances $\{u^C_i\}$ for the latest customer question $u_L$; the seller attention layer finds relevant utterances from $\{u^S_j\}^{N_{C_i}}$ for each $u^C_i$.
(1) Customer Attention Layer: Given the latest customer question $u_L$ in the current dialogue, we use it to find similar customer utterances from historical dialogues. Specifically, we opt for an attention mechanism with trainable model parameters $W_C$, $W'_C$, $v$ and $b_C$, where $\alpha^C_i$ is the attention weight of the $i$-th historical customer question, $N_C$ is the number of historical customer questions and $o^C$ is the representation of all the related questions.
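A minimal sketch of the customer attention layer follows. It assumes an additive attention of the form $\mathrm{score}_i = v^\top \tanh(W u_L + W' u^C_i + b)$, followed by a softmax and a weighted sum; this form is an assumption chosen to match the listed parameters, not a confirmed reproduction of the model. All parameter values are arbitrary placeholders, not trained weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def customer_attention(u_latest, history_questions, W_q, W_k, v, b):
    """Additive attention (assumed form) of the latest question over history."""
    projected_query = matvec(W_q, u_latest)
    scores = []
    for u_c in history_questions:
        projected_key = matvec(W_k, u_c)
        hidden = [math.tanh(q + k + bias)
                  for q, k, bias in zip(projected_query, projected_key, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over the historical customer questions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the question vectors gives the pooled representation.
    dim = len(history_questions[0])
    pooled = [sum(a * u[c] for a, u in zip(alphas, history_questions))
              for c in range(dim)]
    return alphas, pooled


identity = [[1.0, 0.0], [0.0, 1.0]]          # placeholder weight matrices
questions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alphas, o_C = customer_attention([1.0, 0.0], questions,
                                 identity, identity, [1.0, 1.0], [0.0, 0.0])
```

The returned `alphas` play the role of $\alpha^C_i$ and `o_C` the role of $o^C$; with trained parameters, questions similar to the latest one would be expected to receive the larger weights.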

Figure 1: The architecture of our proposed model.

(2) Seller (Answer) Attention Layer: Given the representation of any historical customer question $u^C_i$, we use it to match the most relevant answers from the historical dialogue $\{u^S_j\}^{N_{C_i}}$. Specifically, we use another attention mechanism, with learnable model parameters $W_S$, $W'_S$, $v$ and $b_S$, to calculate the different importance of seller utterances, where $\alpha^S_{ij}$ is the attention weight of $u^S_j$ read by $u^C_i$. In order to obtain the final attention weight for each $u^S_j$, we use a cascading attention multiplication operation (Equation 5), which multiplies the customer-level weight by the seller-level weight. Here $\alpha_j$ is the compound attention weight, $N_S$ is the number of seller utterances and $o^S$ is the representation of all the historical seller utterances.
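The cascading multiplication can be illustrated with a small worked sketch. It assumes that each seller utterance belongs to exactly one customer question, so that its compound weight is the product $\alpha_j = \alpha^C_i \cdot \alpha^S_{ij}$. Only the product 0.756 * 1.000 = 0.756 comes from the case study; the utterance labels and all other numbers are hypothetical.

```python
def compound_weights(customer_weights, seller_weights):
    """Multiply customer-level and seller-level attention weights.

    customer_weights: {question_id: alpha_C}
    seller_weights:   {question_id: {answer_id: alpha_S}}
    Returns {answer_id: alpha_C * alpha_S}.
    """
    compound = {}
    for question, alpha_c in customer_weights.items():
        for answer, alpha_s in seller_weights.get(question, {}).items():
            compound[answer] = alpha_c * alpha_s
    return compound


# Hypothetical weights, except for the 0.756 * 1.000 product.
customer_weights = {"C1": 0.244, "C2": 0.756}
seller_weights = {"C1": {"S1": 0.6, "S2": 0.4}, "C2": {"S5": 1.0}}

alpha = compound_weights(customer_weights, seller_weights)
best = max(alpha, key=alpha.get)
```

Because both attention levels are normalized, the compound weights in this sketch again sum to one and can be used directly to weight the seller utterances.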

Response Representation Encoder
Given the response $Y = \{y_1, ..., y_M\}$ as the input, the same utterance encoder is used to transform $Y$ into a sequence of low-dimensional dense vectors $[y_1, y_2, ..., y_M]$. Then, we also parameterize position embeddings $\{c^Y_t | t \in [1, M]\}$. Another Transformer-Block takes $[y_1, y_2, ..., y_M]$ as input and outputs the response representation $O^R$. Note that we also apply a mask to the response during training, i.e., we mask $\{y_t, ..., y_M\}$ and only see $\{y_1, ..., y_{t-1}\}$ when $y_t$ is to be generated.
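The masking rule can be sketched as a visibility matrix. This is a minimal illustration that assumes the usual causal (left-to-right) mask; the function name is hypothetical, and positions are 0-indexed in the code while the text uses 1-indexed $y_t$.

```python
def visible_positions(length):
    """visible[t][s] is True if token s may be seen while generating token t.

    With 0-based indices, generating token t may only look at tokens 0..t-1.
    """
    return [[s < t for s in range(length)] for t in range(length)]


mask = visible_positions(4)
# mask[2] describes what is visible when the third response word is generated.
```

In practice a start-of-sequence symbol is usually prepended, so that the first word also has something to attend to.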

Context-Response Attention Decoder
The Decoder is a hybrid between a dialogue generation network and a dialogue copy network, as it allows both directly copying words from historical dialogues through copy mechanism and generating words from a fixed vocabulary.
Dialogue Generation: The third Transformer-Block takes the output of the Current Dialogue Encoder $O^D$ as keys and values and the output of the Response Representation Encoder $O^R$ as queries, and finally outputs $O^G$. Then, we apply a softmax layer with trainable parameters $W_G$ and $b_G$ to obtain the generation probability $p^G$ over all the words in the vocabulary.
Dialogue Copy: Inspired by the copy mechanism of Vinyals et al. (2015), we allow the decoder to copy words from historical dialogues directly. For each seller utterance $u^S_j$, we use the word vector $O^R_{t-1}$ to find the most important words with an attention mechanism. For any word $w_i$ in $u^S_j$, we obtain its attention weight $\alpha^w_{ij}$. Finally, we multiply the word attention weights $\{\alpha^w_{ij}\}$ by the answer weights $\{\alpha_j\}$ (calculated in Equation 5) and sum the products over all positions where the word occurs, which gives the probability $p^C_t$ of copying any word $y_t$. Here $W_R$, $W'_R$, $\tilde{v}$ and $b_R$ are trainable model parameters, $h^{S_j}_i$ denotes the representation of $w_i$ in $u^S_j$, and the indicator function $\mathbb{I}_{w_i = y_t}$ equals one only when $w_i = y_t$, and zero otherwise.
Hybrid Network: A flexible gated mechanism decides the degree of copying historical information automatically. Given any word $y_t$, we combine $p^G_t$ and $p^C_t$ into the final probability $p_t$ through a gate $g \in (0, 1)$. Note that if the word never appears in any seller utterance, $p^C_t$ should be zero. The gate $g$ is calculated by the gated mechanism, where $W_G$ and $b_G$ are learnable model parameters, $[;]$ denotes the vector concatenation operation, and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function.
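The interplay of copying and generation can be sketched as follows. The sketch assumes that the copy probability of a word is the sum, over its occurrences in seller utterances, of utterance weight times word weight, and that the gate mixes the two distributions as $p_t = g \cdot p^G_t + (1 - g) \cdot p^C_t$. The orientation of the gate, the function names and all numbers are assumptions for illustration; in the model the gate would be predicted by a sigmoid layer rather than fixed.

```python
from collections import defaultdict


def copy_distribution(seller_utterances, utterance_weights, word_weights):
    """Sum utterance weight * word weight over every occurrence of a word."""
    p_copy = defaultdict(float)
    for words, alpha_j, alphas_w in zip(seller_utterances,
                                        utterance_weights, word_weights):
        for word, alpha_w in zip(words, alphas_w):
            p_copy[word] += alpha_j * alpha_w
    return dict(p_copy)


def hybrid_distribution(p_gen, p_copy, gate):
    """Mix generation and copy probabilities; absent words count as zero."""
    vocabulary = set(p_gen) | set(p_copy)
    return {w: gate * p_gen.get(w, 0.0) + (1.0 - gate) * p_copy.get(w, 0.0)
            for w in vocabulary}


# Hypothetical historical seller utterances and attention weights.
seller_utterances = [["find", "coupon", "main", "page"], ["welcome"]]
utterance_weights = [0.8, 0.2]
word_weights = [[0.1, 0.5, 0.2, 0.2], [1.0]]

p_copy = copy_distribution(seller_utterances, utterance_weights, word_weights)
p_gen = {"sorry": 0.6, "coupon": 0.1, "page": 0.3}   # hypothetical softmax output
p_final = hybrid_distribution(p_gen, p_copy, gate=0.4)
```

Words such as "main" that receive no mass from the generator can still be produced through the copy path, while "sorry", which never occurs in the seller's history, is supported by the generator alone.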

Training
Our model is optimized in an end-to-end manner. Let θ denote all the model parameters. Given any

Statistical Results                       Num
total number of dialogues                 60,000
average length of utterances              27
average length of dialogues               9
average number of historical dialogues    3

Experiments
In this section, we conduct extensive experiments to study the effectiveness of our approach with both automatic and human evaluation metrics.

Dataset Construction
As far as we know, existing public dialogue datasets do not contain enough sellers' historical dialogues, so we construct a real-world dataset from a top online shopping website in China.
Though our experiments are based on a Chinese dataset, our approach can be easily adapted to other languages, such as English and Japanese. Specifically, we collect 60K multi-turn service dialogues in the clothing domain. For each dialogue, we randomly sample 1-5 of the latest historical dialogues with the same seller, product, and service topic. According to the statistics, the average number of utterances per dialogue is 9, and each utterance contains 27 Chinese characters on average. We partition the dataset into training/validation/test sets with an 80/10/10 split. The statistical results of our dataset are displayed in Table 2. All the related resources will be publicly available.
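An 80/10/10 partition of this kind can be sketched in a few lines. The sketch assumes a seeded random shuffle at the dialogue level; how the split was actually drawn (for example, whether dialogues of one seller were kept together) is not stated, and the function name is hypothetical.

```python
import random


def split_dataset(items, seed=0, train_pct=80, valid_pct=10):
    """Shuffle with a fixed seed and cut into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Dialogue ids stand in for the 60K collected dialogues.
train_set, valid_set, test_set = split_dataset(range(60000))
```

Fixing the seed keeps the partition reproducible across runs, which matters when hyper-parameters are tuned on the validation part.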

Experimental Settings
All the learnable model parameters are initialized by sampling values from a uniform distribution U(−0.01, 0.01). The hyper-parameters are tuned on the validation set. The best settings of all the hyper-parameters are summarized in Table 3.
To evaluate our approach, we adopt the widely used BLEU, ROUGE, and Distinct as automatic evaluation metrics. BLEU (Papineni et al., 2002) is widely used in neural machine translation and measures the word overlap between the generated text and the ground truth. The BLEU score is calculated using the NLTK package and is reported as the average of BLEU-1 to BLEU-4. ROUGE (Lin, 2005) is another popular automatic evaluation metric in text summarization. The ROUGE score is obtained through the Rouge package. We report ROUGE-1, ROUGE-2, and ROUGE-L in this work. Distinct, proposed by Li et al. (2015), evaluates the diversity of the generated responses by calculating the number of distinct unigrams and bigrams in the generated responses.
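As an illustration of the diversity metric, the following sketch computes a Distinct-n score. It assumes the commonly used normalized variant, i.e., the number of distinct n-grams divided by the total number of n-grams over all generated responses; whether the reported numbers use exactly this normalization is an assumption. The example responses are hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over tokenized responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical tokenized responses.
responses = [["thank", "you"], ["thank", "you", "dear"]]
distinct_1 = distinct_n(responses, 1)   # 3 distinct of 5 unigrams
distinct_2 = distinct_n(responses, 2)   # 2 distinct of 3 bigrams
```

A model that repeats the same generic reply for every customer obtains a low score, which is why the metric is a useful complement to overlap-based BLEU and ROUGE.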
All the methods are implemented by ourselves with PyTorch and run on a server configured with a Tesla V100 GPU, 2 CPUs, and 32 GB of memory.

Comparison with Baselines
We compare the proposed approach with the following advanced baseline methods:
1) Seq2Seq+Att is the standard Seq2Seq model with attention mechanism (Sutskever et al., 2014).
2) HRED uses a hierarchical encoder-decoder framework to model all the context utterances, which has been widely used in different multi-turn dialogue generation tasks (Serban et al., 2016).
3) HRED+HD augments HRED with the historical dialogues. We simply treat the historical dialogues as the context of the current dialogue.
4) ReCoSa uses the self-attention mechanism to measure the relevance between the response and each context, which is the "state-of-the-art" multi-turn dialogue generation model and closely related to our work (Zhang et al., 2019c).
5) ReCoSa+HD uses the same merge method as that used in "HRED+HD".
Results and Analysis: The comparison results are reported in Table 4. All the experiments are repeated 10 times, and a t-test shows that the improvement of our model is significant (i.e., p < 0.005). ReCoSa is the "state-of-the-art" method, which performs better than the traditional "Seq2Seq+Att" and "HRED" because it uses a self-attention mechanism. However, none of these methods can compete with the methods considering historical information. It is observed that "ReCoSa+HD" and "HRED+HD" achieve further improvements on all the metrics, which indicates that their generated responses can borrow from sellers' historical dialogue information, which contains product attributes, seller characteristics, and even similar responses. The results illustrate the effectiveness of using historical information.
Our model performs better than "ReCoSa+HD" and "HRED+HD" consistently on all the metrics. This is because the competitors do not specifically model the historical context information, and they are sensitive to irrelevant dialogue noise. In contrast, our model uses a dialogue selection module to pinpoint the most relevant responses among the historical responses. Meanwhile, our model uses a gated mechanism to balance historical information copying and dialogue generation.
Table 6: Comparison among differently configured dialogue generation models using automatic evaluation metrics. The notation "-" denotes removing a specific component used in our model. The best results are highlighted.
(We don't provide freight insurance now, but if you want to return it, we will pay for 6 CNY freight.)
Ground Truth: 现在不提供运费险，如果你不喜欢，我们会承担6元运费退货或者换货。 (Freight insurance is not provided now. If you don't like it, we will pay 6 CNY freight return or exchange.)
The number of historical dialogues may influence model performance greatly. Therefore, we build a smaller historical dialogue dataset by halving each seller's historical dialogues. The results show that our model with 50% of the historical dialogues still performs better than ReCoSa on nearly all the metrics, but slightly worse than our model trained on the full historical dialogues. This is reasonable because more historical dialogues contain more similar responses, and our model is insensitive to dialogue noise.

Human Evaluation
We randomly sampled 2,000 dialogues to conduct a manual evaluation and employed three annotators with professional background knowledge to rate the generated responses with scores from 0 to 3, labeling each response with the majority score. The annotators cannot see the historical dialogues; only the current dialogue, the model-generated responses, and the ground truth are available for them to make the quality judgments. Score 0: unreadable responses. Score 1: incorrect or irrelevant responses. Score 2: partially relevant and correct responses. Score 3: correct and relevant responses. Score: the weighted sum of all the scores. The distributions over scores for each model are displayed in Table 5. From the results in Table 5, we can observe that the models using historical dialogues usually generate more high-quality responses than the competitors ignoring them. Our model obtains the highest weighted score among all the methods. This again shows that using historical dialogues indeed helps to generate high-quality responses, which are more consistent with the sellers' real responses in the customer service scenario.
Figure 3: The pairwise interactive representation of the example in the case study. Current customer question: "Do you provide freight insurance?" Ground truth: "Freight insurance is not provided now. If you don't like it, we will pay 6 CNY freight return or exchange."
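The scoring protocol can be sketched as follows. It assumes that the weighted score is the sum of each score value times the fraction of responses labeled with it, and that a three-way disagreement among annotators is resolved by the median; neither detail is stated explicitly, and the ratings and function names are hypothetical.

```python
from collections import Counter
from statistics import median


def majority_label(ratings):
    """Majority score of three annotators; median on a three-way tie (assumed)."""
    score, count = Counter(ratings).most_common(1)[0]
    if count >= 2:
        return score
    return int(median(ratings))


def weighted_score(labels):
    """Sum of score value times the fraction of responses with that score."""
    counts = Counter(labels)
    return sum(score * counts[score] / len(labels) for score in counts)


# Hypothetical ratings of four responses by three annotators.
ratings = [(3, 3, 2), (1, 0, 1), (2, 2, 2), (0, 1, 2)]
labels = [majority_label(r) for r in ratings]
overall = weighted_score(labels)
```

Under these assumptions the weighted score equals the mean label, so it ranges from 0 (all responses unreadable) to 3 (all responses correct and relevant).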

Ablation Study
Different model configurations may influence model performance greatly. Thus, we conduct an ablation study to validate the effectiveness of each model component used in this work. Table 6 shows the results of the ablation test based on various automatic evaluation metrics. We design several partially configured model variants, including: "-(C-S)" means the model doesn't distinguish between speakers and copies from all the historical utterances; "-gate" removes the gated mechanism; "-copy" removes the copy mechanism.
From Table 6, we can see that none of the partially configured models can compete with our fully configured model. We give an in-depth analysis below:
-(C-S): Customer and seller usually play different roles in historical dialogues, and seller utterances can provide more response clues than customer utterances. Without differentiating speakers, the model may repeat customer questions rather than generate responses.
-copy: We find that the copy mechanism helps a lot in improving the Distinct metrics because it can directly copy some out-of-vocabulary words from the relevant historical dialogues, which tends to produce seller-specific responses rather than generic ones. This naturally achieves better performance on the BLEU and ROUGE metrics.
-gate: The generation module and the copy module usually contribute differently to the generation at each time step. Without the gating mechanism, $p^G_t$ and $p^C_t$ are weighted equally, i.e., $p_t = \frac{1}{2} p^G_t + \frac{1}{2} p^C_t$. In this case the model tends to favor the generation module over the copy module, which leads to generic responses rather than seller-specific ones.

Case Study
To compare different models intuitively, we give a multi-turn dialogue example in Table 7, in which the original Chinese text has been translated into English. We compare our approach with ReCoSa ignoring/using historical information and display their generated results. From Table 7, we can see that when the customer asks whether there is freight insurance, ReCoSa generates an inappropriate response (I'm sorry not, but you can buy it by yourself.). This is because ReCoSa cannot learn seller-specific responses from massive data without considering any external information. Instead, "ReCoSa+HD" and our approach generate much better responses by using external information from the historical dialogue, which contains responses similar to the ground truth. Our approach performs the best because our historical dialogue selection strategy allows it to copy more response details (e.g., "6 CNY").
We also give an example of calculating the attention weights of historical seller utterances in Figure 3, where customer utterances are on the left and seller utterances are on the right, the edges denote Customer-Seller interactions, and the attention weights are listed alongside. It is observed that S5 has the largest attention weight, computed as 0.756 * 1.000 = 0.756, which again demonstrates the effectiveness of our historical dialogue selection strategy in finding relevant seller responses.

Conclusion
In this paper, we propose a novel Conditional Historical Generation model for generating high-quality multi-turn dialogues in the E-commerce scenario. Different from previous studies, which utilize various kinds of external information limited to a specific scenario, our model incorporates historical dialogue information into generation and is easy to generalize and apply to practical applications. Specifically, we introduce a novel historical dialogue selection strategy to find appropriate historical seller responses for the latest customer question. Finally, a gated mechanism is used to fuse the results from both the generation module and the copy module. The experimental results on a real-world multi-turn dialogue dataset show the effectiveness of our approach.
In the future, we will consider using customer characteristics for generating personalized responses for different customers.