Cross Copy Network for Dialogue Generation

In the past few years, researchers from different fields have witnessed the achievements of sequence-to-sequence models (e.g., LSTM+attention, Pointer Generator Networks, and Transformer) in dialogue content generation. While content fluency and accuracy often serve as the major indicators for model training, dialogue logic, which carries critical information in some particular domains, is often ignored. Taking customer service and court debate dialogues as examples, compatible logic can be observed across different dialogue instances, and this information can provide vital evidence for utterance generation. In this paper, we propose a novel network architecture, Cross Copy Networks (CCN), to explore the current dialogue context and the logical structure of similar dialogue instances simultaneously. Experiments on two tasks, court debate and customer service content generation, show that the proposed algorithm is superior to existing state-of-the-art content generation models.


Introduction
As an important task in Natural Language Generation (NLG), dialogue generation empowers a wide spectrum of applications, such as chatbots and customer service automation. In the past few years, breakthroughs in dialogue generation technology focused on a series of sequence-to-sequence models (Sutskever et al., 2014). More recently, external knowledge has been employed to enhance model performance: Liu et al. (2018), among others, assist dialogue generation by using knowledge triples. Similarly, Li et al. (2019); Rajpurkar et al. (2018); Reddy et al. (2019) explore documents as a knowledge source for dialogue generation, and Xia et al. (2017); Ye et al. (2020); Ghazvininejad et al. (2018); Parthasarathi and Pineau (2018) utilize unstructured knowledge in open-domain dialogue generation. However, the unaffordable cost of knowledge construction and defective domain adaptation restrict their utilization.
Copy-based generation models (Vinyals et al., 2015; Gu et al., 2016) have been widely adopted in content generation tasks and show better results than sequence-to-sequence models when faced with the out-of-vocabulary problem. Since they leverage vocabulary and context distributions for content copy, they can copy named entities (e.g., person names, locations, company names) that appear in the preceding context to improve the specificity of the generated text.
In the task of dialogue generation, we can often observe shared phrase and utterance patterns across different "similar dialogue" instances. For example, in customer service, similar inquiries from customers receive similar responses from the staff. This motivates us to build a model that can not only copy content within the preceding context of the target dialogue instance, but also learn the patterns shared across cases similar to the target instance. Such external copy can be critical in some scenarios.
As shown in Figure 1, we propose two different copy mechanisms in this study: vertical copy, which copies context-dependent information within the target dialogue instance, and horizontal copy, which copies logic-dependent content across different 'Similar Cases' (SC). This framework is named Cross Copy Networks (CCN). As the example dialogue depicts, judges may repeat (horizontally copy) words, phrases, or utterances from historical dialogues when those SCs share similar content, e.g., 'A sues B because of X and Y'.
In order to validate the proposed model, we employ two different dialogue datasets from two orthogonal domains: court debate and customer service. We apply the proposed CCN to both datasets for dialogue generation. Experiments show that our model achieves the best results.

Figure 1: An example from the court debate showing the intuition of utterance generation by leveraging its context and the information from its similar cases. We name the copy process from the context vertical copy and the one copying from its neighbor cases horizontal copy.

To sum up, our contributions are as follows:
• We propose a new end-to-end model, Cross Copy Networks (CCN), which enables internal (vertical) copy from the target dialogue and external (horizontal) copy from similar cases in the dataset without employing any external resources.
• We validate the proposed model on two different datasets: court debate and customer service. Experiments show that our model achieves state-of-the-art results on both domain datasets.
• To motivate other scholars to investigate this novel and important problem, we make the experimental datasets publicly available 1 .

Model
In this section, we introduce the proposed model, the Cross Copy Network, which has three major components:
1. Target Case Representation: we obtain the target case representation with two attention distributions, at the utterance layer and the dialogue layer, which contribute to the final attention distribution (Section 2.1);
2. Similar Case Representation: we fine-tune a pre-trained language model to obtain similar cases, and adopt the same method as for the target case to encode SCs (Section 2.2);
3. Cross Copy: we learn two pointer distributions, which are used to achieve internal (vertical) copy and external (horizontal) copy, respectively (Section 2.3).

1 https://github.com/jichangzhen/CCN

Target Case Representation
Given a dialogue D = {(U, R)_L} containing L utterances, U and R stand for the utterance and the role of the speaker, respectively. Each utterance in the dialogue is expressed as U_i = {w_i1, w_i2, ..., w_il}, where l denotes the length of the utterance. To distinguish SCs from the original context, we define the original context (historical dialogue) as the Target Case.
Our encoder is shown in Figure 2. It is designed with a hierarchical infrastructure consisting of three components: an utterance layer, a dialogue layer, and a transformer layer.

Utterance Layer
In a dialogue, role information can make a critical contribution to the task of dialogue generation, and different roles may not share consistent lexical spaces. For the role information R_i, we utilize a 100-dimensional vector to represent each role, which is randomly initialized and updated via back-propagation.

Figure 2: The encoder of CCN is divided into three levels: (1) Utterance layer: encodes role information and word-level information; (2) Dialogue layer: encodes sentence-level information; (3) Transformer layer: captures long-distance dependencies in the dialogue.
To take the role information into consideration for utterance representation learning, we concatenate the role information with each word of the utterance, expressed as S_ij, and use Bidirectional Long Short-Term Memory networks (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) to encode the semantics of the utterance while maintaining its syntactic information; the resulting word-level hidden states are expressed as h^d.
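A minimal sketch of this role-aware utterance encoding, using a vanilla bidirectional RNN cell as a stand-in for the Bi-LSTM; all dimensions, weights, and role names here are toy illustrations, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

EMB, ROLE, HID = 8, 4, 6          # toy sizes; the paper uses 300/100/300
role_table = {"judge": rng.normal(size=ROLE),
              "plaintiff": rng.normal(size=ROLE)}

def encode_utterance(word_embs, role):
    """Concatenate the role vector to every word embedding, then run a
    bidirectional vanilla RNN (a simplified stand-in for the Bi-LSTM)."""
    r = role_table[role]
    s = np.concatenate([word_embs, np.tile(r, (len(word_embs), 1))], axis=1)
    Wf = rng.normal(scale=0.1, size=(EMB + ROLE, HID))
    Wb = rng.normal(scale=0.1, size=(EMB + ROLE, HID))
    fwd, bwd = [], []
    hf = hb = np.zeros(HID)
    for t in range(len(s)):                    # forward pass
        hf = np.tanh(s[t] @ Wf + hf)
        fwd.append(hf)
    for t in reversed(range(len(s))):          # backward pass
        hb = np.tanh(s[t] @ Wb + hb)
        bwd.append(hb)
    return np.concatenate([np.stack(fwd), np.stack(bwd[::-1])], axis=1)

h = encode_utterance(rng.normal(size=(5, EMB)), "judge")  # 5-word utterance
print(h.shape)  # (5, 12): one 2*HID state per word
```

Each word thus carries both its own semantics and the speaker's role, which is what lets the model keep the lexical spaces of judge, plaintiff, and defendant apart.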
To weigh the importance of different pieces of historical dialogue information, we adopt the attention mechanism (Bahdanau et al., 2014) to obtain the utterance-level attention distribution a^u_j over the words in the historical dialogue and the utterance context h^U_i:

a^u_j = softmax(tanh(W_u h^d_j + b_u)),  h^U_i = Σ_j a^u_j h^d_j

The a^u_j represents the word probability distribution for the target utterance 2 .

Dialogue Layer
In order to represent the contextual information of the dialogue, we again use a Bi-LSTM to encode the inter-utterance dependencies and obtain a global representation of each utterance within the dialogue, denoted as h^D.
We obtain the dialogue-layer attention distribution a^d_i, a probability distribution over the prior utterances in the target dialogue, computed over the utterance states h^D in the same manner as a^u_j.

2 W_u and b_u are learnable parameters; tanh is the hyperbolic tangent function.
The final context attention distribution A^d of the target case can be expressed as the product of a^u_j and a^d_i:

A^d = a^u_j · a^d_i
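The two attention levels can be sketched numerically: a word-level distribution within each utterance and an utterance-level distribution over the dialogue, whose product yields one joint distribution over all context words. Shapes, weights, and the additive-style scoring below are toy illustrations, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, T, H = 3, 4, 6                     # 3 utterances, 4 words each, hidden 6
h_word = rng.normal(size=(L, T, H))   # word-level states (h^d analogue)
h_dial = rng.normal(size=(L, H))      # utterance-level states (h^D analogue)
W_u, b_u = rng.normal(size=(H,)), 0.0
W_d, b_d = rng.normal(size=(H,)), 0.0

# word-level attention a^u within each utterance
a_u = softmax(np.tanh(h_word @ W_u + b_u), axis=1)      # (L, T)
# utterance-level attention a^d over the dialogue
a_d = softmax(np.tanh(h_dial @ W_d + b_d))              # (L,)
# joint context distribution A^d: product of the two levels
A_d = a_u * a_d[:, None]                                # (L, T)
print(round(float(A_d.sum()), 6))  # 1.0
```

Because each a^u row and the a^d vector are both normalized, their product remains a proper probability distribution over every word in the dialogue context, which is what the vertical copy later samples from.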

Transformer Layer
To expand the model's ability to focus on different locations in a long context, we adopt multi-head self-attention (Vaswani et al., 2017) to obtain an enhanced representation, denoted as Transformer-Block. We feed h^D into an N-layer Transformer-Block to capture the long-distance dependencies of the dialogue. Following this strategy, the final target case representation is:

C^d = Transformer-Block_N(h^D)
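A single head of the self-attention inside such a Transformer-Block can be sketched as follows; the paper stacks N layers with 8 heads each, while this toy version omits the residual connections, layer normalization, and feed-forward sublayer:

```python
import numpy as np

rng = np.random.default_rng(2)

def self_attention(h, d_k):
    """One head of scaled dot-product self-attention over utterance states."""
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(h.shape[1], d_k))
                  for _ in range(3))
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # context-mixed states

h_D = rng.normal(size=(5, 6))      # 5 utterance states, hidden size 6
out = self_attention(h_D, d_k=4)
print(out.shape)  # (5, 4)
```

Every output state is a weighted mixture of all utterance states, so distant utterances can influence each other in one step, which is why this layer helps with long dialogues.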

Similar Case Representation
In this section, we introduce the approach of obtaining and representing similar cases.

Similar Case Finding
The similar cases (SCs) of the target case are discovered from the same dataset in which the target case resides. To make retrieval efficient, we use ElasticSearch 3 to retrieve the top 50 similar cases as candidates, using the target case as a query and all the other cases as documents. To make it more effective, we then fine-tune a pre-trained language model on the concatenation of the target case and each candidate retrieved above as a binary classifier, to obtain a similarity score.

Figure 3: The decoder of CCN. It learns the pointer distribution α to obtain the content to be copied from the context (vertical copy) as well as the pointer distribution β to determine the content to be copied from its similar cases (horizontal copy).
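The retrieve-then-rerank procedure can be sketched as below. The `overlap_score` retriever is a crude stand-in for ElasticSearch and `rerank_score` for the fine-tuned language-model classifier; both function names and the toy corpus are hypothetical, chosen only to illustrate the two-stage structure:

```python
def overlap_score(query, doc):
    """Token-overlap score: fraction of query tokens found in the doc
    (a rough stand-in for a real full-text retrieval engine)."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

def find_similar_cases(target, corpus, n_candidates=50, top_k=3,
                       rerank_score=None):
    # stage 1: retrieve cheap lexical candidates (ElasticSearch stand-in)
    cands = sorted(corpus, key=lambda d: overlap_score(target, d),
                   reverse=True)[:n_candidates]
    # stage 2: re-rank target/candidate pairs with a similarity scorer
    # (in the paper, a fine-tuned binary classifier over the pair)
    score = rerank_score or (lambda t, d: overlap_score(t, d))
    return sorted(cands, key=lambda d: score(target, d), reverse=True)[:top_k]

corpus = ["A sues B over a private loan", "weather chat",
          "A sues C over loan"]
print(find_similar_cases("A sues B over unpaid loan", corpus, top_k=2))
```

The cheap first stage keeps the expensive pairwise classifier tractable: it only scores 50 candidates per target instead of the whole corpus.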

Similar Case Encoding
For SC encoding, we adopt the same method as for the target case 5 . We fuse role information with each word of each utterance in the SC and use a Bi-LSTM to obtain the hidden states h^s. Next, we adopt the attention mechanism to obtain the utterance-layer and dialogue-layer distributions: the attention distribution a^u*_j over the words of each utterance and the distribution a^d*_i over the utterances of each SC. The final attention distribution A^s is the product of a^u*_j and a^d*_i. Finally, we use the N-layer Transformer-Block to obtain the final SC representation C^s.

Cross Copy
In this section, we learn two pointer distributions, α and β, to achieve internal (vertical) copy and external (horizontal) copy. The decoder's structure is shown in Figure 3.
At time step t, we concatenate the target case context vector C^d_t with the decoder state s_t to obtain the distribution over the current vocabulary. For the cross copy, the algorithm is executed in two stages.

5 We use two identical encoders to encode the target case and the similar cases; the two encoders' parameters are not shared.
In the first stage, we perform vertical copy. Given the target case encoder hidden states h^d, the context vector C^d, and the decoder hidden state s mentioned above, at time step t we learn the vertical copy probability α, which determines whether to copy words from the historical dialogue. It can be expressed as Eq. 7:

α = σ(W_h h^d + W_c C^d_t + W_s s_t + b_d)    (7)

Combined with the attention distribution A^d, we obtain the dynamic extended vocabulary v^d through the pointer distribution α. In the second stage, we learn the horizontal copy probability β for the SCs. Given the SC context vector C^s and hidden states h^s, at time step t we combine the decoder hidden state s to get the horizontal copy pointer distribution β:

β = σ(W_h* h^s + W_c* C^s_t + W_s* s_t + b_s)

From the encoder we obtain the attention distribution A^s of the SCs, and then perform a second expansion of the dynamic vocabulary to obtain the final vocabulary v^s. It should be noted that if a word w is not in the original vocabulary but appears in an SC, β is 1 and the model copies w from the SC. One advantage of our model is that it can thus produce out-of-vocabulary words.
In the formulas above, σ is the sigmoid function; W_h, W_c, W_s, b_d, W_h*, W_c*, W_s*, and b_s are learnable parameters.
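The two-stage vocabulary extension can be sketched numerically: α mixes the generation distribution with the context attention A^d, then β mixes the result with the similar-case attention A^s. The mixing convention below (gate value = copy weight) follows the description above; the values and vocabulary indices are toy illustrations:

```python
import numpy as np

def scatter(attn, ids, vocab_size):
    """Sum attention mass onto the (extended) vocabulary positions."""
    out = np.zeros(vocab_size)
    np.add.at(out, ids, attn)
    return out

V = 6                                   # extended vocab: base words + OOVs
p_vocab = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])  # OOV ids get 0 mass
A_d = np.array([0.7, 0.3]); ctx_ids = np.array([1, 3])   # context words
A_s = np.array([0.4, 0.6]); sc_ids = np.array([3, 5])    # similar-case words
alpha, beta = 0.4, 0.3                  # copy gates (Eq. 7 analogues)

# stage 1 (vertical): mix generation with copying from the target context
p_vd = (1 - alpha) * p_vocab + alpha * scatter(A_d, ctx_ids, V)
# stage 2 (horizontal): mix with copying from the similar cases
p_vs = (1 - beta) * p_vd + beta * scatter(A_s, sc_ids, V)
print(round(float(p_vs.sum()), 6))  # 1.0: still a valid distribution
```

Note that word id 5 receives zero mass from both the base vocabulary and the context, yet ends up with positive probability purely through the horizontal stage, mirroring how an out-of-vocabulary word present only in a similar case can still be generated.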

Loss function
In this dialogue generation task, for each dialogue D the loss function is defined as the negative log-likelihood of the target words w*_t under the final distribution:

loss(D) = − Σ_t log P(w*_t)

Denoting all the parameters of our model as δ, we obtain the following objective function:

J(δ) = (1/N) Σ_D loss(D)

To minimize the objective function, we use the diagonal variant of Adam (Zeiler, 2012).
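Assuming the standard negative log-likelihood form sketched above (the paper's exact formula is not reproduced here), the per-dialogue loss can be computed as:

```python
import math

def nll_loss(step_probs, target_ids):
    """Average negative log-probability assigned to the gold words."""
    return -sum(math.log(p[t])
                for p, t in zip(step_probs, target_ids)) / len(target_ids)

# two decoding steps; each row is P over a toy 3-word extended vocabulary
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
loss = nll_loss(probs, target_ids=[0, 1])
print(round(loss, 4))  # mean of -log 0.7 and -log 0.8
```

Because the probabilities come from the extended vocabulary v^s, copied out-of-vocabulary words contribute to the loss exactly like ordinary words.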

Dataset
We employed two datasets for the experiments: the Court Debate Dataset (CDD) from the judicial field and the Jing Dong Dialogue Corpus (JDDC) 8 from the e-commerce field. The details of the datasets are given in Table 1. For training, the data is divided into training, development, and test sets 6 .

Court Debate Dataset
For CDD, we collected 121,016 court debate records of private lending dispute civil cases 7 . We take the judge's historical conversation with the plaintiff and the defendant as the model input, and the judge's utterance as the model output. These records were divided into 260,190 sample pairs by experts with legal knowledge.
Jing Dong Dialogue Corpus
For JDDC, we adopted the top 326,603 cases. The proposed algorithm and baselines are set to generate the utterances of the customer service staff, with the historical context between the customer service staff and the customer as input.

Evaluation Metrics
We adopt two evaluation methods to validate the proposed model: Automatic Evaluation and Human Evaluation.

Automatic Evaluation
To evaluate the effectiveness of the dialogue generated by CCN, we use ROUGE (Lin and Hovy, 2003) and BLEU (Papineni et al., 2002) scores to compare the models. We report ROUGE-1, ROUGE-L, and BLEU to compare the strengths and weaknesses of each model.
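For reference, ROUGE-1 reduces to unigram overlap between the generated and reference utterances; a minimal F1 variant can be sketched as follows (real evaluations should use the official toolkits, which also handle stemming and multiple references):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated and a reference utterance."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())      # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("do you admit the loan", "do you admit this loan"))  # 0.8
```

BLEU differs mainly in using clipped n-gram precision (up to 4-grams) with a brevity penalty rather than recall.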

Human Evaluation
In order to assess the rationality and correctness of the generated utterances, we also conducted a human evaluation. We randomly selected 300 samples from the test set. Then, we recruited five annotators 9 to judge the quality of the generated utterances from two perspectives (Ke et al., 2018; Zhu et al., 2019):

6 The entire dataset is divided by a ratio of 8:1:1 for training, development, and testing, respectively.
7 Private lending dispute cases are the most frequent type of civil case in China. This dataset was provided by the High People's Court of a province in China. All court transcripts are manually recorded by the court clerk.
8 http://jddc.jd.com/auth_environment
9 All annotators took basic annotation training before the experiment.

Table 3: Qualitative evaluation. We report the average score (Avg) and the κ value for relevance and fluency. We recruited five annotators to evaluate the sentences generated by all the models. To be fair, for each input we shuffled the outputs generated by all the models before letting the annotators evaluate them. The κ represents the consistency of evaluation across annotators; κ coefficients between 0.48 and 0.82 indicate moderate to substantial agreement.

• Relevance: the generated utterance is logically relevant to the dialogue context and provides meaningful information.
• Fluency: the generated utterance is fluent and grammatical.
These two aspects are evaluated independently. For each aspect, we set three score levels: +2, +1, and 0. A score of 2 stands for excellent: for relevance, closely related to the historical dialogue and meaningful; for fluency, highly readable without grammatical errors. A score of 1 stands for good: for relevance, containing some off-topic information; for fluency, readable but with slight grammatical errors. A score of 0 means poor: for relevance, off-topic or meaningless; for fluency, poor readability or serious grammatical errors. Finally, we compute the weighted average score and the kappa (κ) of each model to compare their performance.
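The κ reported here measures inter-annotator agreement; for a pair of annotators it can be sketched as Cohen's κ (a multi-rater variant such as Fleiss' κ generalizes this to all five annotators; the scores below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Agreement between two annotators' labels, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

# relevance scores (0/1/2) from two annotators on six generated utterances
ann1 = [2, 2, 1, 0, 1, 2]
ann2 = [2, 1, 1, 0, 1, 2]
print(round(cohens_kappa(ann1, ann2), 3))
```

Subtracting the chance-agreement term is what makes κ more informative than raw percent agreement when score distributions are skewed.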

Training Details
During training, we set the dimension of the word embeddings to 300 and use word2vec to build the initial word vectors. The dimension of the role embedding is set to 100 with random initialization. The hidden size is set to 300, and we use a 4-layer Transformer with 8 heads. The dropout probability is set to 0.8. Based on these settings, we optimize the objective function with a learning rate of 5e-4 and perform mini-batch gradient descent with a batch size of 64. We set the maximum utterance length in the decoder to 40 during generation (the generated sentence might contain sub-utterances).

Overall Performance
In the experiments, we use up to three similar cases to validate the effectiveness of CCN, i.e., leveraging the most similar case (top-1), the top two similar cases (top-2), and the top three (top-3). In addition, we also test the variant CCN (vertical-only), which only adopts vertical copy from the context; it is similar to the baseline PGN setting but uses the proposed hierarchical dialogue encoders.
The performance of all the tested methods is reported in Table 2 and Table 3 for quantitative and qualitative evaluation, respectively. As Table 2 shows, the proposed CCN and its variants outperform all the baselines in ROUGE and BLEU metrics over the two datasets. We also observe increasing performance as the number of referred similar cases increases. For the two qualitative criteria, CCN likewise outperforms the baselines by a large margin. Note that the kappa value (κ) indicates the agreement among the annotators. As mentioned above, increasing the number of referred similar cases improves performance, which demonstrates that the horizontal copy plays a critical role in dialogue generation without employing any external resources. However, training slows as the number of similar cases increases; considering the time cost and memory limitations, only up to the top three similar cases are utilized in this experiment.
Figure 4 shows two examples that illustrate the performance of the different methods. As depicted in case 1, compared with the baselines, CCN can learn dialogue logic from SCs and accurately locate the sentence to complete the horizontal copy. Another important finding is that SCs can be used to obtain more accurate representations: the model identifies specific entities from the context for vertical copy while capturing the discourse patterns from the similar cases for horizontal copy, finally synthesizing the sentence to be generated.

Case study
The baseline models, on the other hand, are more inclined to generate general expressions that appear frequently in the training data, paying little attention to the specific information in the context and the logical discourse patterns that appear in certain circumstances.

Error analysis
In order to explore the limitations of the algorithm and the capability boundary of the model, we summarize the samples with high error rates. The following observations scope the limitations of the current model and may enlighten future investigation in this line of research.
In CDD, 53% of errors 10 occur when the generated sentence contains information that appears neither in the context nor in the similar cases (e.g., "According to the provisions of Articles 44 and 45 of the Civil Procedure Law of the People's Republic of China, if the parties find that the members of the collegiate bench or the clerk fall under any of the following circumstances, they have the right to apply for their recusal orally or in writing."). Similarly, in JDDC, this problem caused 47% of errors (e.g., "Sorry, we cannot refund you for the product [#price, #style, #brand, #specification, #color] you requested."). Generating such law- or product-related information might require external expert knowledge. In addition, 23% and 36% of errors in CDD and JDDC, respectively, occur when a long sentence must be generated (e.g., the sentence length exceeds 30 words for the judge's inquiry or the customer service response).
To address these problems in future research, enhancing the long-range dependence of language models and establishing relations between different entities are promising directions.

Pointer Network
Pointer networks (Vinyals et al., 2015) are a special network structure that solves the problem of generating sequences whose outputs depend on the input sequence. On this basis, CopyNet (Gu et al., 2016) and PGN (See et al., 2017) were proposed, which can copy words from the context into the output sequence to cope with the out-of-vocabulary (OOV) problem. Pointer networks are increasingly popular in NLP applications. In text summarization, Miao and Blunsom (2016) used them to select only suitable words from the context instead of the entire dictionary for sentence compression; Sun et al. (2018) used them to generate text titles; Wang et al. (2019a) generated new conceptual words; Eric and Manning (2017) used them to develop a recurrent neural dialogue system; and others used them to speed up model convergence and to capture richer latent alignments. They have also been widely used in many other tasks, such as dependency parsing (Fernández-González, 2019; Liu et al., 2019a), question answering (Kadlec et al., 2016; Golchha et al., 2019), machine reading comprehension, machine translation, and language models (Merity et al., 2016).
Unlike previous studies, we introduce external copy on top of internal copy, establishing a cross-copy structure that achieves significant improvement.

Dialogue System
As an important NLP task, dialogue systems have achieved great success and are widely used in practical applications, including customer service systems and chatbots. In recent years, with the development of deep learning, neural models have made significant progress: sentence rewriting has been used to address information omission and co-reference across multiple rounds of dialogue; Lu et al. (2019) addressed reply selection in dialogue systems by adding temporal and spatial features; and Du and Black (2019) addressed the lack of diversity in replies by using a binary function to judge whether two responses are similar.
A number of prior studies assist the task of dialogue generation through external knowledge (e.g., Wu et al. (2019)). As research on dialogue generation deepens, various new tasks have been proposed: Le et al. (2019) generated the most appropriate response given video content, a video title, and the existing dialogue; Tang et al. (2019) studied how to lead an open conversation toward a specific goal; Wang et al. (2019b) studied how to use different persuasion strategies in dialogue to persuade people to donate to charities; and Cao et al. (2019) examined the application of dialogue analysis in psychotherapy.
In contrast, our CCN addresses the problem of defective domain adaptation without any costly external knowledge.

Conclusion and Outlook
In this paper, we proposed a novel neural network structure, Cross Copy Networks, enabling both vertical copy (from the dialogue context) and horizontal copy (from similar cases). Unlike prior models, the proposed CCN does not need additional knowledge input and can easily be adapted to other domains. We conducted experiments on two different datasets with both quantitative and human evaluation to validate the proposed model. Experimental results prove CCN's superiority over a number of existing state-of-the-art text generation models, showing that the cross copy mechanism can successfully enhance dialogue generation performance.
In future work, we will further investigate other content generation problems by leveraging the multi-granularity copying mechanism, with this study serving as the methodological foundation.