Topic-relevant Response Generation using Optimal Transport for an Open-domain Dialog System

Conventional neural generative models tend to generate safe and generic responses that have little semantic connection with previous utterances and can disengage users in a dialog system. To generate relevant responses, we propose a method that employs two types of constraints: a topical constraint and a semantic constraint. Under the hypothesis that a response and its context are more relevant when they share the same topics, the topical constraint encourages the topics of a response to match its context by conditioning response decoding on topic words' embeddings. We also propose a semantic constraint, which encourages a response to be semantically related to its context by regularizing the decoding objective function with a semantic distance. Optimal transport is applied to compute a weighted semantic distance between the representations of a response and its context. Generated responses are evaluated by automatic metrics as well as human judgment, showing that the proposed method can generate more topic-relevant and content-rich responses than conventional models.


Introduction
In the past decade, personal assistants based on spoken dialog systems have shown significant potential and become widespread, arousing great interest in dialog systems. Recently, thanks to access to large public conversation datasets, neural response generation has been investigated for open-domain dialog systems. Neural response generation learns a response generation model under the sequence-to-sequence (seq2seq) framework (Bahdanau et al., 2014). Such a model is trained to optimize the conditional likelihood P(Y|X) of generating a response Y given its context X. This objective function is especially suitable for tasks like machine translation, where the source and its target translation correspond semantically. However, in open-domain dialog response generation, a response can be only weakly relevant to its previous utterances, and there usually is no obvious alignment between previous utterances and the response. Thus, neural generative models tend to generate safe and generic responses that contain little meaningful information, such as "really?" and "I don't know.". These generic responses are usually short and vague, and can seldom provide effective feedback to the other partner in the dialog. This has been identified as one of the major challenges in neural response generation.
To ameliorate the problem of generic responses, we design a model that generates responses relevant to previous utterances. Two constraints are introduced to modify the objective function P(Y|X): 1) a topical constraint and 2) a semantic constraint. Responses that are highly related to their context usually share the context's topic. Thus, the topical constraint conditions the model's decoding process on topic words in the context, which are identified by a topic word sequence labeler. Moreover, a relevant response should be semantically related to its context. The semantic constraint regularizes the model's predicted response by minimizing the semantic distance between the generated response and its context. Optimal transport (Peyré et al., 2019) finds a joint distribution of two distributions that minimizes the Wasserstein distance of converting one distribution into the other; it is applied to the computation of semantic distance in the proposed method. We use both automatic evaluation and human evaluation to verify the effect of our proposed method.

Serban et al. (2016) proposed the Hierarchical Recurrent Encoder-Decoder (HRED) model, which improves the traditional seq2seq model by applying a hierarchical encoder. While its decoder remains the same, HRED's encoder uses an utterance-level RNN and a context-level RNN to model the intra- and inter-utterance dependencies, respectively. Given K previous utterances U = {U_1, ..., U_K} as input context, the utterance-level RNN first converts each utterance U_k = {w_{k,1}, ..., w_{k,N}} into a fixed-length utterance vector u_k; the context-level RNN then takes the sequence of utterance vectors u_1, ..., u_K as input and produces a context vector c from its final hidden state, and c is used to initialize the hidden states of HRED's decoder RNN.
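The hierarchical encoding above can be sketched as follows. This is a minimal illustration: a vanilla tanh RNN stands in for the actual utterance- and context-level recurrent cells, and the dimensions and random inputs are hypothetical.

```python
import numpy as np

# Minimal sketch of HRED's hierarchical encoder (illustrative dimensions);
# a simple tanh RNN stands in for the utterance- and context-level RNNs.
rng = np.random.default_rng(0)
EMB, HID = 8, 16  # illustrative embedding / hidden sizes

def rnn(inputs, W_in, W_h):
    """Run a vanilla tanh RNN over `inputs` and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = np.tanh(W_in @ x + W_h @ h)
    return h

# Separate parameters for the two RNN levels.
W_in_u, W_h_u = rng.normal(size=(HID, EMB)), rng.normal(size=(HID, HID))
W_in_c, W_h_c = rng.normal(size=(HID, HID)), rng.normal(size=(HID, HID))

# A context of K=3 utterances, each a sequence of word embeddings.
context = [rng.normal(size=(n_words, EMB)) for n_words in (4, 6, 5)]

# Utterance-level RNN: each utterance -> fixed-length vector u_k.
utterance_vecs = [rnn(u, W_in_u, W_h_u) for u in context]
# Context-level RNN: sequence u_1..u_K -> context vector c,
# which initializes the decoder's hidden states.
c = rnn(utterance_vecs, W_in_c, W_h_c)
print(c.shape)  # (16,)
```

In the real model the word embeddings are trained jointly and the recurrent cells are typically GRUs or LSTMs; the structure of the two-level encoding is the point here.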
Although better performances can be achieved with attention mechanism, the problem of generic responses remains unsolved because the likelihood-based objective function Eq (1) still solely depends on the input context.

Neural Response Generation Model
The HRED model is used as the baseline model in this work.

Topical and Semantic Constraints
To generate responses that are more relevant to the context, we explore incorporating a topical constraint and a semantic constraint into the HRED.
If a response and its context share the same topic, the response is likely to be relevant to the context. Topic words are keywords that have more impact on the following conversation, and they can dynamically define the topic of a conversation. To make the topic of a response more relevant to its context, we train a sequence labeler to recognize topic words {t_1, ..., t_n} in a context and condition the decoding process on both the context and the topic words. The objective function is then modified as Eq (2), where T = {t_1, ..., t_n} indicates the topic words extracted from the context. To further strengthen the semantic relevance between a generated response and its context, we bring the representations of the two closer in the semantic space (Mikolov et al., 2013). We compute the semantic distance between a response and its context at the sequence level, and regularize the objective function with this distance as in Eq (3).
Here ∆ is the function computing the semantic distance between the response and the context, and Emb(·) denotes the word embeddings of a sequence of tokens. By adding the topical and semantic constraints to the HRED, we gain more control over the generated content.
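Written out, the objectives discussed above might take the following form. This is a reconstruction from the surrounding text, not the paper's exact notation, and the weighting coefficient λ in Eq (3) is an assumption:

```latex
% Eq (1): baseline likelihood objective
\max_\theta \; \log P_\theta(Y \mid X)
% Eq (2): topical constraint -- condition decoding on topic words T
\max_\theta \; \log P_\theta(Y \mid X, T), \qquad T = \{t_1, \dots, t_n\}
% Eq (3): semantic constraint -- regularize with the semantic distance
\max_\theta \; \log P_\theta(Y \mid X, T)
    \;-\; \lambda \, \Delta\bigl(\mathrm{Emb}(Y), \mathrm{Emb}(X)\bigr)
```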

Incorporating Topical Constraint
To encourage the topics of a response to be relevant to the topic of its context, we first pre-train a sequence labeler to extract topic words from the context, then jointly train it with the HRED. We use topic words to refer to a topic.
Here topic words are defined as words that 1) contain meaningful information and 2) are shared by the context and the response. In the example in Table 1, the words "sale" and "live" contain useful information, among which "live" has a higher correlation with the response "I have lived here for about twenty years." Therefore, it is reasonable to consider "live" the topic word in this context-response pair. In order to automatically identify such topic words in a context, a topic word dataset is constructed from a conversation corpus, and the task is formulated as a sequence labeling problem in which each word in the context is labeled as either a topic word or a non-topic word.

Data Construction
Currently, there is no public data with topic word annotations, so such a dataset must be constructed. According to the definition, topic word candidates can first be obtained by extracting keywords from a context and its response, because keywords generally carry rich information. Each keyword can then be further filtered by checking whether the same keyword, or a semantically related word, appears in both the context and the response.
The training data comes from the DailyDialog corpus, which includes 13,118 open-domain human-human text conversations. The data is organized as {context, response} pairs. In the first step of keyword filtering, a public keyword extraction toolkit, YAKE (Campos et al., 2020), is used to extract keywords from both contexts and responses. These keywords are the topic word candidates for the next step.
Since topic words are also expected to be shared by both the context and the response, two strategies are investigated to further filter the keywords extracted by YAKE.
• A hard matching strategy: Only keywords that appear in both a context and its corresponding response are labeled as topic words. In this way, the context and response have the same topic words in the dataset.
• A soft matching strategy: First, word embedding-based pairwise cosine similarity is computed between the context's keywords and the response's keywords; then the context-side words in the top 3 pairs are labeled as topic words. In this way, the context and response are expected to have relevant topics, rather than only identical keywords.

Table 2 shows the statistics of the original and the constructed datasets. The labeled words are used as ground-truth topic words, which should be detected in the context and used in the following response.
The dataset constructed with the hard matching strategy contains far fewer topic-word tokens than the other dataset because of its strict matching requirement.
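The two filtering strategies can be sketched as follows, assuming keywords have already been extracted (e.g. with YAKE) from a context and its response. The toy embedding table is illustrative, not a trained word embedding model:

```python
import numpy as np

# Sketch of the hard and soft topic-word filtering strategies, given
# keyword lists for a context and its response. Embeddings are random
# stand-ins for real word vectors.
rng = np.random.default_rng(1)
vocab = ["sale", "live", "here", "years", "store"]
emb = {w: rng.normal(size=8) for w in vocab}

context_kws = ["sale", "live"]
response_kws = ["live", "years"]

# Hard matching: keywords shared verbatim by context and response.
hard_topics = [w for w in context_kws if w in response_kws]

# Soft matching: rank context/response keyword pairs by cosine similarity
# and keep the context-side words of the top-3 pairs.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pairs = sorted(
    ((cos(emb[c], emb[r]), c) for c in context_kws for r in response_kws),
    reverse=True,
)
soft_topics = sorted({c for _, c in pairs[:3]})

print(hard_topics)  # ['live']
print(soft_topics)
```

With real embeddings, soft matching would also label "live" when the response contains "lived" or another related word, which is exactly what the hard strategy misses.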

BERT-based Sequence Labeler
Sequence labeling is a pattern recognition task that assigns a categorical label to each component in a sequence. In the task of topic word labeling, a model takes as input a sequence of context tokens X = {x_1, ..., x_N} and predicts a sequence of topic word labels Y_topic = {y_1, ..., y_N}. Recent works showed that by fine-tuning a pretrained BERT model, state-of-the-art results can be achieved in sequence labeling tasks (Tsai et al., 2019; Shi and Lin, 2019). Thus, we also build the sequence labeler on top of BERT.
The BERT-based sequence labeler uses the output hidden states from BERT as the contextualized encoding of each word, and applies a fully connected layer followed by a sigmoid function to compute the probability of each word being a topic word.
Cross-entropy loss is used as the loss function for this model.
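The labeling head described above can be sketched as follows. The BERT hidden states are simulated with random vectors here; only the fully connected layer, sigmoid, and binary cross-entropy loss are the point:

```python
import numpy as np

# Sketch of the topic-word labeling head. Real BERT output states are
# simulated with random vectors (illustrative only).
rng = np.random.default_rng(2)
N, H = 6, 32                      # number of tokens, hidden size
hidden = rng.normal(size=(N, H))  # stand-in for BERT's output states

# Fully connected layer + sigmoid -> per-token topic-word probability.
W, b = rng.normal(size=H) * 0.1, 0.0
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
p = sigmoid(hidden @ W + b)       # shape (N,)

# Binary cross-entropy loss against gold labels (1 = topic word).
y = np.array([0, 1, 0, 0, 1, 0], dtype=float)
eps = 1e-9
bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
print(p.shape)  # (6,)
```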

Response generation with topical constraint
To condition response generation on the extracted topic words as in Eq (2), a simple method is proposed to integrate the information from the HRED's encoder with the information from the topic words' embeddings. Let c denote the context vector produced by the HRED's encoder and p_i denote p(x_i is a topic word); the decoder's hidden states are initialized with an updated context vector that combines context information and topic word information, where p_i is calculated by Eq (5).
Here FC(·) is a fully connected layer and Emb(·) is the embedding lookup table.
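One way to realize this combination is sketched below. The exact form of the update is not fully specified in the excerpt, so the FC layer over the concatenation of c and a probability-weighted topic embedding is one plausible reading, not necessarily the paper's exact formulation:

```python
import numpy as np

# Sketch of conditioning the decoder's initial state on topic words.
# The combination FC([c ; weighted topic embedding]) is an assumption.
rng = np.random.default_rng(3)
H, E, N = 16, 8, 5

c = rng.normal(size=H)              # context vector from HRED's encoder
emb = rng.normal(size=(N, E))       # Emb(x_i): embeddings of context tokens
logits = rng.normal(size=N)
p = 1.0 / (1.0 + np.exp(-logits))   # p_i = p(x_i is a topic word)

# Probability-weighted topic representation.
topic_vec = (p[:, None] * emb).sum(axis=0) / p.sum()

# FC(.) merges context and topic information into the updated context
# vector used to initialize the decoder's hidden states.
W = rng.normal(size=(H, H + E)) * 0.1
c_updated = np.tanh(W @ np.concatenate([c, topic_vec]))
print(c_updated.shape)  # (16,)
```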

Incorporating Semantic Constraint
Another approach to generating more relevant responses is to consider the semantic distance between a generated response and its context. In some past works, the semantic distance was calculated from the similarity of sequence-level semantic representations; however, details can be lost in that process. We propose a novel method to calculate the semantic distance from word-level representations. Many word alignments between a response and the context may not be meaningful, such as an alignment between two irrelevant stop words, so we should emphasize the alignments with strong semantic connections. Thus, a semantic distance that considers the weights of different alignments is needed.

Optimal Transport
Assuming X and Y are two distributions over the same space, an optimal transport (OT) problem can be described as finding D_OT(X, Y) = min_{P ∈ Π(X, Y)} ⟨P, C⟩, where Π(X, Y) is the set of joint distributions with marginal distributions X and Y, and each such P is called a transport plan; c(x_i, y_j) is the cost of moving from x_i to y_j, and C is the cost matrix with entries C_{ij} = c(x_i, y_j); ⟨P, C⟩ denotes the element-wise dot product Σ_{ij} P_{ij} C_{ij}. Therefore, an OT problem can be regarded as finding a transport plan which minimizes the OT distance, i.e., the weighted cost of moving from X to Y.

Decoding with Optimal Transport
In the OT problem formulated in Eq (7), the important components are the transport plan and the cost matrix. We compute the OT distance on textual conversation data to provide sequence-level guidance for the decoding process, as shown in Eq (3).
Many popular word embedding representations can be regarded as distributions over a semantic space. When applied to textual conversation data, the values of the cost matrix can be the pairwise cosine distances between the word embeddings of the context and those of the predicted response. For word embeddings that are close in the semantic space, the cosine distance is small, and so is the cost of moving from one embedding to the other.
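Building the cost matrix from cosine distances can be sketched as follows, with random vectors standing in for the actual word embeddings:

```python
import numpy as np

# Sketch of the OT cost matrix for a context/response pair: pairwise
# cosine distance between (toy, randomly drawn) word embeddings.
rng = np.random.default_rng(4)
ctx = rng.normal(size=(4, 8))   # 4 context-token embeddings
rsp = rng.normal(size=(3, 8))   # 3 response-token embeddings

def cosine_cost(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T        # cosine distance, in [0, 2]

C = cosine_cost(ctx, rsp)
print(C.shape)  # (4, 3)
```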
A transport plan, which is a joint distribution over the context word embeddings and the predicted response word embeddings, represents a global semantic matching between the context tokens and the response tokens. To minimize the OT distance, an optimal transport plan assigns large weights to token pairs for which the cost of moving from one word embedding to the other is small. Therefore, the dot product of an optimal transport plan P, which shows the degree of the alignments, and the cost matrix C, which shows the initial semantic distances, can be regarded as a weighted semantic distance ⟨P, C⟩ in which strongly relevant alignments are emphasized. As shown in Figure 1, when matching words in a response with words in the context, optimal transport captures not only the semantic relationships among the tokens but also the degree of those relationships.

Figure 1: Matching words between a response and its context using optimal transport. Darker lines indicate word alignments with stronger semantic connections.
By adding the OT distance into the objective function Eq (3), the model is expected to generate responses that are semantically similar to the context.
We solve the OT problem by approximating it with entropy-regularized OT: D_ε = min_{P ∈ Π(X, Y)} ⟨P, C⟩ − εH(P), where the entropy is computed as H(P) = −Σ_{ij} P_{ij} log P_{ij} and ε is the regularization coefficient. The entropy-regularized OT objective is strongly convex, so it has a unique solution. Theoretically, the solution P_ε of the approximated OT converges to the optimal solution of OT as ε approaches 0 (Genevay, 2019).
Sinkhorn iteration (Cuturi, 2013) is an efficient algorithm for entropy-regularized OT. The algorithm is as follows.
Algorithm 1: Sinkhorn iteration for entropy-regularized OT
Data: cost matrix C, regularization coefficient ε, maximum iteration number max_iter
Result: transport plan P
initialize µ, ν as uniform distributions and v as a vector of ones
K ← exp(−C / ε)
for t = 1, ..., max_iter:
    u ← µ / (Kv)
    v ← ν / (Kᵀu)
P ← diag(u) K diag(v)
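The Sinkhorn iteration can be implemented in a few lines of NumPy; the toy cost matrix below is illustrative:

```python
import numpy as np

# Sinkhorn iteration for entropy-regularized OT (Cuturi, 2013).
# mu, nu are the (uniform) marginals; eps is the regularization
# coefficient; the returned P approximates the optimal transport plan.
def sinkhorn(C, eps=0.5, max_iter=100):
    n, m = C.shape
    mu, nu = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(max_iter):
        u = mu / (K @ v)                 # match row marginals
        v = nu / (K.T @ u)               # match column marginals
    return u[:, None] * K * v[None, :]   # P = diag(u) K diag(v)

rng = np.random.default_rng(5)
C = rng.random((4, 3))                   # toy cost matrix
P = sinkhorn(C)
ot_distance = float((P * C).sum())       # <P, C>: weighted semantic distance
print(np.allclose(P.sum(axis=1), 1 / 4),
      np.allclose(P.sum(axis=0), 1 / 3))  # True True
```

After convergence, the row and column sums of P match the two marginals, so P is a valid transport plan; ⟨P, C⟩ then gives the weighted semantic distance used as the regularizer.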

Experimental Evaluations
To investigate the impact of the datasets constructed in Section 4 on the topic word labeler, the sequence labeler is trained and tested on the two different datasets. To confirm the effect of the proposed methods in improving response quality, experiments are conducted on the DailyDialog corpus.

Experiments on BERT-based Sequence Labeler
The performance of the sequence labeler is evaluated by the automatic metrics of precision, recall, and F1 score. Taking the imbalance of the data into consideration, these metric scores are calculated under the following conditions:
• Binary: Calculating scores regarding topic words.
• Macro: Calculating scores regarding topic words and non-topic words separately, then calculating the unweighted average of the scores of two labels.
• Coverage: Calculating scores regarding topic words without repeated tokens.
The results for the sequence labelers trained on the two different datasets are shown in Table 3. Precision and F1 scores are higher when the model is trained on DailyDialog topic soft, which is reasonable since the ratio of topic words is much higher in DailyDialog topic soft than in DailyDialog topic hard, as shown in Table 2.

The following models are assessed by automatic metrics and human evaluation.
HRED. The HRED model with a hierarchical context encoder and a response decoder. This HRED model uses a multi-head attention layer with 4 heads.
HRED-Topic. The HRED model with a sequence labeler on topic words. There are two sequence labelers trained on different datasets, which are introduced in Section 4 and evaluated in Section 6.1. It has two variations.
• HRED-Topic-Hard. This is trained with DailyDialog topic hard.
• HRED-Topic-Soft. This is trained with DailyDialog topic soft.
HRED-OT. The HRED model using the weighted semantic distance between the context and the response to regularize the objective function, where the semantic distance is calculated by optimal transport. The regularization coefficient ε of OT is 0.5, and the maximum number of iterations is 50.
HRED-Topic+OT. The HRED model with both a sequence labeler and an optimal transport layer. The sequence labeler is trained on DailyDialog topic hard.

Automatic Evaluation
The models are assessed by the following metrics. BLEU-2 calculates the unigram and bigram overlap between a predicted response and the ground-truth response, then takes their average (Papineni et al., 2002).
Embedding similarity calculates the cosine similarity between a predicted response and the ground-truth response. There are different ways of choosing the representation: average, extrema, and greedy, denoted by Emb-A, Emb-E, and Emb-G (Foltz et al., 1998; Forgues et al., 2014; Rus and Lintean, 2012).
Distinct calculates the ratio of unique unigram and bigram entries in a predicted response, denoted by Dist-1 and Dist-2. Distinct scores are calculated both within a single response and over all responses in the corpus, denoted by Intra-Dist and Inter-Dist. Distinct scores reflect the diversity of the predicted responses.

Table 4 shows the automatic evaluation results of all models. All proposed models outperform the baseline HRED in terms of BLEU-2, and HRED-Topic-Hard even surpasses the baseline by one point. On the embedding similarity metrics, HRED-OT performs slightly better than the others, showing that OT has a positive impact on reducing the semantic distance between a response and its context. HRED-Topic-Hard outperforms HRED-Topic-Soft in the experiment. One possible reason is that under the soft construction strategy, data pairs with only a weak connection between a response and its context remain in the dataset. HRED-Topic-Soft is therefore not used hereafter.
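The Distinct metrics above can be sketched in plain Python; the toy corpus is illustrative:

```python
# Sketch of the Distinct-1/2 diversity metrics: ratio of unique unigrams
# or bigrams, computed per response (intra) or over all responses (inter).
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct(responses, n):
    """Inter-Dist-n: unique n-grams / total n-grams across all responses."""
    grams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0

corpus = ["i don t know", "i don t know", "i have lived here for years"]
print(round(distinct(corpus, 1), 3))  # Dist-1 over the corpus
print(round(distinct(corpus, 2), 3))  # Dist-2 over the corpus
```

Repeated generic responses drive these ratios down, which is why Distinct is a useful proxy for the generic-response problem.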

Human Evaluation
The proposed methods are expected to generate topic-relevant responses, a property that cannot be fully assessed by automatic evaluation alone. Hence, human evaluation is also conducted. One hundred conversation samples are prepared; each sample includes responses from all models and is evaluated by 3 annotators from Amazon Mechanical Turk. A topic-relevant response needs to address two aspects: readability and relevance. Hence, responses from each model are scored on these two categories.
It is worth investigating the proportions of acceptable responses and high-quality responses. Two additional criteria are defined as follows.
• Acceptance rate: the proportion of responses whose grammar and relevance are both scored no less than 2.
• High-quality rate: the proportion of responses whose grammar and relevance are both scored 3.

From Table 5, we observe that the proposed model incorporating both constraints surpasses the others in both grammar and relevance scores. Compared to the baseline model, this integrated model generates more acceptable responses but fewer high-quality responses.

Generic responses can be reasonable for many contexts, and they are less likely to contain grammar mistakes; indeed, looking through the human evaluation examples, we found that some generic responses were also scored as high quality. Hence, it is necessary to calculate the scores among non-generic, content-rich responses. A generic response is typically short, which tends to yield higher grammar scores, and it is reasonable to assume that a longer response is more content-rich and less likely to be generic. We therefore define content-rich responses as responses longer than the average length of human responses in the corpus, which is 13 tokens. The ratio of content-rich responses for all models is shown in Table 6.

Model            Ratio of content-rich responses
HRED             26%
HRED-Topic       35%
HRED-OT          26%
HRED-Topic+OT    30%

Human evaluation results computed over these content-rich responses are shown in Table 7. We observe that the integrated model gains in both acceptance rate and high-quality rate, while the baseline model drops in both. The results show that the model integrating both proposed constraints can generate more relevant and content-rich responses.

Related Work
Our proposed work is related to several past works. To generate more content-rich responses, a seq2seq model with additional topic and semantic constraints, based on a topic model (Griffiths et al., 2005) and cosine similarity (Arora et al., 2016), has been proposed, showing promising results in the content richness of generated responses (Baheti et al., 2018). Xing et al. also used topic information as side information and showed good results, where the topics are obtained from a pre-trained LDA model (Xing et al., 2017). To improve the model structure, Mou et al. proposed a model that generates responses backward and forward starting from a keyword, i.e., a noun with the highest pointwise mutual information (PMI) score used as a hard constraint, outperforming the traditional sequence-to-sequence model.

Conclusion
A topic-relevant response should share a common or similar topic with its context and be semantically related to it. To generate topic-relevant responses and avoid generic responses like "I don't know", two constraints placed on the decoding process are proposed. The first is a topical constraint. To extract topic information from the context, a sequence labeler is trained on two differently constructed datasets with topic word annotations. The topic word information predicted by the sequence labeler is then integrated into the context vector generated by the hierarchical encoder to guide the decoding process.
The second constraint is a semantic constraint. The model is designed to generate responses semantically related to the context by adding the semantic distance into its global loss. In this work, the semantic distance is calculated by optimal transport, which gives the optimal alignments between a context and its response in a semantic space. The calculated semantic distance is used as a regularizer for the objective function, giving sequence-level guidance to the decoding process.
The experimental evaluations demonstrated that combining these constraints leads to responses that are content-rich and more relevant to the context, while maintaining good grammaticality for long responses. This work has two distinct features: 1) using topic words to dynamically refer to topics; and 2) providing sequence-level semantic guidance while retaining access to word-level similarity values.