Relevant and Informative Response Generation using Pointwise Mutual Information

A sequence-to-sequence model tends to generate generic responses that carry little information about the input utterance. To solve this problem, we propose a neural model that generates relevant and informative responses. Our model has a simple architecture that enables easy application to existing neural dialogue models. Specifically, using positive pointwise mutual information, it first identifies keywords that frequently co-occur in responses to a given utterance. Then, the model encourages the decoder to use these keywords in response generation. Experimental results demonstrate that our model successfully diversifies responses relative to previous models.


Introduction
Neural networks are a common approach to building chat-bots. Vinyals and Le (2015) proposed a neural dialogue model using sequence-to-sequence (Seq2Seq) networks (Sutskever et al., 2014) and achieved fluent response generation. Because a Seq2Seq model uses a word-by-word loss function at training time, any words outside the reference are penalized equally. Consequently, the Seq2Seq model tends to generate generic responses that consist of frequent words, such as "Yes" and "I don't know." This is a central concern in neural dialogue generation. To tackle this problem, Li et al. (2016) proposed a model that considers the mutual dependency between an utterance and its response, modeled by maximum mutual information (MMI). However, their model disregards the informativeness of responses, which is also important for the user experience of chat-bots.
To solve this problem, we propose a response generation model that outputs diverse words while preserving relevance to the input utterance. In our model, Positive Pointwise Mutual Information (PPMI) identifies, from a large-scale conversational corpus, keywords that are likely to appear in the response to an input utterance. Then, the model modifies the loss function of a Seq2Seq model to reward responses that use the identified keywords. In order to calculate this loss from the words output by the decoder, we need to sample words from the probability distribution of the output layer. Hence, we apply the Gumbel-Softmax trick (Jang et al., 2017) as a differentiable pseudo-sampling method.
Experiments using a Japanese dialogue corpus crawled from Twitter and the OpenSubtitles corpus revealed that the proposed model outperformed the model of Li et al. (2016) on all automatic evaluation metrics for correspondence to references and diversity of outputs.

Related Work
The generic response problem has been actively studied. Yao et al. (2016) and Nakamura et al. (2019) proposed models that constrain decoders to directly suppress the generation of frequent words. Yao et al. (2016) diversified responses with a loss function that prefers words with high inverse document frequency values. Nakamura et al. (2019) proposed a loss function that adds weights based on the inverse of the word frequency. Xing et al. (2017) proposed a model using topic words extracted from utterances; their model ensembles words predicted from the topic words with the words predicted by the decoder. All of the methods described above focus only on the amount of information in a response. Therefore, the generated responses tend to lack relevance to the input utterances. MMI-bidi (Li et al., 2016) solves this problem by approximating the PMI between the utterance Q and the generated response R:

score(R) = log P(R|Q) + λ log P(Q|R).    (1)

Here, both P(R|Q) and P(Q|R) are computed by independent Seq2Seq models. Specifically, the N-best candidate responses generated by the former model are re-ranked by Equation (1). MMI-bidi exhibits strong performance in diversifying responses while preserving relevance to the input utterance. However, its effect depends on the diversity of the N-best candidate responses: the more diverse these candidates are, the more MMI-bidi can improve.
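As a sketch, the re-ranking step of MMI-bidi can be written as follows. The two scoring callables and the weight `lam` are placeholders standing in for the trained forward and backward Seq2Seq models:

```python
def mmi_bidi_rerank(candidates, log_p_r_given_q, log_p_q_given_r, lam=0.5):
    """Re-rank N-best responses by log P(R|Q) + lam * log P(Q|R).

    candidates       -- N-best responses from the forward Seq2Seq model
    log_p_r_given_q  -- callable giving log P(R|Q) (forward model)
    log_p_q_given_r  -- callable giving log P(Q|R) (backward model)
    lam              -- interpolation weight (placeholder value)
    """
    scored = [(r, log_p_r_given_q(r) + lam * log_p_q_given_r(r))
              for r in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [r for r, _ in scored]
```

Because the candidates come from the forward model alone, the backward score can only re-order what is already there, which is why diverse N-best lists matter.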
Proposed Model

Figure 1 shows the outline of the proposed model. It first identifies keywords that strongly co-occur between utterances and their responses in a training corpus using PPMI (Section 3.1). The decoder then uses Gumbel-Softmax to sample words in the output layer (Section 3.3). Finally, it computes the proportion of output words matching the keywords and adds weights to the loss function (Section 3.4).

Keyword Retrieval Based on Positive Pointwise Mutual Information
The keyword handler retrieves words that are likely to appear in the response to a given input utterance based on PPMI, calculated in advance from the entire training corpus. Let P_Q(x) and P_R(x) be the probabilities that the word x appears in an utterance and in a response sentence, respectively. Also, let P(x, y) be the probability that the words x and y appear in an utterance-response pair. PPMI is calculated as follows:

PPMI(x, y) = max( log( P(x, y) / (P_Q(x) P_R(y)) ), 0 ).
The pair (x, y) and its PPMI score are saved in the PPMI matrix in Figure 1. At the time of response generation, the keyword handler looks up the PPMI matrix. Let the word set of an input utterance be Q = {q_1, q_2, ..., q_L}, and let the vocabulary of the decoder be V_R. Keyword-scores are calculated for all words in V_R by aggregating the PPMI scores PPMI(q_l, y) over the utterance words q_l. Then, the top-k words are set as the keywords V_Pred used in the loss function.
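A minimal Python sketch of this step, assuming sentence-level co-occurrence counting and summation as the aggregation over utterance words (both are assumptions; the paper's exact keyword-score formula is not reproduced here):

```python
import math
from collections import Counter

def build_ppmi(pairs):
    """Compute PPMI(x, y) = max(log(P(x, y) / (P_Q(x) * P_R(y))), 0)
    from (utterance_words, response_words) pairs, counting each word
    at most once per sentence."""
    n = len(pairs)
    count_q, count_r, count_xy = Counter(), Counter(), Counter()
    for q_words, r_words in pairs:
        q_set, r_set = set(q_words), set(r_words)
        count_q.update(q_set)
        count_r.update(r_set)
        count_xy.update((x, y) for x in q_set for y in r_set)
    ppmi = {}
    for (x, y), c in count_xy.items():
        val = math.log((c / n) / ((count_q[x] / n) * (count_r[y] / n)))
        if val > 0:  # keep only positive PMI values
            ppmi[(x, y)] = val
    return ppmi

def retrieve_keywords(ppmi, utterance_words, k=3):
    """Score response-side words by summing PPMI over the utterance words
    (summation is an assumed aggregation) and return the top-k keywords."""
    scores = Counter()
    for (x, y), val in ppmi.items():
        if x in utterance_words:
            scores[y] += val
    return [w for w, _ in scores.most_common(k)]
```

In practice the PPMI table would be precomputed over the full training corpus and stored as the PPMI matrix of Figure 1.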

Decoding Response Sentences using Retrieved Keywords
The decoder first receives a vector v_f consisting of the keyword-scores for all words in the vocabulary and non-linearly transforms v_f through a multilayer perceptron (MLP). This vector is concatenated with the output of the encoder and set as the initial state of the decoder. By doing so, we expect the decoder to take the keyword-scores into account. In order to directly boost the probability of outputting the keywords, we add a weighted v_f to the decoder output vector π_i at each time step i. The final decoder output π̂_i is given by the following equation:

π̂_i = π_i + λ_i v_f,    (2)

where λ_i balances the effects of the decoder output and v_f. λ_i is calculated from the current intermediate state h_i of the decoder:

λ_i = σ(W_gate h_i + b_gate),

where W_gate is a trainable weight matrix, b_gate is a bias term, and σ(·) is the sigmoid function.
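A pure-Python sketch of the gating step, assuming the additive mixing form π̂_i = π_i + λ_i v_f reconstructed from the text and a scalar gate (both assumptions; a real model would compute this with framework tensors):

```python
import math

def gated_output(pi_i, v_f, h_i, w_gate, b_gate):
    """Compute pi_hat_i = pi_i + lambda_i * v_f, where the scalar gate is
    lambda_i = sigmoid(w_gate . h_i + b_gate)."""
    z = sum(w * h for w, h in zip(w_gate, h_i)) + b_gate
    lam = 1.0 / (1.0 + math.exp(-z))       # sigmoid gate in (0, 1)
    return [p + lam * v for p, v in zip(pi_i, v_f)]
```

The gate lets the model decide per time step how strongly the keyword-scores should influence the output distribution.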

Pseudo-sampling of Generated Words using Gumbel-Softmax
In order to determine whether the decoder generated words in V_Pred, it is necessary to sample words generated by the decoder. However, sampling based on argmax, which is generally used in the decoder, disallows back-propagation because of its discrete nature. Jang et al. (2017) proposed Gumbel-Softmax, which performs pseudo-sampling from a probability distribution while allowing back-propagation. Gumbel-Softmax performs the following calculation on a probability distribution π over k classes (corresponding to the output layer of the decoder):

y_i = exp((log π_i + g_i) / τ) / Σ_{j=1}^{k} exp((log π_j + g_j) / τ).

Here, τ is a hyperparameter called temperature. A smaller τ makes the vector closer to one-hot, but the variance of the gradient becomes larger. g_i is obtained by the following calculation using u_i ~ Uniform(0, 1):

g_i = −log(−log u_i).

In the proposed model, Gumbel-Softmax is applied to the final decoder output vector π̂_i of Equation (2) at each time step i. Summing the resulting vectors over time steps, we obtain the differentiable pseudo-bag-of-words vector B.
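The sampling step can be sketched in plain Python for exposition (a real implementation would use a deep-learning framework so that gradients flow through the operation):

```python
import math
import random

def gumbel_softmax(pi, tau=1.0):
    """Pseudo-sample from the distribution pi with temperature tau:
    y_i = softmax((log pi_i + g_i) / tau), where g_i = -log(-log u_i)
    and u_i ~ Uniform(0, 1)."""
    g = [-math.log(-math.log(random.random() + 1e-20)) for _ in pi]
    logits = [(math.log(p) + gi) / tau for p, gi in zip(pi, g)]
    m = max(logits)                          # numerical stabilization
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Summing the `gumbel_softmax` outputs over the decoding time steps yields the pseudo-bag-of-words vector B used in the loss.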

Loss function
We design a loss function l_v whose value decreases as more of the words contained in V_Pred are generated, so that the decoder outputs more words that strongly co-occur with the input utterance. Specifically, letting t(b_n) be the word corresponding to the n-th index of B, l_v is defined as follows:

l_v = − Σ_{n : t(b_n) ∈ V_Pred} min(b_n, 1).    (3)

We use min(b_n, 1) in Equation (3) to avoid adding extra reward when a keyword is generated multiple times; this suppresses the decoder outputting the same word many times.
Finally, the loss function L is defined as a linear interpolation of the cross-entropy loss l_CE and l_v:

L = (1 − α) l_CE + α l_v,

where α is a hyperparameter that balances the degree of the reward based on the keywords.
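A toy sketch of this loss computation, assuming B is a dense bag-of-words vector aligned with the vocabulary and that the interpolation takes the form L = (1 − α) l_CE + α l_v (a reconstruction from the text):

```python
def keyword_loss(bow, vocab, keywords):
    """l_v: negated sum of min(b_n, 1) over indices whose word is in
    V_Pred, so repeating a keyword earns no extra reward."""
    return -sum(min(b, 1.0) for b, w in zip(bow, vocab) if w in keywords)

def total_loss(l_ce, l_v, alpha=0.5):
    """Linear interpolation of the cross-entropy loss and the keyword loss."""
    return (1.0 - alpha) * l_ce + alpha * l_v
```

Note how a bag-of-words count of 2.0 for a keyword contributes the same reward as a count of 1.0, which is the effect of the min(b_n, 1) clipping.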

Experiments
We empirically evaluate how well our model avoids generic responses and generates relevant and informative responses.

Datasets
We used two datasets, OpenSubtitles (English) and Twitter (Japanese). The details of each dataset are as follows.
OpenSubtitles OpenSubtitles (Tiedemann, 2009) is a large scale open-domain corpus composed of movie subtitles.
Following Vinyals and Le (2015) and Li et al. (2016), we assumed that each line of the subtitles represents an independent utterance and constructed a single-turn dialogue corpus by regarding two consecutive utterances as an utterance-response pair. We randomly sampled 2 million utterance-response pairs. All sentences were tokenized using the Punkt Sentence Tokenizer of nltk.
Twitter We crawled conversations on Japanese Twitter using "@" mentions as a clue. A single-turn dialogue corpus was constructed by regarding a tweet and its reply as an utterance-response pair. The dataset consists of about 1.3 million utterance-response pairs. All sentences were tokenized by MeCab.
In both datasets, 10k utterance-response pairs were separated as validation data, another 10k were separated as test data, and the rest were used as training data.

Comparison Methods
We compared our model to previous models. The baseline is the standard Seq2Seq model (Seq2Seq). We also compared with MMI-bidi (Seq2Seq+MMI) because it is the most relevant method for diversifying responses. In addition, we combined our model with MMI-bidi (Proposed+MMI) to see whether our model contributes to the diversification of the N-best candidates.

Evaluation Metrics
We employed several automatic evaluation metrics. BLEU and NIST measure the validity of generated sentences in comparison with references. BLEU (Papineni et al., 2002) measures the correspondence between n-grams in generated responses and those in reference sentences. Following Papineni et al. (2002), we used the average of the BLEU scores from 1-gram to 4-gram. NIST (Doddington, 2002) also measures the correspondence between generated responses and reference sentences. Unlike BLEU, NIST places lower weights on frequent n-grams, i.e., NIST regards content words as more important than function words. We used the average of the NIST scores from 1-gram to 5-gram. In addition, dist and ent measure the diversity of generated responses. Dist (Li et al., 2016) is defined as the number of distinct n-grams in the generated responses divided by the total number of generated tokens. In contrast, ent (Zhang et al., 2018) considers the frequency of n-grams in the generated responses:

ent = − (1 / Σ_{w∈X} F(w)) Σ_{w∈X} F(w) log( F(w) / Σ_{w'∈X} F(w') ),

where X is the set of n-grams output by the system, and F(w) computes the frequency of each n-gram.
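The two diversity metrics can be sketched as follows, assuming corpus-level n-gram counting over tokenized responses:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """dist-n: distinct n-grams divided by total generated n-grams."""
    grams = [g for r in responses for g in ngrams(r, n)]
    return len(set(grams)) / len(grams) if grams else 0.0

def ent_n(responses, n):
    """ent-n: entropy of the n-gram frequency distribution, so that
    frequently repeated n-grams lower the score."""
    freq = Counter(g for r in responses for g in ngrams(r, n))
    total = sum(freq.values())
    return -sum((f / total) * math.log(f / total) for f in freq.values())
```

Unlike dist, ent penalizes a system that outputs many distinct n-grams but concentrates its probability mass on a few frequent ones.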
In this paper, we focus on automatic evaluation; human evaluation is left as future work.

Parameter Settings
We implemented the encoder and decoder of every model using 1-layer GRUs with 512-dimensional hidden states. The only exception is the decoder of the proposed model, which used a 1024-dimensional GRU because its initial state is the concatenation of the keyword-score vector and the output of the encoder (512 dimensions each). Both the encoder and decoder had a 256-dimensional word embedding layer.
The vocabulary consisted of words that appeared more than 15 times in the training data. Words that occurred less than 15 times were replaced with the "<unk>" token. The vocabulary size was 41.5k for the Twitter model and 20.9k for the OpenSubtitles model.

Results and Error Analysis
The left sides of Tables 1 and 2 show the BLEU, NIST, dist, and ent scores for OpenSubtitles and Twitter, respectively. Our model (Proposed) outperformed Seq2Seq and MMI-bidi (Seq2Seq+MMI) on all evaluation metrics across the datasets. Furthermore, our model combined with MMI-bidi (Proposed+MMI) achieved the best performance, except for NIST on the Twitter dataset. This result demonstrates that our method successfully generates diverse responses, which effectively improves the N-best candidates re-ranked by MMI-bidi. Notably, the improvements on NIST, which appreciates less frequent n-grams, support the idea that the proposed model improves the informativeness of responses. The improvement is larger on the Twitter dataset, where the proposed method (Proposed) achieved a NIST score 0.265 points higher than Seq2Seq, even though MMI-bidi is inferior to Seq2Seq on this metric.
Example responses generated by Proposed+MMI and Seq2Seq+MMI on OpenSubtitles are shown in Table 3. The first three examples show that the proposed model generates more content words relevant to the content words in the utterance, whereas Seq2Seq+MMI ends up generating less informative responses built from generic words. The fourth and fifth examples show that the proposed model generated responses with little relevance to the input, although they were more informative than the responses generated by Seq2Seq+MMI.
The last two examples show a drawback of the proposed model, i.e., over-generation of the same word. For quantitative evaluation, we computed the repetition rate (Le et al., 2017) on the test data, which measures the meaningless repetition of words. The repetition rate is defined as:

RepetitionRate = Σ_{i=1}^{N} r(y_i) / Σ_{i=1}^{N} r(Y_i),

where y_i is the i-th generated sentence in the test data, Y_i is its reference, and N is the total number of test sentences. The function r(·) measures repetition as the difference between the number of words and the number of unique words in a sentence: r(X) = len(X) − len(set(X)), where X is the list of words in a sentence, len(X) computes the number of items in X, and set(X) removes duplicate items from X. The average lengths of generated responses and the repetition rates are shown on the right sides of Tables 1 and 2. The results show that the proposed models (Proposed and Proposed+MMI) tend to generate longer responses than Seq2Seq, but their repetition rates are also higher. This may be caused by the time-invariant keyword-scores, despite the fact that the decoder output changes over time. In the future, we will update the keyword-score vector over time to avoid repetition in responses.
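The metric can be sketched as follows; only the per-sentence function r(·) is given explicitly in the text, so the corpus-level normalization by the references is an assumption:

```python
def repetition_count(tokens):
    """r(X) = len(X) - len(set(X)): number of duplicated word tokens."""
    return len(tokens) - len(set(tokens))

def repetition_rate(generated, references):
    """Ratio of repetitions in generated sentences to repetitions in the
    references (the normalization form is an assumption)."""
    num = sum(repetition_count(y) for y in generated)
    den = sum(repetition_count(Y) for Y in references)
    return num / den if den else float(num)
```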

Conclusion
Aiming at generating diverse responses while preserving relevance to the input, we proposed a model that identifies keywords using PPMI and promotes their generation in the decoder. Evaluation results on English and Japanese conversational corpora show that, in comparison with the model of Li et al. (2016), the proposed model achieves better performance in terms of correspondence to references and diversity of outputs. On the other hand, we found that the proposed model has a tendency toward over-generation of the same word.
As future work, we will conduct human evaluation and qualitative analysis. We will also investigate the effects of the hyperparameter α on overall performance, and plan to develop a mechanism for suppressing over-generation.