Consistent Response Generation with Controlled Specificity

We propose a method to control the specificity of responses while maintaining consistency with the input utterances. We first design a metric based on pointwise mutual information that measures the degree of co-occurrence between an utterance and a response. To control the specificity of generated responses, we add distant supervision based on this co-occurrence degree and a PMI-based word prediction mechanism to a sequence-to-sequence model. With these mechanisms, our model outputs words with the optimal specificity for a given specificity control variable. In experiments on open-domain dialogue corpora, automatic and human evaluation results confirm that our model controls the specificity of responses more sensitively than the conventional model and can generate highly consistent responses.


Introduction
Open-domain response generation is the task of generating human-like responses in chit-chat conversation. Many end-to-end response generation models (Vinyals and Le, 2015; Sordoni et al., 2015; Mei et al., 2017) apply a sequence-to-sequence (Seq2Seq) (Sutskever et al., 2014) architecture, which allows the generation of fluent responses. However, the Seq2Seq model tends to generate safe but overly typical responses (i.e., dull responses), such as "Yes" and "I don't understand." To solve this problem, several studies proposed methods to increase the specificity of the generated responses (Li et al., 2016a; Zhang et al., 2018b; Jiang et al., 2019); however, simply maximizing the specificity of the response leads to a degenerate solution that generates specific but inconsistent responses.
In this study, we define the conditions that an automatically generated response is expected to satisfy as (i) being consistent with the input utterance, (ii) being specific enough to provide informative content, and (iii) being controllable. As shown in Figure 1, in human conversation, an utterance can have various responses of different specificity (Csáky et al., 2019), and humans control the specificity of their responses as necessary. Thus, instead of only generating highly specific responses, a response generation model should make specificity controllable.
We propose a method to control the specificity of responses while maintaining their consistency with the utterances. Following the observation that a response uniquely co-occurring with a specific utterance in a corpus is both specific and consistent for that utterance, we design a metric called MaxPMI, which measures the co-occurrence degree between an utterance and a response on the basis of positive pointwise mutual information (PPMI). We apply distant supervision to our model using automatically annotated MaxPMI scores of the training set. At inference time, the specificity of the generated responses can be controlled by inputting a desired specificity level. We also propose a method to automatically set the specificity level by estimating the maximum MaxPMI score for an input utterance, which allows the generation of a response that has the maximum mutual information with the input.
We conducted both automatic and human evaluations using the DailyDialog and Twitter corpora. The results confirmed that our method largely outperformed the methods of previous studies and achieved sensitive control over the specificity of the output responses.

Related Work
Previous studies have addressed the dull response problem of Seq2seq models. Li et al. (2016a) rerank the N-best generated responses using an objective function that maximizes the mutual information between the utterance and the generated sentences. Because this method is a post-processing step, it ceases to be effective if there are no appropriate response candidates among the N-best responses. To directly improve the specificity of each generated response, previous studies devised training mechanisms for Seq2seq models that penalize the generation of dull responses and thereby train models to generate specific responses. Yao et al. (2016) and Li et al. (2016b) apply reinforcement learning, and Xu et al. (2017) and Zhang et al. (2018b) apply generative adversarial networks, to directly generate specific responses. Based on the hypothesis that the specificity of a sentence increases with the number of low-frequency words, Nakamura et al. (2019) and Jiang et al. (2019) propose loss functions weighted by word frequency. In contrast, to ensure both specificity and consistency, Takayama and Arase (2019) propose a model that directly promotes the generation of words that co-occur with the uttered sentence on the basis of PPMI. Their model includes a mechanism for deciding, at each decoding step, whether or not to generate words that highly co-occur with the utterance. In this study, we apply this method in our model to proactively generate specific words in a response.

Controlling the properties of generated responses is also related to our study. Xu et al. (2019) and Ko et al. (2019) allow for the control of dialogue acts, length, and specificity of responses; however, these methods are resource intensive and require an external annotated corpus. In contrast, SC-Seq2Seq (Zhang et al., 2018a) achieves control of response specificity without depending on external resources, which makes it the most relevant to our study.
Moreover, SC-Seq2Seq applies distant supervision, but uses word frequency in responses as its measure of specificity. At inference time, SC-Seq2seq requires a desired specificity level to be input for the response.
We measure specificity on the basis of PPMI between an utterance and a response; hence, our method can maintain both specificity and consistency with the utterance. Additionally, our method can estimate the maximum specificity for each input utterance and automatically adjust the specificity of generated responses.

Proposed method
The proposed method is depicted in Figure 2. First, a label that indicates the co-occurrence degree between an utterance and a response is automatically annotated using the MaxPMI score (Section 3.1). The model generates sentences on the basis of precomputed PPMI and MaxPMI scores (Section 3.2). Training is performed in a distant supervision framework using utterance-response pairs and their precomputed MaxPMI scores (Section 3.3). At inference time, responses are generated either by inputting a manually determined specificity level or by automatically estimating a specificity level from the input utterance (Section 3.4).
Since we aim to explicitly control the amount of information in responses, we use the decoder architecture of Takayama and Arase (2019), which has an output gating mechanism that controls whether or not to generate specific words at each decoding time step.

MaxPMI: Co-occurrence measure between response and utterance
We propose a simple PPMI-based co-occurrence measure, called MaxPMI, which is based on the observation that a consistent and highly specific response contains words that strongly co-occur with the utterance. First, the PPMI of each word pair is calculated in advance using the entire training corpus. Let X = {x_1, x_2, ..., x_{|X|}} be the word sequence of an utterance and Y = {y_1, y_2, ..., y_{|Y|}} be the word sequence of a response. If the probabilities of word x appearing in utterances and responses are p_X(x) and p_Y(x), respectively, and the probability of words x and y appearing together in an utterance-response pair is p(x, y), then PPMI is calculated as follows:

PPMI(x, y) = max( log_2 [ p(x, y) / (p_X(x) · p_Y(y)) ], 0 ).

MaxPMI is defined as the maximum PPMI over all word pairs in an utterance-response pair:

MaxPMI(X, Y) = max_{x ∈ X, y ∈ Y} PPMI(x, y).

When training the model, MaxPMI is normalized to the range [0, 1] using min-max normalization.
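As a rough illustration, the PPMI table and MaxPMI score described above could be computed as in the following sketch. The function names and the counting scheme (document-level co-occurrence within each utterance-response pair) are our assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def build_ppmi(pairs):
    """Precompute PPMI(x, y) over utterance-response word pairs.

    `pairs` is a list of (utterance_words, response_words) tuples.
    Probabilities are estimated by counting, per pair, whether a word
    appears on the utterance side, the response side, or both sides.
    """
    x_counts, y_counts, xy_counts = Counter(), Counter(), Counter()
    n = len(pairs)
    for xs, ys in pairs:
        for x in set(xs):
            x_counts[x] += 1
        for y in set(ys):
            y_counts[y] += 1
        for x in set(xs):
            for y in set(ys):
                xy_counts[(x, y)] += 1
    ppmi = {}
    for (x, y), c in xy_counts.items():
        pmi = math.log2((c / n) / ((x_counts[x] / n) * (y_counts[y] / n)))
        ppmi[(x, y)] = max(pmi, 0.0)  # clip negative PMI to zero
    return ppmi

def max_pmi(ppmi, xs, ys):
    """MaxPMI(X, Y): the highest PPMI over all word pairs (x, y)."""
    return max((ppmi.get((x, y), 0.0) for x in xs for y in ys), default=0.0)
```

In practice, the PPMI table would be computed once over the training corpus and the per-pair MaxPMI scores min-max normalized to [0, 1] before being used as distant-supervision labels.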

Model Architecture
Our model is based on Seq2seq architecture, which consists of an encoder and decoder, as follows.
Encoder As in a standard Seq2Seq model, the tokens of the input sentence are first vectorized by the embedding layer, and the input sentence is then encoded with gated recurrent units (GRU) (Cho et al., 2014) to obtain the vector h_GRU. In addition, the proposed method includes a multilayer perceptron (MLP) that encodes the input MaxPMI score, MaxPMI(X, Y), as h_s. Subsequently, h_GRU and h_s are concatenated to form a vector h_e = {h_GRU; h_s}, which is input to the decoder. The vector h_s conveys to the decoder the level of specificity at which the response should be generated.
Decoder The decoder has the same architecture as that of Takayama and Arase (2019), which promotes the generation of words that strongly co-occur with the input utterance. Let V be the vocabulary of the decoder. The word co-occurrence degree d_v between a word v ∈ V and an input sentence X is defined as follows:

d_v = max_{x ∈ X} PPMI(x, v).

The decoder first receives a vector v_f = [d_1, ..., d_{|V|}] ∈ R^{|V|} that contains the word co-occurrence degrees of all vocabulary words. It then encodes v_f into a vector h_v using a multilayer perceptron (MLP).
The initial state h = {h_e; h_v} of the decoder is the concatenation of the encoder output h_e and h_v. Consequently, the decoder can obtain information about which words easily co-occur with the input.
In addition, v_f is added, with weighting, to the output vector π_i of the decoder at each time step i to amplify the output probability of words that have high mutual information with the input sentence. The final output π̃_i of the decoder is given as follows:

π̃_i = π_i + λ_i · v_f,

where the generation of specific words is controlled by the parameter λ_i. We employ a gating mechanism using a sigmoid function (See et al., 2017) to determine the value of λ_i. Although previous literature notes that the vanishing gradient problem can be caused by a sigmoid function (Goldberg and Hirst, 2017, p. 46), See et al. (2017) have shown that sigmoid-based gating is highly stable. λ_i is computed from the decoder's current intermediate state h_i as follows:

λ_i = σ(W_gate · h_i + b_gate),

where W_gate is a trainable weight matrix and b_gate is a bias term.
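The gated amplification step can be sketched as below with NumPy. The shapes (a scalar gate produced from a weight vector) and function names are illustrative assumptions; the actual model operates on batched decoder states.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_output(pi_i, v_f, h_i, w_gate, b_gate):
    """Amplify decoder output scores with co-occurrence degrees.

    pi_i   : (|V|,) decoder output scores at time step i
    v_f    : (|V|,) word co-occurrence degrees d_v for the utterance
    h_i    : (H,)   decoder intermediate state at time step i
    w_gate : (H,)   gate weights; b_gate: scalar bias
    The scalar gate lambda_i = sigmoid(w_gate . h_i + b_gate) decides
    how strongly co-occurring words are promoted at this step.
    """
    lam = sigmoid(w_gate @ h_i + b_gate)  # gate value in (0, 1)
    return pi_i + lam * v_f
```

A softmax over the amplified scores would then give the output distribution at step i.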

Distant Supervision
The MaxPMI score of each utterance-response pair (X, Y) in the training corpus is calculated beforehand for distant supervision (Section 3.1). These scores are then input to the decoder as h_s during training. The cross-entropy loss is used as the loss function:

L(θ) = − Σ_{(X, Y) ∈ D} log p(Y | X, MaxPMI(X, Y); θ),

where D denotes the training set and θ denotes the model parameters. Intuitively, this loss function allows the model to learn what response should be generated conditioned on an utterance and a specificity level.
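The per-response term of this loss is an ordinary negative log-likelihood over the gold tokens, as in the minimal sketch below. The function name is hypothetical; the conditioning on X and on the MaxPMI label s is assumed to have already produced the token probabilities.

```python
import math

def response_nll(token_probs):
    """Negative log-likelihood of one gold response.

    `token_probs` holds the model probabilities p(y_t | y_<t, X, s)
    of each gold response token, where s = MaxPMI(X, Y) is the
    distant-supervision label fed to the encoder as h_s.
    """
    return -sum(math.log(p) for p in token_probs)
```

Summing this quantity over the training set D gives the loss L(θ) minimized during training.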

Inference
At inference time, we can control the specificity of a response by inputting a score s ∈ [0, 1] to the model. A larger s makes the response more specific, i.e., the response contains words that frequently co-occur in the utterances and responses of the training corpus. Users of our conversation model can determine the desired specificity according to their use cases.
Situations also arise in which users prefer automatic control of the response specificity rather than controlling it themselves. The appropriate value of s depends on the input utterance, i.e., some utterances allow specific responses while others allow only typical ones. For example, the utterance in Figure 1 may have specific responses as depicted, but the utterance "Hello." most likely has only typical responses like "Hi." Hence, we propose a method for estimating the appropriate s to generate the most specific response possible for the utterance. We define the upper bound of MaxPMI, s_max, for an input sentence X as:

s_max = max_{y ∈ V} max_{x ∈ X} PPMI(x, y),

which can be calculated using the precomputed PPMI values and is normalized in the same way as the training labels. By using s_max, the most specific response among the possible responses of varying specificity to X is expected to be generated (referred to as information-maximization decoding).
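The s_max estimate only requires a scan of the precomputed PPMI table over the response vocabulary, as in the sketch below. The function name and the explicit (s_min, s_range) normalization parameters are illustrative assumptions; they stand in for the min-max statistics saved from training.

```python
def estimate_s_max(ppmi, utterance_words, vocab, s_min=0.0, s_range=1.0):
    """Estimate the normalized upper bound of MaxPMI for an utterance X.

    Finds the vocabulary word that co-occurs most strongly (by PPMI)
    with any word of X, then applies the same min-max normalization
    used on the training labels, clipping the result to [0, 1].
    """
    raw = max(
        (ppmi.get((x, y), 0.0) for x in utterance_words for y in vocab),
        default=0.0,
    )
    return min(max((raw - s_min) / s_range, 0.0), 1.0)
```

The returned value would then be fed to the model as the specificity score s for information-maximization decoding.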

Experimental Settings
To evaluate whether our model can control the specificity of the responses while maintaining their consistency with the utterances, we conducted response-generation experiments using Japanese and English chit-chat dialogue corpora.

Experiment Corpora
We used two corpora, Twitter (Japanese) and DailyDialog (English). The details of each corpus are as follows.
Twitter We crawled online conversations on Japanese Twitter using "@" mentions as clues. A single-turn dialogue corpus was constructed by treating a tweet and its reply as an utterance-response pair. The sizes of the training/validation/test sets were 1,383,424 / 24,123 / 25,010 utterance-response pairs, respectively. Each utterance-response pair was divided into subwords using BertJapaneseTokenizer (bert-base-japanese) from the transformers library (version 2.5.1).
DailyDialog This corpus was constructed by Li et al. (2017) by crawling websites that teach English dialogues for daily use. It consists of multi-turn dialogues, which we converted to single-turn dialogues by treating two consecutive utterances as an utterance-response pair. The sizes of the training/validation/test sets were 76,052 / 7,069 / 6,740 utterance-response pairs, respectively. Each utterance-response pair was divided into subwords using BertTokenizer (bert-base-uncased) from the transformers library.
As pre-processing, subwords occurring fewer than 50 times were excluded from the PPMI calculation for both corpora.

Comparison Methods
We compared our model with previous models. The baseline is the standard Seq2Seq model (Seq2Seq). We also compared our model with SC-Seq2Seq (Zhang et al., 2018a), as it is the most relevant method for controlling the specificity of responses. SC-Seq2Seq is a response generation model that can control the specificity of output sentences using distant supervision. It hypothesizes that the lower the frequencies of the words in a sentence, the higher the specificity of the sentence. As a measure of sentence specificity, it uses a frequency-based metric: the inverse frequency of words. Moreover, SC-Seq2Seq has a word prediction mechanism based on a Gaussian kernel layer in addition to the output layer of the decoder. Unlike our model, which takes into account the co-occurrence between utterances and responses, this word prediction layer takes into account only the rarity of words. At inference time, the specificity of a response is controlled by inputting a specificity score ∈ [0, 1].

Metrics for Automatic Evaluation
We employed several automatic-evaluation metrics typically used in the evaluation of conversation systems.
Metrics for Validity First, we evaluated the validity of the generated responses in comparison with the reference responses using BLEU and NIST. BLEU (Papineni et al., 2002) measures the correspondence between the n-grams in generated responses and those in the references. Liu et al. (2016) empirically showed that BLEU has a higher Spearman's correlation with 5-scale human evaluation than some other reference-based metrics in experiments on an English Twitter corpus. NIST (Doddington, 2002) also measures the correspondence between generated responses and references. Unlike BLEU, NIST places lower weights on frequent n-grams, i.e., NIST regards content words as more important than function words. Thus, we regard NIST as more suitable for evaluating the specificity of responses. We used the Natural Language Toolkit (NLTK) to calculate BLEU and NIST scores.
Metrics for Diversity Second, we evaluated the diversity of the generated responses using dist and ent. Dist (Li et al., 2016a) is defined as the number of distinct n-grams in the generated responses divided by the total number of generated tokens. In contrast, ent (Zhang et al., 2018b) considers the frequency of the n-grams in the generated responses as follows:

ent = − (1 / Σ_{w ∈ Y} F(w)) Σ_{w ∈ Y} F(w) log ( F(w) / Σ_{w' ∈ Y} F(w') ),

where Y is the set of n-grams output by the system, and F(w) is the frequency of n-gram w. Compared with dist, which simply counts the types of n-grams used in responses, ent focuses on the specificity of the responses.
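Both diversity metrics are straightforward to compute from the generated responses, as in the sketch below. The function names are illustrative; dist follows the paper's description (distinct n-grams over total generated tokens), and ent is the frequency-weighted entropy of the n-gram distribution.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """dist-n: distinct n-grams divided by total generated tokens."""
    all_ngrams = [g for r in responses for g in ngrams(r, n)]
    total_tokens = sum(len(r) for r in responses)
    return len(set(all_ngrams)) / total_tokens if total_tokens else 0.0

def ent_n(responses, n):
    """ent-n: entropy of the empirical n-gram distribution."""
    counts = Counter(g for r in responses for g in ngrams(r, n))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A system that repeats the same few n-grams scores low on both metrics, while ent additionally rewards a flatter (less peaky) n-gram distribution.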
Metrics for Fluency Finally, we evaluated the repetition rate (Le et al., 2017) on the test set, which measures the meaningless repetition of words:

rep = (1 / N) Σ_{i=1}^{N} max( r(Ŷ_i) − r(Y_i), 0 ),

where Ŷ_i is the i-th generated sentence, Y_i is its reference, and N is the total number of test sentences. The function r(·) measures repetition as the difference between the number of words and the number of unique words in a sentence:

r(Y) = len(Y) − |set(Y)|,

where Y is a sentence, len(Y) is the number of words in Y, and set(Y) removes the duplicate words in Y.
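The repetition measure can be sketched as follows. The per-sentence function r mirrors the definition above; the aggregation over the test set (averaging the excess repetition of each generated sentence over its reference, clipped at zero) is our reconstruction and may differ in detail from Le et al. (2017).

```python
def r(tokens):
    """Repetition count: total words minus unique words in a sentence."""
    return len(tokens) - len(set(tokens))

def repetition_rate(generated, references):
    """Average excess repetition of generated responses over references.

    A generated sentence is only penalized for repeating words more
    than its reference does (reconstruction; aggregation assumed).
    """
    assert len(generated) == len(references)
    n = len(generated)
    return sum(max(r(g) - r(y), 0) for g, y in zip(generated, references)) / n
```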

Human Evaluation Settings
Because the appropriate responses to a given utterance are diverse, human evaluation is crucial to properly evaluate conversation systems. We conducted a human evaluation using the Japanese Twitter corpus. Specifically, we recruited six raters via crowd-sourcing, all of whom were Japanese native speakers and active users of Twitter. The raters evaluated the quality of 300 responses generated for utterances randomly sampled from the test set. All raters annotated the same set in parallel; each rater evaluated all the systems. In addition, we shuffled the set of responses to each utterance so that the raters could not tell which model produced each response. The raters were recruited using Lancers, a popular Japanese crowd-sourcing service. The evaluation criteria were the same as those used in Zhang et al. (2018a): +2: the response is not only semantically consistent and grammatical, but also specific; +1: the response is grammatical and can be used as a response to the utterance, but is too trivial (e.g., "I don't know"); +0: the response is semantically inconsistent or ungrammatical (e.g., contains grammatical errors). After collecting results from the raters, we adopted the results of five raters and excluded one who had extremely low agreement with the others.

Model Settings
We used Adam (Kingma and Ba, 2015) as the optimizer for training all models, with the learning rate set to 0.0002. We also used gradient clipping with a threshold of 5 to avoid the exploding gradient problem. For all models, the numbers of dimensions of the hidden and embedding layers were 512 and 256, respectively. Training was performed for up to 40 epochs on the Twitter corpus and 200 epochs on the DailyDialog corpus, and evaluation was conducted using the model with the highest BLEU score on the validation set. SC-Seq2Seq has a hyper-parameter σ², which determines the variance of the Gaussian kernel layer. σ² was set to 0.1 for Twitter and 0.2 for DailyDialog, chosen from {0.1, 0.2, 0.5, 1.0} to maximize the BLEU score on the validation set.
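For reference, gradient clipping with a threshold of 5 rescales the gradients whenever their global L2 norm exceeds that value, as in the following NumPy sketch (the actual experiments presumably used the framework's built-in utility, e.g. PyTorch's clip_grad_norm_; this standalone function is illustrative only).

```python
import math
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm
    is at most max_norm; gradients below the threshold pass through."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```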
All code used in the experiments was written with PyTorch (version 1.0.0). We used a single GPU (NVIDIA Tesla V100 SXM2, 32 GB memory) for both training and testing.

Automatic Evaluation Results
The automatic evaluation results on the test sets are presented in Tables 1 (Twitter) and 2 (DailyDialog), where the last columns show the average number of words per response. The proposed method (s = s_max; information-maximization decoding) achieved the highest scores on the validity and diversity metrics (BLEU, NIST, dist, and ent) in most cases. These results confirm that information-maximization decoding can generate a highly specific response by estimating the appropriate specificity level s. Compared with the other methods, our model achieved much higher BLEU and NIST scores on DailyDialog. We hypothesize that this is because our model explicitly incorporates the co-occurrence statistics of words, which may complement the training of Seq2seq on a smaller corpus.
SC-Seq2seq showed BLEU and NIST scores comparable to those of our model on the Twitter corpus; however, its dist and ent scores were as low as those of Seq2seq. In contrast, SC-seq2seq scored high on dist and ent on the DailyDialog corpus, but its BLEU and NIST scores were lower than those of the standard Seq2seq. These results indicate that the effectiveness of SC-Seq2seq is domain dependent. We conjecture that this is caused by its specificity estimation, which is based on word frequencies regardless of utterances and responses and is thus easily affected by the occurrence of rare words.
As an adverse effect of the proposed method, its repetition rate is higher than those of Seq2Seq and SC-Seq2Seq on both corpora. The longer average response length and higher NIST and BLEU scores of the proposed model indicate that highly co-occurring words (in references) are generated repeatedly. This is because the probability of generating such words remains high regardless of the state of the decoder, so they are generated repeatedly. We will address this problem by adjusting v_f at each time step in future work.

Controllability Evaluation Results
We evaluated the controllability of the specificity of the generated responses using the automatic evaluation metrics. For each utterance of the validation set, responses were generated using our model and SC-Seq2Seq, respectively.
The results for Twitter are summarized in Table 3. Our model shows more sensitive variation when changing s than SC-Seq2Seq. In particular, in the range s ≤ 0.5, as s increases, dist, which indicates diversity, and NIST, which indicates validity, also increase. However, in the range s ≥ 0.5, almost all scores decrease as s increases. These results show that an appropriate response cannot be generated when the input specificity level s is beyond the possible range for the input utterance. The repetition rate ('rep' in Table 3) and the average response length clearly increased as s became larger. This is because a large s makes the decoder prefer words that co-occur with the utterance; consequently, it repeatedly generated highly specific words for the utterance.
The proposed method (s = s_max) achieved the highest scores on all of BLEU, NIST, dist, and ent. Further, it achieved a lower repetition rate than the proposed method with s = 0.5, which performed best among the fixed settings of s. These results show that the optimal s for each input utterance can be estimated by information-maximization decoding. The same tendency was also observed on the DailyDialog corpus, whose results are omitted due to space limitations.

Human Evaluation Results
The human evaluation results on the Twitter test set are presented in Table 4. Except for the proposed method (s = 0.0), the Kappa values for all methods exceed 0.4. These values are similar to those obtained in the human evaluations of Zhang et al. (2018a). The low Kappa value of 0.02 for the proposed method (s = 0.0) is caused by its frequent output of very short responses such as "?" and "huh?", which makes it difficult to determine whether a response is acceptable.
The proposed method (s = 0.5) and the proposed method (s = s_max) receive more "+2" ratings than the proposed method (s = 0.0), which shows that our model generates more specific responses as s increases. The change in the ratio of "+2" ratings with respect to the change in s is more pronounced for our model than for SC-Seq2Seq; thus, our model provides more sensitive specificity control than SC-Seq2Seq. However, both the proposed method and SC-Seq2Seq show a significant increase in the rate of "+0" ratings as s increases, compared with Seq2seq. This is because the fluency of responses deteriorated when the models were forced to output a larger number of specific words, which negatively affected the language generation ability of the decoder. In particular, as mentioned in Section 5.2, many responses might have lost their fluency because of repeated words.
To address this problem, we tried a simple heuristic that switches between the proposed method and the plain Seq2seq. If the proportion of unique words in a response generated by our model falls below a threshold T (we set T to 0.95), i.e., the response contains repeated words, we switch to the plain Seq2seq and use its response instead. The results obtained after applying this heuristic to the proposed method (s = s_max) and to SC-Seq2Seq (s = 1.0) are listed in Table 4 as the proposed method (hybrid) and SC-Seq2Seq (hybrid), respectively. For both hybrids, the ratio of "+0" ratings decreases by more than 15 percentage points, while that of "+2" remains almost unchanged. This problem will be addressed with a more sophisticated approach in future work.

Table 5 presents two examples of generated responses sampled from the test set of the Twitter corpus. In the range s ≥ 0.5, our model generated highly specific responses to the utterances. However, it repeatedly generated the same phrase when s was too large, e.g., the response at s = 0.8 in the second example. As mentioned in Section 5.1, this is an adverse effect of forcing the model to output more specific words than possible. In contrast, information-maximization decoding (s = s_max) avoids this problem by adaptively setting an appropriate s for each input utterance. SC-Seq2Seq often produced more specific responses than Seq2Seq, as shown in the second example. However, the change in the specificity of its responses is limited even when a large value of s is input, as in the first example. Specifically, the response of SC-Seq2Seq (s = 0.8) in the first example ignores the input utterance and is thus inconsistent. We conjecture that this is because the specificity in SC-Seq2Seq is estimated regardless of utterances and responses.
For the same example, our model can output words associated with the utterance, such as "cat", "movie", and "cute".
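The hybrid switching heuristic described above amounts to a small post-processing rule, which could be sketched as follows (function name and token-list interface are illustrative assumptions):

```python
def hybrid_response(proposed, fallback, threshold=0.95):
    """Fall back to the plain Seq2seq response when the proposed
    response repeats words.

    `proposed` and `fallback` are token lists from the specificity-
    controlled model and the plain Seq2seq model, respectively. If the
    proportion of unique tokens in `proposed` is below `threshold`
    (T = 0.95 in the experiments), the fallback response is used.
    """
    if proposed and len(set(proposed)) / len(proposed) < threshold:
        return fallback
    return proposed
```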

Conclusion
We empirically showed that the co-occurrence relationship between the words in an utterance and those in its response helps to control specificity in response generation. The conventional specificity control model often generates responses with little consistency with the utterance. In contrast, our model can control the specificity of responses while maintaining their consistency with the utterance.
As future work, we will improve the proposed method to maintain the fluency of responses by addressing the repeated-word problem. Further, the appropriate specificity level of a response depends on the previous utterances and responses, i.e., conversation systems that always return highly specific responses are annoying. Hence, we intend to propose a method for adjusting the specificity level in consideration of the conversation history.