Importance of Search and Evaluation Strategies in Neural Dialogue Modeling

We investigate the impact of search strategies in neural dialogue modeling. We first compare two standard search algorithms, greedy and beam search, as well as our newly proposed iterative beam search which produces a more diverse set of candidate responses. We evaluate these strategies in realistic full conversations with humans and propose a model-based Bayesian calibration to address annotator bias. These conversations are analyzed using two automatic metrics: log-probabilities assigned by the model and utterance diversity. Our experiments reveal that better search algorithms lead to higher rated conversations. However, finding the optimal selection mechanism to choose from a more diverse set of candidates is still an open question.


Introduction
There are three high-level steps to building a neural autoregressive sequence model for dialogue modeling, inspired by the work of Vinyals and Le (2015). First, decide on a network architecture which will consume previous utterances as well as any extra information, such as speaker identifiers. Second, choose a learning strategy. Finally, decide on a search algorithm, as neural autoregressive sequence models do not admit a tractable, exact approach for generating the most likely response.
Recent research in neural dialogue modeling has often focused on the first two aspects. A number of variants of sequence-to-sequence models (Kalchbrenner and Blunsom, 2013) have been proposed for dialogue modeling in recent years, including hierarchical models and transformers (Mazaré et al., 2018; Yang et al., 2018). These advances in network architectures have often been accompanied by advanced learning algorithms. Serban et al. (2017) introduce latent variables to their earlier hierarchical model and train it to maximize the variational lower bound, similar to Zhao et al. (2017), who propose to build a neural dialogue model as a conditional variational autoencoder. Xu et al. (2017) and Li et al. (2017b) train neural dialogue models as conditional generative adversarial networks (Mirza and Osindero, 2014). These two learning algorithms, variational lower-bound maximization and adversarial learning, have been combined into a single model by Shen et al. (2018), followed by Gu et al. (2018).
Despite these abundant efforts on modeling and learning, search has received relatively little attention (Dinan et al., 2019). Most of the work on search has focused on training an additional neural network that provides a supplementary score to guide either greedy or beam search. Li et al. (2015) propose a maximum mutual information criterion for decoding using a reverse model. This has been extended by Li et al. (2017a), where an extra neural network is trained to predict an arbitrary reward given a partial hypothesis and used during decoding. Similarly, Zemlyanskiy and Sha (2018) train a neural network that predicts the other participant's personality given a partial conversation and use its predictability as an auxiliary score for re-ranking a set of candidate responses. None of these approaches studies how the choice of the underlying search algorithm, rather than its scoring function, affects the quality of the neural dialogue model.
In this paper, we investigate the effects of varying search and selection strategies on the quality of generated dialogue utterances. We start with an attention-based sequence-to-sequence model trained on the recently released PersonaChat dataset (Zhang et al., 2018). We evaluate three search algorithms: greedy search, beam search, and iterative beam search, the last of which we design based on earlier work by Batra et al. (2012). These algorithms are qualitatively different from each other in the size of the subspace over which they search for the best response.
We compare all of these alternatives using two families of metrics. First, we use human evaluation of full, multi-turn conversations. The resulting distribution of annotators' scores has large variance, which is rarely discussed or analyzed by other groups. This variance comes from each annotator's individual attitude towards and understanding of the task, which we call annotator bias. In order to address this bias, we propose a model-based Bayesian calibration that explicitly factors in each annotator's bias and the algorithm's underlying score, and we report the posterior mean and variance of each algorithm's score. Additionally, we compare automatic metrics that capture the model's intrinsic preference (log-probability) and the diversity of responses (distinct-n).
We make two key observations from the experiments. A better search strategy can indeed generate responses that are both intrinsically preferred by the underlying model and diverse, without re-designing or re-training the neural dialogue model. However, this observation does not necessarily carry over to human evaluation, as the best performing strategy according to these automatic metrics was not the best strategy according to human annotators. These results highlight both the importance of search algorithms as well as the difficulty in evaluating neural dialogue systems in a realistic, full conversation setup.
We will make the trained models, code and human evaluation transcripts publicly available. Randomly sampled transcripts for each strategy are provided in two additional pages of examples, and all transcripts are given in the additional materials; we encourage everyone to read them.

Neural dialogue modeling
Since Vinyals and Le (2015), neural autoregressive sequence models based on sequence-to-sequence learning have become one of the most widely studied approaches to dialogue modeling (see, e.g., Serban et al., 2017; Zhao et al., 2017; Xu et al., 2017; Li et al., 2016, 2017a; Zemlyanskiy and Sha, 2018; Zhang et al., 2018; Miller et al., 2017; Shen et al., 2018; Gu et al., 2018). In this approach, a neural sequence model is used to model a conditional distribution over responses given a context, which consists of previous utterances by both the model and its partner in the conversation as well as any other information about the speaker.

Neural autoregressive sequence modeling
A neural autoregressive sequence model learns the conditional distribution over all possible responses given the context.
Each conditional distribution is modelled by a neural network, and popular choices include recurrent neural networks (Mikolov et al., 2010), convolutional networks (Dauphin et al., 2016; Gehring et al., 2017) and self-attention (Sukhbaatar et al., 2015; Vaswani et al., 2017). We explore search strategies and fix the model to a recurrent neural network.
Learning: Maximum Likelihood Each example in a training set D consists of auxiliary information or context U (such as a persona profile or external knowledge context) and a sequence of utterances, each of which is marked with a speaker tag, i.e., $C = (U, (Y^{s_1}_1, Y^{s_2}_2, \ldots, Y^{s_L}_L))$, where $Y^{s}_{l}$ is the utterance from the $l$-th turn by a speaker $s \in \{a, b\}$. The conditional log-probability assigned to this example by a neural sequence model is then written as

$$\log p(C) = \sum_{l=1}^{L} \log p(Y^{s_l}_l \mid Y^{s_l}_{<l}, Y^{\bar{s}_l}_{<l}, U), \quad (1)$$

where $\bar{s} = a$ if $s = b$ and $\bar{s} = b$ otherwise.
Learning maximizes the log-probabilities of all the conversations in the training set,

$$\hat{\theta} = \arg\max_{\theta} \frac{1}{|D|} \sum_{C \in D} \log p_{\theta}(C), \quad (2)$$

which is often done using stochastic gradient descent with backpropagation (Rumelhart et al., 1985).
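To make this factorization concrete, the following sketch accumulates per-token log-probabilities over a conversation. The next-token "model" interface (a callable returning a probability for every vocabulary word given the current prefix, the dialogue history and the context U) is our own simplification for illustration, not the paper's implementation.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: model(prefix, history, context) -> {word: probability}.
NextTokenModel = Callable[[Sequence[str], List[Sequence[str]], str], Dict[str, float]]


def conversation_log_prob(model: NextTokenModel,
                          utterances: List[Sequence[str]],
                          context: str) -> float:
    """Sum of log p(y_t | y_<t, Y_{<l}, U) over every token of every utterance,
    i.e. the conversation-level objective log p(C)."""
    total = 0.0
    for l, utterance in enumerate(utterances):
        history = utterances[:l]          # all previous utterances by both speakers
        prefix: List[str] = []            # y_<t within the current utterance
        for token in utterance:
            probs = model(prefix, history, context)
            total += math.log(probs[token])
            prefix.append(token)
    return total


# Toy usage with a uniform distribution over a three-word vocabulary.
vocab = ["hi", "there", "<eos>"]
uniform: NextTokenModel = lambda prefix, history, context: {w: 1.0 / len(vocab) for w in vocab}
print(conversation_log_prob(uniform, [["hi", "<eos>"], ["hi", "there", "<eos>"]], "persona: i have two dogs"))
```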

Inference (generation)
In this paper, we generate a response to the current state of the conversation (but do not attempt to plan ahead to future exchanges) by maximizing

$$\log p(Y \mid Y^{s}_{<l}, Y^{\bar{s}}_{<l}, U) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, Y^{s}_{<l}, Y^{\bar{s}}_{<l}, U).$$
Unfortunately, it is intractable to solve this problem due to the exponentially-growing space of all possible responses w.r.t. the maximum length T . It is thus necessary to resort to approximate search algorithms.
Greedy search Greedy search has been the search algorithm of choice in recent papers on neural dialogue modeling (Gu et al., 2018; Zhao et al., 2017; Xu et al., 2017; Zhang et al., 2018). It moves from left to right, selecting one token at a time by simply choosing the most likely token at the current time step:

$$\hat{y}_t = \arg\max_{v \in V} \log p(v \mid \hat{y}_{<t}, Y^{s}_{<l}, Y^{\bar{s}}_{<l}, U).$$

Greedy search has been found to be significantly suboptimal in machine translation, where similar neural sequence models are frequently used.
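A minimal sketch of greedy decoding under the same hypothetical next-token interface as in the previous sketch; the max_len cap and the <eos> token name are our own assumptions.

```python
from typing import List, Sequence


def greedy_search(model, history: List[Sequence[str]], context: str,
                  max_len: int = 40, eos: str = "<eos>") -> List[str]:
    """Pick the single most probable token at each step until <eos> or max_len."""
    response: List[str] = []
    for _ in range(max_len):
        probs = model(response, history, context)    # distribution over the vocabulary
        next_token = max(probs, key=probs.get)        # argmax_v log p(v | ...)
        response.append(next_token)
        if next_token == eos:
            break
    return response
```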
Beam search Instead of maintaining a single hypothesis at a time, as in greedy search above, at time step t beam search maintains a set of K hypotheses

$$\mathcal{H}_t = \{(\hat{y}^1_1, \ldots, \hat{y}^1_t), \ldots, (\hat{y}^K_1, \ldots, \hat{y}^K_t)\}. \quad (3)$$

Each hypothesis $h^k_t \in \mathcal{H}_t$, $k \in \{1, \ldots, K\}$, is expanded with every possible next token $v$ from the vocabulary $V$ to form candidate hypotheses. Each candidate is of the form

$$\hat{h}^{k,v}_{t+1} = h^k_t \circ v = (\hat{y}^k_1, \ldots, \hat{y}^k_t, v) \quad (4)$$

and is assigned the score

$$s(\hat{h}^{k,v}_{t+1}) = s(h^k_t) + \log p(v \mid h^k_t, Y^{s}_{<l}, Y^{\bar{s}}_{<l}, U). \quad (5)$$

The new hypothesis set of K hypotheses is then constructed as

$$\mathcal{H}_{t+1} = \underset{\hat{h}^{k,v}_{t+1}}{\operatorname{arg\,top}\text{-}K}\ s(\hat{h}^{k,v}_{t+1}). \quad (6)$$

From the new hypothesis set, we find and copy finalized hypotheses (sequences ending with the special token $\langle\mathrm{eos}\rangle$ for "end of sequence") to a candidate sequence set, that is,

$$\mathcal{M}_{t+1} = \{h \in \mathcal{H}_{t+1} \mid \text{the last token of } h \text{ is } \langle\mathrm{eos}\rangle\}.$$

Beam search terminates when $|\cup_{t'=1}^{t} \mathcal{M}_{t'}| \geq K'$, where $K'$ is the maximum number of candidate sequences to be returned, or when $t \geq L_{\max}$, where $L_{\max}$ is the maximum length allowed for each candidate sequence. When terminated, beam search returns all the candidate sequences in $\mathcal{M} = \cup_{t'=1}^{t} \mathcal{M}_{t'}$. One can increase the size of the subspace over which beam search searches for a response, and the size of $\mathcal{M}$, by changing the hyper-parameters $K$, $K'$ and $L_{\max}$. However, beam search is known to suffer from the problem that most of the hypotheses discovered in $\mathcal{M}$ are near each other in the response space (Li et al., 2015, 2016). For tasks such as dialogue modeling, which are much more open-ended than e.g. machine translation, this is particularly troublesome, as many high-quality responses may be missing from the beam.

Final sequence selection We consider search strategies to produce a set of candidate responses for the model to choose from. While greedy search provides only a single sequence, beam search generates a candidate set of size $|\mathcal{M}|$. It is usual practice to use the score $s(h)$ from the search to select the final sequence, but it is an open question whether there are better selection strategies for choosing among these final candidate responses.
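For comparison with greedy decoding, here is a compact and simplified sketch of the beam search described above, again over the same hypothetical next-token interface; the length penalty and n-gram blocking used in the experiments are omitted here.

```python
import math
from typing import List, Sequence, Tuple


def beam_search(model, history: List[Sequence[str]], context: str,
                K: int = 5, K_prime: int = 15,
                max_len: int = 40, eos: str = "<eos>") -> List[Tuple[List[str], float]]:
    """Return up to K_prime finished (sequence, log-probability) candidates."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]   # current hypothesis set H_t
    finished: List[Tuple[List[str], float]] = []          # candidate sequence set M

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            probs = model(tokens, history, context)
            for v, p in probs.items():                                    # expand with every token (Eq. (4))
                candidates.append((tokens + [v], score + math.log(p)))    # accumulate the score (Eq. (5))
        candidates.sort(key=lambda c: c[1], reverse=True)                 # keep the K best (Eq. (6))
        beams = []
        for tokens, score in candidates[:K]:
            if tokens[-1] == eos:
                finished.append((tokens, score))                          # move finished hypotheses to M
            else:
                beams.append((tokens, score))
        if len(finished) >= K_prime or not beams:
            break
    return finished[:K_prime]
```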
Avoiding repeating n-grams Although it has not, to our knowledge, been reported in a formal publication in the context of neural dialogue modeling, Paulus et al. (2017) and Klein et al. (2017) implement so-called n-gram blocking. In n-gram blocking, a hypothesis in a beam $\mathcal{H}_t$ is discarded if it contains an n-gram that appears more than once within it.
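A sketch of the n-gram blocking check; this illustrates the idea only and is not the implementation of Paulus et al. (2017) or Klein et al. (2017).

```python
from typing import Sequence


def has_repeated_ngram(tokens: Sequence[str], n: int = 3) -> bool:
    """True if any n-gram occurs more than once in the hypothesis, in which
    case n-gram blocking would discard it from the beam."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False
```

During beam search, each candidate expansion would be tested with such a check and dropped (or given a score of negative infinity) if it repeats an n-gram.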

Uncovering hidden responses
We now propose an improved search strategy. To address the locality issue of beam search, we propose an iterative beam search that radically increases the size of the search space without introducing much computational overhead, inspired by earlier work on diverse beam search (Vijayakumar et al., 2018; Batra et al., 2012; Li et al., 2016).

Iterative beam search
The search space over which beam search has operated can be characterized by the union of all partial hypothesis sets $\mathcal{H}_t$ in Eq. (3),

$$S_0 = \cup_{t=1}^{L_{\max}} \mathcal{H}_t,$$

where we use the subscript 0 to indicate that beam search has been run without any other constraint. Re-running beam search with an increased beam width K would result in a search space that overlaps significantly with $S_0$, and would not give us much benefit relative to the increase in computation.
Instead, we keep the beam size K constant but run multiple iterations of beam search while ensuring that any previously explored space

$$\bar{S}_{<l} = \cup_{l'=0}^{l-1} S_{l'}$$

is not included in a subsequent iteration of beam search. This is done by setting the score $s(\hat{h}^{k}_{t+1})$ of a candidate hypothesis in Eq. (5) to negative infinity when this candidate is included in $\bar{S}_{<l}$. We relax this inclusion criterion by using a non-binary dissimilarity metric, and say that the candidate is included in $\bar{S}_{<l}$ if

$$\min_{h \in \bar{S}_{<l}} \Delta(\hat{h}^{k}_{t+1}, h) < \epsilon, \quad (7)$$

where $\Delta$ is a string dissimilarity measure, such as the Hamming distance used in this work, and $\epsilon$ is a similarity threshold. This procedure ensures that a new partial hypothesis set of beam search in the $l$-th iteration minimally overlaps with any part of the search space explored earlier during the first $l-1$ iterations of beam search. By running this iteration multiple times, we end up with a set of top hypotheses, one from each iteration of beam search, where the best hypothesis of each iteration is selected according to, for instance, the log-probability assigned by the model. We build the final candidate set $\mathcal{M}$ as the set of all these best hypotheses from the beam search iterations.
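The following sketch spells out the sequential form of this procedure (the parallelized variant is described next). It reuses the hypothetical next-token interface from the earlier sketches; the Hamming-distance convention for unequal lengths and the value of the threshold are our own assumptions.

```python
import math
from typing import List, Sequence, Tuple


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Hamming distance; extra positions of the longer sequence
    count as mismatches (our own convention)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def constrained_beam_search(model, history, context, explored,
                            K: int = 5, epsilon: int = 2,
                            max_len: int = 40, eos: str = "<eos>"):
    """One beam-search iteration that discards any partial hypothesis lying
    within distance epsilon of the previously explored space (Eq. (7))."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished, visited = [], []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            probs = model(tokens, history, context)
            for v, p in probs.items():
                cand = tokens + [v]
                if any(hamming_distance(cand, h) < epsilon for h in explored):
                    continue                      # equivalent to setting the score to -inf
                candidates.append((cand, score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:K]:
            visited.append(tokens)                # grows this iteration's search space S_l
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    return finished, visited


def iterative_beam_search(model, history, context,
                          n_iterations: int = 15, **kwargs):
    """Collect the best finished hypothesis from each constrained iteration into M."""
    explored, final_candidates = [], []
    for _ in range(n_iterations):
        finished, visited = constrained_beam_search(model, history, context,
                                                    explored, **kwargs)
        if finished:
            final_candidates.append(max(finished, key=lambda c: c[1]))
        explored.extend(visited)
    return final_candidates
```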
Practical implementation A major issue with iterative beam search in its naive form is that it requires running beam search multiple times, when even a single run of beam search can be prohibitively slow in an interactive environment, such as dialogue generation. We address this computational issue by performing these iterations of beam search simultaneously in parallel. At each time step of the search, we create the sets of candidate hypotheses for all iterations in parallel, and go through these candidate sets in sequence from the (l = 0)-th iteration to the last iteration, eliminating those candidates that satisfy the criterion in Eq. (7). We justify this parallelized approach by defining the similarity measure $\Delta$ to be always larger than the threshold $\epsilon$ when the previous hypothesis $h$ is longer than $\hat{h}^{k}_{t+1}$ in Eq. (7).

Dialogue evaluation
Broadly, there are two ways to evaluate a neural dialogue model. The first approach is to use a set of (often human-generated) reference responses and compare a single generated response against them (Serban et al., 2015; Liu et al., 2016). There are several methods for this comparison: (1) measure the perplexity of reference responses under the neural dialogue model, (2) compute a string match-based metric of a generated response against reference responses, and (3) use human annotators to compare model-generated responses against reference or other models' responses. None of these approaches captures the effectiveness of a neural sequence model in conducting a full conversation, because the model's responses are computed given a human-written context: the model never sees its own responses in the dialogue history, only gold responses. We concentrate on a second approach for evaluation, in which a neural dialogue model has a multi-turn conversation with a human partner (or annotator) (Zhang et al., 2018; Zemlyanskiy and Sha, 2018; Dinan et al., 2019). Unlike the first approach, it requires active human interaction, as a conversation almost always deviates from previously collected data even with the same auxiliary information (U in Eq. (1)). This evaluation strategy reflects both how well a neural dialogue model generates a response given a correct context and how well it adapts to a dynamic conversation; the latter is not measured by the first strategy, where the model only has to generate a single response.

Human evaluation of a full conversation
An annotator is asked to have a conversation with a randomly selected model (search strategy) for at least five turns. At the end of the conversation, we ask the annotator three sets of questions: an overall score for the conversation on a 1-4 scale, and two sets of binary judgments marking which utterance pairs were good and which were bad. The first, overall score allows us to draw a conclusion about which algorithm makes a better conversation overall. We use a 4-point scale in order to avoid having a "catch-all" category in the answers (Dalal et al., 2014). The latter two sets of judgments are collected to investigate the relationship between the overall impression and the quality of each utterance pair.

Bayesian calibration
Although human evaluation is desirable, raw scores collected from annotators are difficult to use directly due to annotator bias: some annotators are more generous while others are quite harsh, as recently reported by Zhang et al. (2018) and Zemlyanskiy and Sha (2018). We propose using Bayesian inference as a framework to account for the bias of each annotator, and describe two instances of this framework below.
1-4 star rating of a conversation We treat both the unobserved score $M_i$ of each model, in our case each search algorithm, and the unobserved bias $B_j$ of each annotator as latent variables. The score of the $i$-th model follows the distribution $\mu_i \sim \mathcal{U}(1, 4)$ and $M_i \sim \mathcal{N}(\mu_i, 1^2)$, where $\mathcal{U}$ and $\mathcal{N}$ are uniform and normal distributions, respectively. This states that a priori each model is equally likely to be good or bad. The annotator bias follows $B_j \sim \mathcal{N}(0, 1^2)$, where we assume that each annotator has no bias a priori.
Given the model score $M_i$ and annotator bias $B_j$, the conditional distribution over an observed score $S_{ij}$ given by the $j$-th annotator to the $i$-th model is then

$$S_{ij} \sim \mathcal{N}(M_i + B_j, 1^2).$$

Due to the nature of human evaluation, only a few of the $S_{ij}$'s are observed. Figure 1 shows the graphical model described above.
The goal of inference in this case is to compute the posterior mean and variance

$$\mathbb{E}[M_i \mid O] \quad \text{and} \quad \mathbb{V}[M_i \mid O], \quad (8)$$

where $O$ is the set of observed scores.
Binary rating of an utterance When an annotator labels pairs of utterances from the conversation with a binary score $\{0, 1\}$ (such as whether that pair was a "good" exchange), we further take into account the turn bias $T_k \sim \mathcal{N}(0, 1^2)$. As we use a Bernoulli distribution for each observed score rather than a 1-4 rating, we modify the prior of the model scores accordingly. The distribution of an observed utterance-pair score is then $S_{ijk} \sim \mathcal{B}(\mathrm{sigmoid}(M_i + B_j + T_k))$, where $\mathcal{B}$ is a Bernoulli distribution. The goal of inference is then to compute

$$\mathbb{E}[\mathrm{sigmoid}(M_i) \mid O] \quad \text{and} \quad \mathbb{V}[\mathrm{sigmoid}(M_i) \mid O], \quad (9)$$

which estimate the average number of positively labelled utterance-pairs given the $i$-th model and the uncertainty in this estimate, respectively.
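Below is a minimal sketch of the 1-4 star calibration model. The paper specifies the priors and that a NUTS sampler is used; the choice of NumPyro, the unit-variance Gaussian observation noise, and the toy data are our own assumptions for illustration.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS


def calibration_model(model_idx, annotator_idx, scores=None,
                      n_models=3, n_annotators=5):
    # Prior: each search strategy is a priori uniformly good or bad on the 1-4 scale.
    mu = numpyro.sample("mu", dist.Uniform(1.0, 4.0).expand([n_models]))
    M = numpyro.sample("M", dist.Normal(mu, 1.0))                            # latent model scores
    B = numpyro.sample("B", dist.Normal(0.0, 1.0).expand([n_annotators]))    # annotator biases
    # Observation model (assumed): score = model score + annotator bias + unit Gaussian noise.
    numpyro.sample("S", dist.Normal(M[model_idx] + B[annotator_idx], 1.0), obs=scores)


# Toy observations: which strategy was rated, by whom, and the 1-4 score given.
model_idx = jnp.array([0, 0, 1, 1, 2, 2])
annotator_idx = jnp.array([0, 1, 2, 3, 4, 0])
scores = jnp.array([2.0, 3.0, 3.0, 4.0, 3.0, 2.0])

mcmc = MCMC(NUTS(calibration_model), num_warmup=50, num_samples=150)
mcmc.run(random.PRNGKey(0), model_idx, annotator_idx, scores)
M_samples = mcmc.get_samples()["M"]
print(M_samples.mean(axis=0), M_samples.var(axis=0))   # posterior mean and variance, Eq. (8)
```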

Data: Persona-Chat
We use Persona-Chat, recently released by Zhang et al. (2018) and the main dataset for the Conversational Intelligence Challenge 2 (ConvAI2), 1 to train a neural dialogue model. The dataset contains dialogues between pairs of speakers randomly assigned personas from a set of 1,155, each consisting of 4-5 lines of description about the part they should play, e.g. "I have two dogs" or "I like taking trips to Mexico". The training set consists of 9,907 such dialogues in which pairs of partners play their roles, and the validation set contains 1,000 dialogues. The ConvAI2 test set has not been released. Each dialogue is tokenized into words, resulting in a vocabulary of 19,262 unique tokens. See Zhang et al. (2018) for more details.

Neural dialogue modeling
Model We build an attention-based neural autoregressive sequence model. The encoder has two bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layers with 512 units per direction in each layer, and the decoder has two layers of 512 LSTM units each. We use global general attention as described by Luong et al. (2015). We use the same word embedding matrix for both the encoder and the decoder, initialized from 300-dimensional pretrained GloVe vectors (Pennington et al., 2014) for the 97% of the vocabulary that overlaps with GloVe. We allow the word embedding weights to be updated during training.
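The sketch below shows the overall shape of such a model in PyTorch with the stated hyper-parameters (two-layer bidirectional LSTM encoder, two-layer LSTM decoder, 512 units, Luong-style "general" global attention, shared 300-dimensional embeddings). It is a simplification for illustration: among other things, the decoder state is not initialized from the encoder here, teacher forcing is assumed, and the actual implementation may differ.

```python
import torch
import torch.nn as nn


class Seq2SeqAttention(nn.Module):
    """Simplified attention-based sequence-to-sequence model (teacher forcing)."""

    def __init__(self, vocab_size: int, emb_dim: int = 300,
                 hidden: int = 512, layers: int = 2, dropout: float = 0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # shared by encoder and decoder
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=layers, dropout=dropout,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers, dropout=dropout,
                               batch_first=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden, bias=False)  # fold the two directions
        self.attn = nn.Linear(hidden, hidden, bias=False)          # Luong "general" score
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        enc_out, _ = self.encoder(self.embedding(src))              # (B, S, 2H)
        enc_out = self.enc_proj(enc_out)                             # (B, S, H)
        dec_out, _ = self.decoder(self.embedding(tgt))               # (B, T, H)
        # general attention: score_{t,s} = dec_t . (W enc_s)
        scores = torch.bmm(dec_out, self.attn(enc_out).transpose(1, 2))   # (B, T, S)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)       # (B, T, H)
        return self.out(torch.cat([dec_out, context], dim=-1))            # logits (B, T, V)


# Toy usage: batch of 2 source/target sequences over a 100-word vocabulary.
model = Seq2SeqAttention(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 100, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 100])
```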
Learning We use Adam (Kingma and Ba, 2014) with the initial learning rate set to 0.001. We apply dropout (Srivastava et al., 2014) between the LSTM layers with a dropout rate of 0.5 to prevent overfitting. We train the neural dialogue model until it early-stops on the validation set. 2 The perplexity of the trained model on the ConvAI2 validation set is 24.84, which is competitive with the other entries on the competition's leaderboard. 3 Our model thus serves well as an underlying system for investigating the effect of search algorithms.

Search Strategies
We test three search strategies: greedy and beam search from §2.2, and iterative beam search (iter-beam) from §3.1.
Beam search (beam) uses beam size K = 5 and K' = 15. This decision is based on preliminary experiments in which we found that smaller beam sizes work better than larger ones. We use the length penalty described by Wu et al. (2016) and the n-gram blocking from §2.2.
Iterative beam search (iter-beam) uses 15 iterations of beam search with beam size 5, resulting in a candidate set of size 15. We use the same length penalty and n-gram blocking as in beam search (beam). Given the hyper-parameters above, both beam and iter-beam produce 15 candidates and select the final response based on its log-probability.
2 When the validation loss does not improve for twelve epochs, we early-stop.
3 https://github.com/DeepPavlov/convai/blob/master/leaderboards.md
Figure 2: The distribution of averaged overall scores given by annotators to greedy search (greedy). Each row plots the scores given by a single annotator over multiple conversations. Counts show how many dialogues each annotator performed.

Evaluation
Human evaluation We use ParlAI (Miller et al., 2017), which provides seamless integration with Amazon Mechanical Turk (MTurk) for human evaluation. A human annotator is paired with a model using a specific search strategy; both are randomly assigned personas out of a set of 1,155 and are asked to have a conversation of at least five or six turns (randomly decided). We allow each annotator to participate in at most six conversations per search strategy and collect approximately 50 conversations per search strategy, plus an additional human-human test. 4 Each conversation is given a single overall score and two sequences of binary utterance-pair flags, as described in §4.1.
Bayesian calibration In order to remove annotator bias, or inter-annotator variability, we use the Bayesian calibration from §4.2. We take 50 warm-up steps and collect 150 samples using the NUTS sampler to infer the posterior mean and variance of the model score in Eq. (8). We use 30 warm-up steps and 100 samples to infer the mean and variance of the average proportion of positively or negatively labelled utterance-pairs in Eq. (9). 5 Automatic metrics In addition to human evaluation, we compute automatic metrics to quantitatively characterize each search algorithm. First, we report the log-probability of a generated response assigned by the model, which is a direct indicator of the quality of a search algorithm. Second, we compute the average number of unique n-grams generated per conversation, normalized by the number of generated tokens in the conversation, called distinct-n (Li et al., 2015), with n = 1, 2, 3.
We compute distinct-n in two different settings. First, we compute distinct-n over the candidate set M given by the search algorithm. Second, we compute distinct-n over the final selected responses for each search strategy. The former shows diversity within the possible response candidates, while the latter shows diversity among the actual selected dialogue outputs.
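A sketch of the distinct-n computation as we read the description above (unique n-grams divided by the number of generated tokens, per conversation); the authors' exact normalization may differ.

```python
from typing import List, Sequence


def distinct_n(responses: List[Sequence[str]], n: int) -> float:
    """Unique n-grams across the given responses divided by the total number of tokens."""
    ngrams, total_tokens = set(), 0
    for tokens in responses:
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)


# "post" setting: final selected responses of one conversation;
# "pre" setting: all candidates in M produced during that conversation.
print(distinct_n([["i", "like", "dogs"], ["i", "like", "cats"]], n=2))
```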

Human Evaluation
Annotator bias In Fig. 2, we plot the averaged scores provided by the human annotators for one search strategy (greedy), where each row corresponds to one annotator. Consider the three annotators with ids 3, 4 and 10. Their means are clearly separated from each other, which points to the existence of annotator bias. This observation supports the necessity of the Bayesian calibration described in §4.2.
Human evaluation In Table 1, we present the scores from human evaluation. In total, 41 unique annotators participated in the 201 collected conversations. Our major observation is that greedy search (greedy), which has been the search algorithm of choice in neural dialogue modeling, significantly lags behind the variants of beam search (beam, iter-beam) in all metrics. This stark difference is worth our attention, as it is due solely to the choice of search algorithm and is not the result of different network architectures or learning algorithms. In fact, it cannot even be attributed to different parameter initialization, as we use a single trained model for all of these results.
The scores assigned to human conversations (humans) are far superior to those of all search strategies, both in the overall score and in the proportion of positively labelled utterance pairs. This tells us that there remain many open questions about how to improve neural dialogue models.

Automatic Metrics
Search quality: log-probability (log-p) Better search algorithms find responses with higher log-probability according to the model, as shown in Table 1. This is a natural consequence of exploring a larger subset of the search space.
A notable observation from Table 1 is that the neural sequence model assigns very low log-probabilities to human responses. This implies that there is a limit to improving this specific neural dialogue model by using a better search algorithm, and that there is more room to improve the models and learning algorithms so that they place high probability on human responses. It will be necessary to test the proposed search strategies with such improved models, which we leave for the future.
Diversity: distinct-n The diversity metric is measured before (pre) and after (post) selecting the final response from the candidate set M for both beam search and iter-beam search. Since greedy search and humans produce only a single response, we compute the diversity metric using only those final responses for greedy and humans. In both the pre and post settings, normalization is done per conversation.
As in the human evaluation, greedy has lower diversity than all the other strategies, as shown in Table 2. We see a large gap in pre-selection distinct-n for all n between beam and iter-beam, while the difference is small in post-selection distinct-n. In other words, while iter-beam provides a more diverse set of candidates, the final selected output response is not particularly diverse. This agrees well with the human evaluation, where the iter-beam and beam model scores were indistinguishable, as annotators could only see the final response selected from the candidate set. Table 3 shows pre-selection candidate sets for both beam search and iterative beam search. Finally, we observe a significant gap between the best search strategy and humans in these diversity metrics. Together with the gap we observed in the human evaluation scores, we suspect that the lack of diversity in the output responses is a major factor behind the low performance of the tested neural dialogue model in the human evaluation.

Conclusion and Discussion
We have performed a realistic human evaluation of a neural dialogue model to validate the importance of exploring better search strategies. We observed that careful design of the human evaluation is necessary to properly assess the ability of a neural dialogue model to conduct a conversation with a human. The proposed Bayesian calibration of model scores helps account for the annotator bias observed in human evaluation.
Extensive analysis reveals that greedy search, which has been the inference algorithm of choice in neural dialogue modeling, significantly lags behind more sophisticated search strategies such as beam search and iterative beam search.
We have proposed iterative beam search, which produces a more diverse set of candidate responses according to the pre-selection distinct-n metric. Post-selection final responses from iterative beam search have higher log-probability than those of the other search strategies. In spite of this, there is only a marginal difference between iterative beam search and beam search in human evaluation scores and post-selection distinct-n. This suggests that the final response selection strategy is as important as the search strategy itself and can be a major factor in the inference pipeline of a neural dialogue model. We leave improving this selection strategy to future work. Finally, the model assigns low probability to reference responses, which implies suboptimality of the current neural dialogue model. It will be necessary to test the proposed search strategies with improved models in the future.

A Selected dialogue transcripts from the evaluation
We publish dialogue transcripts from the human evaluation for readers' analysis. We randomly select one transcript per search strategy. Formatted representations are printed in Tables 4 to 6. The second row of each table shows the personalized context from the PersonaChat dataset (Zhang et al., 2018). The third row prints the whole dialogue, where each turn is bounded by a box. The left column, named Annotator, contains responses written by the annotator; the middle column shows the positional ordering of turns; the right column, named Model, contains responses generated by the model. The caption gives the search type and the score given by the annotator. All evaluation transcripts are prepared for the reader in the additional materials, and we encourage everyone to read them.