Automatic Evaluation of Neural Personality-based Chatbots

Stylistic variation is critical to render the utterances generated by conversational agents natural and engaging. In this paper, we focus on sequence-to-sequence models for open-domain dialogue response generation and propose a new method to evaluate the extent to which such models are able to generate responses that reflect different personality traits.


Introduction
The advent of deep learning methods has led to the development of data-driven conversational agents for informal open-domain dialogue (see Serban et al., 2016b, for a review). These chatbot systems model conversation as a sequence-to-sequence (SEQ2SEQ) problem (Sutskever et al., 2014) and rely on large amounts of unannotated dialogue data for training. We investigate whether such models are able to generate responses that reflect different personality traits. We test two kinds of models: the speaker-based model by Li et al. (2016b), where response generation is conditioned on the individual speaker, and a personality-based model similar to Herzig et al. (2017), where generation is conditioned on a personality type.
Evaluating the output of chatbot systems is remarkably difficult (Liu et al., 2016). To make progress in this direction with regard to personality aspects, we propose a new statistical evaluation method that leverages an existing personality recogniser (Mairesse et al., 2007), thus avoiding the need for specialised corpora or manual annotations. We adopt the Big Five psychological model of personality (Norman, 1963), also called OCEAN for the initials of the five personality traits considered: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Each of the traits is represented by a scalar value on a scale from 1 to 7.
In the remainder of the paper, we introduce the models we examine and describe our new evaluation method. Our results show that the models are able to generate output that reflects distinct personalities, over a baseline encoding chance personality variation. We conclude with a brief discussion on related work.

Dialogue Generation Models
The generation models we make use of are standard SEQ2SEQ models consisting of an encoder LSTM, an attention mechanism, and a decoder LSTM (Sutskever et al., 2014; Bahdanau et al., 2015). The model processes context-response pairs, where the context $X = x_1, x_2, \ldots, x_m$ corresponds to the latest utterance(s) in the dialogue and the response $Y = y_1, y_2, \ldots, y_n$ is the utterance to be generated next. The probability of the response $Y$ given the context $X$ is predicted as:
$$p(Y \mid X) = \prod_{t=1}^{n} p(y_t \mid y_1, \ldots, y_{t-1}, X)$$
The attention mechanism by Yao et al. (2015) is used over the hidden states of the encoder LSTM to generate a context vector $c_t$ that determines the relative importance of the words in the context utterance at each decoding step $t$. Then the probability of each word $w_k$ ($k \in \{1, \ldots, |V|\}$, where $V$ is the vocabulary) to be the next word at step $t$ is predicted with a softmax function:
$$p(y_t = w_k \mid y_1, \ldots, y_{t-1}, X) = \frac{\exp\big(W_k f(h_t, c_t)\big)}{\sum_{k'=1}^{|V|} \exp\big(W_{k'} f(h_t, c_t)\big)}$$
where $h_t$ is the hidden state of the decoder LSTM, $f$ is an activation function, and $W_k$ is the $k$-th row of $W$. The weights of matrix $W \in \mathbb{R}^{|V| \times d}$ are learned during training, with $d$ being the number of hidden cells. Both the Speaker Model and the Personality Model we describe below include 4-layer LSTMs with 1024 hidden cells per layer.
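As a rough illustration of the factorisation above, the following Python sketch computes log p(Y|X) by summing per-step log-probabilities obtained from a softmax. The logits are made-up numbers standing in for the decoder output at each step; the function names are hypothetical and nothing here is taken from the actual implementation.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, response_ids):
    """Sum log p(y_t | y_<t, X) over all decoding steps (chain rule)."""
    log_p = 0.0
    for logits, word_id in zip(step_logits, response_ids):
        log_p += math.log(softmax(logits)[word_id])
    return log_p

# Hypothetical toy setting: a vocabulary of 3 words and a 2-word response.
toy_logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
toy_response = [0, 2]
print(sequence_log_prob(toy_logits, toy_response))
```

Working in log space avoids numerical underflow when the per-word probabilities of a long response are multiplied together.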

Speaker Model
Our starting point is the persona-based model by Li et al. (2016b) [1]. In this model, each speaker $s$ is associated with an embedding $\mathbf{v}_s$ learned during training. Whenever a response by speaker $s$ is encountered during training, the corresponding embedding $\mathbf{v}_s$ is inserted into the first hidden layer of the decoder LSTM at each time step (i.e., conditioning each word in the utterance). The hidden state $h_t$ of the decoder LSTM is thus calculated as follows (where $y^*_t$ is the embedding of the response word at time $t$, and $g$ stands for the LSTM cell operations):
$$h_t = g(h_{t-1}, y^*_t, \mathbf{v}_s)$$
Li et al. (2016b) evaluated their model regarding individual content (factual) consistency. Our goal is to evaluate whether the model preserves individual stylistic aspects related to personality traits.

Personality Model
We modify the Speaker Model to allow for the generation of responses reflecting different personality types. To this end, instead of leveraging speaker embeddings, we estimate the OCEAN scores for each speaker and insert a personality embedding $\mathbf{v}_o$ into the first layer of the LSTM decoder [2]. OCEAN scores are 5-dimensional vectors $\mathbf{o}$, where each dimension ranges from 1 to 7. We normalise them to the range [−1, 1] and then embed them with a linear layer whose weights are learned during training, thus learning relationships between OCEAN trait values and properties of the utterances. Whenever a response with personality traits $\mathbf{o}$ is encountered, we insert $\mathbf{v}_o$ into the first hidden layer of the decoder LSTM. Thus, the hidden states $h_t$ are now calculated as:
$$h_t = g(h_{t-1}, y^*_t, \mathbf{v}_o)$$
This version of the model is similar to Herzig et al. (2017) [3]. The authors focus on the customer service domain and evaluate the model output's style for only two personality traits with human evaluation. In contrast, we deal with open-domain chat and assess all OCEAN traits globally, using the automatic method we describe in Section 4.
[1] See http://github.com/jiweil/Neural-Dialogue-Generation. We reimplemented the model in PyTorch.
[2] The procedure for assigning OCEAN scores to a given speaker is explained in the next section.
[3] Our personality model is a modified version of our reimplementation of the code by Li et al. (2016b) (see footnote 1). The code by Herzig et al. (2017) is not readily available.
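The normalisation and embedding step of the Personality Model can be sketched as follows. This is a minimal illustration, not the PyTorch implementation: the embedding size `EMBED_DIM`, the function names, and the omission of a bias term are all assumptions, and the weights are random placeholders for values that would be learned during training.

```python
import random

def normalise_ocean(scores, low=1.0, high=7.0):
    """Map each trait score linearly from [low, high] to [-1, 1]."""
    mid = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return [(s - mid) / half_range for s in scores]

def linear_embed(vector, weights):
    """Multiply a (dim x 5) weight matrix with a 5-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

random.seed(0)
EMBED_DIM = 8  # hypothetical size, chosen only for the example
# Placeholder weights; in a real model these are learned parameters.
weights = [[random.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(EMBED_DIM)]

ocean = [3.5, 4.0, 6.5, 1.0, 7.0]  # made-up O, C, E, A, N scores
v_o = linear_embed(normalise_ocean(ocean), weights)
print(len(v_o))
```

With this mapping, the scale midpoint 4 becomes 0 and the extremes 1 and 7 become −1 and 1.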

Dataset
We use transcripts from two American situation comedy television series: Friends [4] and The Big Bang Theory [5]. We consider only those characters who contribute a minimum of 2000 turns, which results in 13 characters (6 from Friends and 7 from The Big Bang Theory). We assign a unique speaker id to each character. In addition, we estimate the personality of each character as follows: for each character, we randomly select 50 samples of 500 utterances each, and estimate the OCEAN scores for each sample using the personality recogniser by Mairesse et al. (2007), which exploits linguistic features from 'Linguistic Inquiry and Word Count' (Pennebaker and King, 1999) and the MRC Psycholinguistic database (Coltheart, 1981) [6]. We assign to each character the OCEAN score resulting from taking the arithmetic mean of the estimated scores for the corresponding 50 samples.
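The per-character estimation procedure (50 random samples of 500 utterances, averaged) can be sketched as below. The `dummy_recogniser` is a stand-in invented for this example and is not the recogniser by Mairesse et al. (2007); sampling without replacement within each sample is also an assumption, since the text does not specify it.

```python
import random

def estimate_character_ocean(utterances, recogniser,
                             n_samples=50, sample_size=500, seed=0):
    """Average the 5 OCEAN scores over random samples of utterances."""
    rng = random.Random(seed)
    totals = [0.0] * 5
    for _ in range(n_samples):
        sample = rng.sample(utterances, sample_size)  # no repeats within a sample
        scores = recogniser(sample)
        totals = [t + s for t, s in zip(totals, scores)]
    return [t / n_samples for t in totals]

def dummy_recogniser(sample):
    """Placeholder scorer: returns 5 identical scores in [1, 7]
    derived from the mean utterance length. Illustration only."""
    avg_len = sum(len(u.split()) for u in sample) / len(sample)
    score = min(7.0, 1.0 + avg_len / 5.0)
    return [score] * 5

toy_utterances = ["hello there"] * 600  # made-up data
print(estimate_character_ocean(toy_utterances, dummy_recogniser))
```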
We consider every two consecutive turns in a scene to be a context-response pair and annotate each response with either the speaker id or the speaker's OCEAN scores. The resulting dataset contains ∼86k context-response pairs, of which around 2000 pairs were randomly selected and reserved for validation.
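A minimal sketch of the pairing step described above, assuming each scene is available as an ordered list of (speaker, utterance) turns. The function name, the data layout, and the choice to keep only pairs whose responder has an annotation are assumptions made for the example.

```python
def build_pairs(scene, annotation):
    """Turn consecutive turns of one scene into annotated pairs.

    scene: list of (speaker, utterance) tuples in dialogue order.
    annotation: dict mapping a speaker to a speaker id or OCEAN vector.
    Returns a list of (context, response, annotation_of_responder).
    """
    pairs = []
    for (_, context), (speaker, response) in zip(scene, scene[1:]):
        if speaker in annotation:  # skip responders that are not annotated
            pairs.append((context, response, annotation[speaker]))
    return pairs

# Made-up scene and speaker ids, for illustration only.
toy_scene = [
    ("A", "How was your day?"),
    ("B", "Long. Yours?"),
    ("C", "Nobody asked me."),
]
toy_ids = {"A": 0, "B": 1}
print(build_pairs(toy_scene, toy_ids))
```

Pairs are built within a scene only, so the last turn of one scene is never used as context for the first turn of the next.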

Training
Given the relatively small size of the TV-series dataset, following Li et al. (2016b) we use the OpenSubtitles dataset (Tiedemann, 2009) to pretrain the model. OpenSubtitles is a large open-domain repository containing over 50M lines from movie subtitles. Since this data does not include information on which character is the speaker of each line, we simply take each two consecutive lines to be a context-response pair. Due to limitations regarding computational power, we leverage only a subset of the dataset: ∼1.8M pairs for training and ∼75k pairs for validation.
We train a standard SEQ2SEQ model for 15 iterations on the OpenSubtitles training set, until perplexity becomes stable in the validation set. We then initialise the Speaker and Personality models using the parameters learned with OpenSubtitles and train them on the TV-series training set for 30 more iterations, until the perplexity in the corresponding validation set stabilises.
We use the same settings as Li et al. (2016b) for training: we set the batch size to 128, the learning rate to 1.0 (halved after the 6th iteration), the threshold for clipping gradients to 5, and the dropout rate to 0.2. Vocabulary size is 25,000 and the maximum length of an input sentence is 50. All parameters (including the speaker embeddings in the Speaker Model) are initialised by sampling from the uniform distribution on [−0.1, 0.1].

Testing
For testing, we again leverage OpenSubtitles to extract a large subset of ∼2.5M utterances not present in the training or validation sets. Using each of the utterances in this set as context, we let the trained Speaker and Personality models generate responses for the 13 characters, with Stochastic Greedy Sampling (Li et al., 2017). Since general responses are a known problem in neural response generation chatbots (Sordoni et al., 2015; Serban et al., 2016a; Li et al., 2016a; Zhang et al., 2018) and our goal is to focus on personality-related stylistic differences, we remove the 100 most frequent responses common to all characters/personalities. After this cleaning step, we end up with ∼700k responses per character/personality. We refer to the clean set of generated responses as the evaluation set.
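One possible reading of the cleaning step is sketched below: keep only the responses that occur for every character, rank them by total frequency, and drop the top 100 from every character's output. The exact ranking criterion is not specified in the text, so this interpretation, like the function name and data layout, is an assumption.

```python
from collections import Counter

def remove_generic(responses_by_char, top_k=100):
    """Drop the top_k most frequent responses shared by all characters.

    responses_by_char: dict mapping a character to a list of responses.
    Returns (cleaned dict, set of removed generic responses).
    """
    counts = {c: Counter(r) for c, r in responses_by_char.items()}
    # Responses generated at least once for every character.
    common = set.intersection(*(set(c) for c in counts.values()))
    total = Counter()
    for char_counts in counts.values():
        for resp in common:
            total[resp] += char_counts[resp]
    generic = {resp for resp, _ in total.most_common(top_k)}
    cleaned = {
        char: [r for r in resps if r not in generic]
        for char, resps in responses_by_char.items()
    }
    return cleaned, generic

# Made-up generated responses for two hypothetical characters.
toy = {
    "a": ["i don't know", "i don't know", "pizza!"],
    "b": ["i don't know", "bazinga"],
}
cleaned, generic = remove_generic(toy, top_k=1)
print(generic, cleaned)
```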

Evaluation Method
We propose a new evaluation method to measure whether persona-based neural dialogue generation models are able to produce responses with distinguishable personality traits for different characters and different personality types.
Using the evaluation set, for each character we randomly select 250 samples of 500 responses and calculate the OCEAN scores for each sample. Recall that the OCEAN scores correspond to 5-dimensional vectors. We label each of these 250 vectors with the corresponding character. This gives us 13 gold classes, one for each character, with 250 datapoints each. We then use a support vector machine classifier [7] to test to what extent the OCEAN scores estimated from the generated responses allow us to recover the gold character classes. We compute results using 5-fold cross-validation (training on 80% of the set and testing on the remaining 20% once for each fold). We report average scores over ten iterations of this procedure (i.e., 5 × 10).
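The classification protocol above can be approximated with scikit-learn as in the following sketch. Stratified folds, macro-averaged F1, and the fixed value of `C` are assumptions made for the example (the text only states 5-fold cross-validation repeated ten times with a tuned `C`), and the data is synthetic rather than real OCEAN estimates.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

def character_f1(ocean_vectors, labels, n_splits=5, n_repeats=10, C=1.0, seed=0):
    """Mean macro F1 of an RBF-kernel SVM over repeated k-fold CV."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    clf = SVC(kernel="rbf", C=C)
    scores = cross_val_score(clf, ocean_vectors, labels, cv=cv, scoring="f1_macro")
    return float(scores.mean())

# Synthetic stand-in data: 3 "characters", 40 five-dimensional vectors each,
# drawn around clearly separated means on the 1-7 scale.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.1, size=(40, 5)) for m in (3.0, 4.0, 5.0)])
y = np.repeat([0, 1, 2], 40)
print(character_f1(X, y))
```

Running the same function on a copy of `y` with the labels shuffled gives a chance-level reference point against which such scores can be read.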
We consider a baseline obtained by randomising the gold character label in the set of generated responses, which indicates the level of performance we may expect by chance. In addition, we use the procedure described above to discriminate between characters using their original (gold) utterances from the transcripts, rather than model-generated responses. This serves as a sanity check for the personality recogniser used to estimate the OCEAN scores: if the recogniser cannot detect personality differences among the characters in the original transcripts, it is not reasonable to expect that the models will be able to generate responses with different personality styles. It also provides an upper bound for the performance we can expect to achieve when evaluating generated responses.
Given that the particular personality recogniser we use (Mairesse et al., 2007) was not optimised for dialogues from TV-series transcripts, as an additional sanity check we compare its performance on the original (gold) utterances with a bag-of-words (BoW) approach. This allows us to test whether the recogniser may only be detecting trivial patterns of word usage [8]. We select the top 200 most frequent words over the original utterances as features, without removing words typically considered stop words such as pronouns or discourse markers, since they may be personality indicators. Then we run the same classification procedure using these BoW representations.
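The BoW representation described above can be sketched as follows. Lowercasing and whitespace tokenisation are assumptions made for the example, as are the function names; the actual preprocessing is not specified in the text.

```python
from collections import Counter

def top_words(utterances, k=200):
    """Return the k most frequent words (stop words are kept)."""
    counts = Counter(w for u in utterances for w in u.lower().split())
    return [w for w, _ in counts.most_common(k)]

def bow_vector(sample, vocabulary):
    """Count how often each vocabulary word occurs in a sample of utterances."""
    counts = Counter(w for u in sample for w in u.lower().split())
    return [counts[w] for w in vocabulary]

# Made-up utterances, for illustration only.
toy_utterances = ["I know", "I see", "you know"]
vocab = top_words(toy_utterances, k=2)
print(vocab, bow_vector(["I know I know"], vocab))
```

Each sample of utterances then yields one count vector, which plays the same role in the classifier as the 5-dimensional OCEAN vector.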

Results
In Table 1, we report average F1 score per character (including precision and recall) for the Speaker and the Personality models, as well as the baseline and gold data. The results for these four conditions are all statistically significantly different from each other [9].
[7] We use the SVM implementation in Python's scikit-learn library with radial basis function kernel. We tune the regularisation parameter C and use default settings for all other parameters. We tried a range of different algorithms, including k-means and agglomerative clustering as well as a multi-layer perceptron classifier, always obtaining the same trends in the results.
[8] We thank one of the anonymous reviewers for suggesting this additional test.

Lower and Upper Bounds
The first thing to note is that the results on the gold transcripts are higher than the baseline, reaching 61% F1 score on Friends and 69% on The Big Bang Theory. This indicates that the evaluation method is able to distinguish between the different personalities in the data reasonably well. Apparently, The Big Bang Theory characters are more distinct from each other than those in Friends.
When we use the BoW approach on the gold transcripts instead of the representations by the personality recogniser, we obtain significantly lower results: 23% F1 score on Friends and 19% on The Big Bang Theory [10]. The personality recogniser thus detects patterns that go beyond what can be captured with BoW representations.

Speaker and Personality Models
We find that the responses generated by the Speaker model display consistent personality variation above baseline, although a significant level of the personality markers found in the original data seems to be lost (32% vs. 61% and 47% vs. 69%). The results obtained for the Personality model are significantly above baseline as well (23% vs. 16% and 30% vs. 15%). We also see that the personality traits found in the responses generated by the Personality model yield lower distinguishability than those by the Speaker model. This is to be expected, since the Personality model generates responses for a personality type, which should be more varied (and hence less distinguishable) than those by an individual speaker.
[9] Significance is tested with a two-independent-sample t-test on the results of 10 iterations, first using Levene's test to assess the equality of variances and then applying Welch's or Student's t-test accordingly.
[10] We also run this experiment removing stop words (using the list of English stop words from scikit-learn), obtaining almost identical results: 22% F1 score on Friends and 18% on The Big Bang Theory.
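The significance-testing procedure of footnote 9 can be sketched with SciPy as below. The 0.05 threshold used to decide between Student's and Welch's test is an assumption, and the two lists of scores are invented numbers, not results from the experiments.

```python
from scipy import stats

def compare_conditions(scores_a, scores_b, alpha=0.05):
    """Levene's test for equal variances, then Student's or Welch's t-test."""
    _, levene_p = stats.levene(scores_a, scores_b)
    equal_var = bool(levene_p >= alpha)  # no evidence of unequal variances
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=equal_var)
    test_name = "Student" if equal_var else "Welch"
    return test_name, float(t_stat), float(p_value)

# Made-up F1 scores from 10 iterations of two hypothetical conditions.
cond_a = [0.32, 0.33, 0.31, 0.32, 0.34, 0.30, 0.33, 0.32, 0.31, 0.33]
cond_b = [0.16, 0.15, 0.17, 0.16, 0.14, 0.16, 0.17, 0.15, 0.16, 0.16]
print(compare_conditions(cond_a, cond_b))
```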
An advantage of the Personality model, however, is that in principle it allows us to generate responses for novel, predefined personalities that have not been seen during training. To test this potential, we create five extreme personality types by setting the score of one of the OCEAN traits to a high value (6.5) and all remaining four traits to an average value (3.5). We then let the model generate responses to all the utterances in the evaluation set for each of these extreme personalities and evaluate the extent to which the responses differ in style following the same procedure as in the previous experiment.

Related Work
In recent years, there has been a surge of work on modelling different stylistic aspects, such as politeness and formality, in Natural Language Generation with deep learning methods (among others, Sennrich et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Niu and Bansal, 2018). Regarding generation in dialogue systems, besides the two response generation models we have tested, other recent approaches to open-domain dialogue have considered stylistic aspects. For example, metadata about speakers' personal information, such as age and gender, has been leveraged to condition generation using domain adaptation methods, while Luan et al. (2017) use multi-task learning to incorporate an autoencoder that learns the speaker's language style from non-conversational data such as blog posts. The output of these models could also be assessed for personality differences using our method.
More recently, Oraby et al. (2018) have used the statistical rule-based generator PERSONAGE (Mairesse and Walker, 2010) to create a synthetic corpus with personality variation within the restaurant domain. They use the data to train and evaluate neural generation models that produce linguistic output given a dialogue act and a set of semantic slots, plus different degrees of personality information, and show that the generated output correlates reasonably well with the synthetic data generated by PERSONAGE. Our work differs from Oraby et al. (2018) in several respects: we focus on open-domain chit-chat dialogue, where the input to the model is surface text (rather than semantic representations such as dialogue acts) from naturally occurring dialogue data. Rather than relying on parallel data with systematic personality variation, we exploit a personality recogniser. In this respect, our approach has some similarities to Niu and Bansal (2018), who use a politeness classifier for stylistic dialogue generation. Here we have used the personality recogniser by Mairesse et al. (2007), which may not be ideal as it was originally trained on snippets of conversations combined with stream-of-consciousness essays. Our method, however, is not tied to this particular recogniser: any other personality recogniser that produces numerical scores may be used instead.
We think that the automatic evaluation method we have proposed can be a useful complement to qualitative human evaluation of chatbot models.
Our study shows that the models under investigation produce output that retains some stylistic features related to personality, and can learn surface patterns that generalise beyond the training data.