Controllable Text Generation with Focused Variation

This work introduces Focused-Variation Network (FVN), a novel model to control language generation. Previous controlled language generation models struggle either to generate text that follows the given attributes or to produce diverse outputs. FVN addresses these issues by learning disjoint discrete latent spaces for each attribute inside codebooks, which allows for both controllability and diversity while generating fluent text. We evaluate FVN on two text generation datasets annotated with content and style, and show state-of-the-art performance as assessed by automatic and human evaluations.


Introduction
Recent developments in language modeling (Dai et al., 2019; Radford et al., 2018; Holtzman et al., 2020; Khandelwal et al., 2020) make it possible to generate fluent and mostly coherent text. Despite the quality of the samples, regular language models cannot be conditioned to generate language depending on attributes. Conditional language models have been developed to solve this problem, with methods that either train models given predetermined attributes (Shirish Keskar et al., 2019), use conditional generative models (Kikuchi et al., 2014; Ficler and Goldberg, 2017), fine-tune models using reinforcement learning (Ziegler et al., 2019), or modify the text on the fly during generation (Dathathri et al., 2020).
As many researchers have noted, injecting style into natural language generation can increase the naturalness and human-likeness of text by including pragmatic markers characteristic of oral language (Biber, 1991; Paiva and Evans, 2004; Mairesse and Walker, 2007). Text generation with style variation has been explored as a special case of conditional language generation that aims to map attributes such as informational content (usually structured data representing meaning, like frames with keys and values) and style (such as personality or politeness) into one of many natural language realisations that conveys them (Novikova et al., 2016, 2017; Wang et al., 2018). As the examples in Table 1 show, for one given content frame there can be multiple realisations. When a style (a personality trait in this case) is injected, the text is adapted to that style (words in red) while conveying the correct informational content (words in blue). A key challenge is to generate text that respects the specified attributes while producing diverse outputs: most existing methods either fail to generate text that follows the given attributes or exhibit a lack of diversity among samples, leading to dull and repetitive expressions.

* Work done while at Uber AI Labs.
Conditional VAEs (CVAE) (Sohn et al., 2015) and their variants have been adopted for the task and are able to generate diverse texts, but they suffer from posterior collapse and do not strictly follow the given attributes because their latent space is pushed towards being a Gaussian distribution irrespective of the different disjoint attributes, conflating the given content and style.
An ideal model would learn a separate latent space that focuses on each attribute independently. For this purpose, we introduce a novel natural language generator called Focused-Variation Network (FVN) 1 . FVN extends the Vector-Quantised VAE (VQ-VAE) (van den Oord et al., 2017), which is non-conditional, to allow conditioning on attributes (content and style). Specifically, FVN: (1) models two disjoint codebooks, for content and style respectively, that memorize input text variations; (2) further controls the conveyance of attributes by using content- and style-specific encoders and decoders; (3) computes disjoint latent space distributions conditional on the content and style respectively, which allows sampling latent representations in a focused way at prediction time. This choice ultimately helps both attribute conveyance and variability. As a result, FVN can preserve the diversity found in training examples, as opposed to previous methods that tend to cancel out diverse examples. FVN's disjoint modeling of content and style increases the conveyance of the generated text, while at the same time producing more natural and fluent text. We tested FVN on two datasets, PersonageNLG (Oraby et al., 2018) and E2E (Dušek et al., 2020), which consist of content-utterance pairs, with personality labels in the first case, and the experimental results show that it outperforms previous state-of-the-art methods. A human evaluation further confirms that the naturalness and conveyance of FVN-generated text is comparable to ground-truth data.

Related Work
Our work is related to CVAE-based text generation (Bowman et al., 2016; Shen et al., 2018), where the goal is to control a given attribute of the output text (for example, style) by providing it as additional input to a regular VAE. For instance, the controlled text generation method proposed by Hu et al. (2017) extends the VAE and focuses on controlling attributes of the generated text like sentiment and style. Differently from ours, this method does not focus on generating text from a content meaning representation (CMR) or on the diversity of the generated text. Song et al. (2019) use a memory-augmented CVAE to control for persona, but with no control over the content.
The works of (Oraby et al., 2018;Harrison et al., 2019;Oraby et al., 2019) on style-variation generators adopt sequence-to-sequence based models and use human-engineered features (Juraska and Walker, 2018) (e.g. personality parameters or syntax features) as extra inputs alongside the content and style to control the generation and enhance text variation. However, using human-engineered features is labor-intensive and, as it is not possible to consider all possible feature combinations, performance can be sub-optimal. In our work we instead rely on codebooks to memorize textual variations.
There is a variety of works that address the problem of incorporating knowledge or structured data into the generated text (for example, entities retrieved from a knowledge base) (Ye et al., 2020), or that try to generate text that is in line with a given story (Rashkin et al., 2020). None of these works focuses specifically on generating text that conveys content while at the same time controlling style. Last, there are works such as Rashkin et al. (2018) that focus on generating text consistent with an emotion (aiming to create an empathetic agent) without, however, directly controlling the content.

Methodology
Our proposed FVN architecture (Figure 1) has the goal of generating diverse texts that respect every attribute provided as a controlling factor.

Figure 1: Focused-Variation Network (FVN) has four encoders (text-to-content encoder, text-to-style encoder, content encoder and style encoder), two codebooks (for content and style), and one text decoder. The training data contains ground-truth text with associated content and style. The text decoder uses v^C and v^S, latent vectors of content and style, as well as the latent vectors e^C_k and e^S_n from the codebooks (the nearest to the z^C and z^S vectors produced by the text-to-content and text-to-style encoders) to generate the text back. To further control the content and style of the generated text, we feed the o_L output vectors of the generated text t' to the text encoders (content and style). o_L are aligned to a word embedding codebook.

We describe a specific instantiation where the attributes
are content (a frame CMR containing slots keys and values) and style (personality traits). However, the same architecture can be used with additional attributes and / or with different types of content attributes (structured data tables and knowledge graphs for instance) and style attributes (linguistic register, readability, and many others). To encourage conveyance of the generated texts, FVN learns disjoint discrete content-and style-focused representation codebooks inspired by VQ-VAE as extra information along with the representations of intended content and style, which avoids the posterior collapse problem of VAEs.
During training, FVN receives as input an intended content c and style s as well as a reference text t. The reference text is passed through two encoders (text-to-content and text-to-style), while content and style are encoded with a content encoder and a style encoder. The text-to-content encoder maps the input text t into a content latent vector z^C and the text-to-style encoder maps t into a style latent vector z^S. The vectors closest to z^C and z^S in the content codebook e^C and style codebook e^S, namely e^C_k and e^S_n, are selected. The content encoder encodes the intended content frame into a latent vector v^C and the style encoder encodes the intended style into a latent vector v^S. A text decoder then receives e^C_k, e^S_n, v^C and v^S and generates the output text t'. The generated text is subsequently fed to a content and a style decoder that predict the intended content and style.
At prediction time (Figure 2), only content c and style s are given. To obtain e^C_k and e^S_n without an input text, we (A) collect a distribution over the codebook indices by counting, for each training datapoint containing a specific value of c and s, the number of times a specific index is used, and (B) sample e^C_k and e^S_n from these frequency distributions. These disjoint distributions allow the model to focus on specific content and style by conditioning on them, and the sampling allows for variation, hence the name focused variation. v^C and v^S, obtained from the content and style encoders, and the sampled e^C_k and e^S_n are provided to the text decoder that generates t'.
The rest of this section will detail each component and the training and prediction processes.

Encoding and Codebooks
As shown in Figure 1, FVN uses four encoders and one decoder during training: the text-to-content encoder Enc_TC(·), the text-to-style encoder Enc_TS(·), the content encoder Enc_C(·), the style encoder Enc_S(·), and the text decoder Dec(·).
Text-to-* encoders  The text-to-content encoder Enc_TC(·) encodes a text t into a dense representation z^C ∈ R^D, while the text-to-style encoder Enc_TS(·) encodes t into a dense representation z^S ∈ R^D: z^C = Enc_TC(t) and z^S = Enc_TS(t).
In order to learn disjoint latent spaces for the different attributes we want to model, we train two codebooks, one for content e^C ∈ R^{K×D} and one for style e^S ∈ R^{N×D}. They are shown as [e^C_1, ..., e^C_K] and [e^S_1, ..., e^S_N] in Figure 1. These two codebooks are used to memorize the latent vectors for text-to-content variation and text-to-style variation learned during training. Instead of using the z^C and z^S vectors as inputs to the decoder, we find their nearest latent vectors in the codebooks, e^C_k and e^S_n, and use those for decoding the text instead of the original encoded dense representations. Formally, k = argmin_i ‖z^C − e^C_i‖_2 and n = argmin_j ‖z^S − e^S_j‖_2. As in VQ-VAE, we use the l2-norm error to move the latent vectors in the codebooks e towards the same space as the encoder outputs z:

L_VQ = ‖sg(z^C) − e^C_k‖²_2 + β‖z^C − sg(e^C_k)‖²_2 + ‖sg(z^S) − e^S_n‖²_2 + β‖z^S − sg(e^S_n)‖²_2,

where sg(·) stands for the stop-gradient operator.
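The codebook lookup and VQ loss can be sketched as follows. This is a minimal NumPy illustration under our own simplifications, not the authors' implementation: in an autodiff framework sg(·) would be a detach/stop-gradient, and gradients would be copied straight-through from the quantized vector to the encoder output.

```python
import numpy as np

def quantize(z, codebook):
    """Return index and vector of the codebook row nearest to z (l2 norm)."""
    dists = np.linalg.norm(codebook - z, axis=1)  # ||z - e_i||_2 for each row
    k = int(np.argmin(dists))
    return k, codebook[k]

def vq_loss(z, e_k, beta=0.25):
    """VQ-VAE style loss: a codebook term (moves e_k toward sg(z)) plus a
    commitment term (moves z toward sg(e_k)). With no autodiff, sg(.) is a
    no-op here, so both terms share the same numeric value."""
    codebook_term = np.sum((z - e_k) ** 2)
    commitment_term = beta * np.sum((z - e_k) ** 2)
    return codebook_term + commitment_term

# toy codebook with K=4 entries of dimension D=4
codebook = np.eye(4)
k, e_k = quantize(np.array([0.9, 0.1, 0.0, 0.0]), codebook)
# the first basis vector is the nearest entry, so k == 0
```

At training time the same lookup is performed twice, once against the content codebook and once against the style codebook.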
Style and content encoders  The content encoder encodes a CMR c, treating it as a sequence of tokens and producing a matrix V^C ∈ R^{L'×D}, where L' is the length of c, from which the last element v^C ∈ R^D is returned. The style encoder encodes a style s and obtains a dense representation v^S ∈ R^D by selecting the last element of the matrix V^S ∈ R^{L'×D}. Ultimately, v^C = Enc_C(c) and v^S = Enc_S(s).
Both sets of vectors, e and v, are needed: the former learn to memorize the encoded inputs z, while the latter learn regularities in the attributes.

Text Decoder
The decoder takes e^C_k, e^S_n, v^C and v^S, which encode content and style, as input and decodes text t'. We use an LSTM network to model the decoder and provide the initial hidden state h_0 and initial cell state c_0. The initial hidden state is the concatenation of e^C_k and e^S_n, while the initial cell state is the concatenation of v^C and v^S:

h_0 = [e^C_k ; e^S_n],  c_0 = [v^C ; v^S].

When we decode the l-th word, we encode the previous word t_{l−1} and attend to the encoded sequences of content and style using the last hidden state as a query. Since both content and style are sequences of tokens, the attention mechanism helps determine which parts of them are important for decoding the current word. We concatenate the embedded previous output word and the attention output as the LSTM input x_l. The LSTM updates the hidden state and cell state, and produces an output vector g_l ∈ R^{2D}. Since we want to feed the generated text back to the text encoders for additional control, we reduce g_l to a vector o_l of word-embedding dimension via a linear transformation. Finally, we map o_l to the size of the vocabulary and apply softmax to obtain a probability distribution over the vocabulary.
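The attention step can be sketched with plain dot-product attention. This is a simplified NumPy illustration; the paper does not specify the exact attention variant, so the scoring function here is an assumption.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each encoder state by its similarity
    to the decoder's last hidden state (query) and return the weighted sum."""
    scores = keys @ query      # (L,) similarity of each state to the query
    weights = softmax(scores)  # attention distribution over positions
    return weights @ keys      # (D,) context vector

# the decoder input at step l would then be
# x_l = [embedding of previous word ; attend(h_{l-1}, encoded sequence)]
```

With a strongly matching query, the context vector collapses onto the matching encoder state, which is the behavior the decoder relies on to copy slot information.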
The loss for text decoding is the sum of the cross-entropy loss of each word: L_Dec = −Σ_l log P(t_l).

Content and Style Decoders
To ensure the generated text t' conveys the correct content and style, we feed it to content and style decoders that perform backward prediction tasks to better control the generator. The decoders contain two components: we first reuse the text-to-content and text-to-style encoders to encode the embedded predicted text o_L and obtain latent representations z'^C and z'^S, and then we classify them to predict content c' and style s', as shown in the right side of Figure 1: z'^C = Enc_TC(o_L) and z'^S = Enc_TS(o_L). Enc_TC(·) and Enc_TS(·) denote the same text-to-content and text-to-style encoders defined previously. This design is inspired by work on text style transfer (dos Santos et al., 2018).
Both the z vectors and the e vectors are used by two classification heads, F^C (multi-label) and F^S (multi-class), for predicting content and style respectively, in order to force those vectors to encode attribute information. We use g to denote the g-th element in the set of possible key-value pairs in the CMR and m(·) to represent an indicator function that returns whether the g-th element is in the ground-truth CMR.
The loss for training the two prediction heads is:

L_Pred = −Σ_g [m(g) log F^C(z'^C)_g + (1 − m(g)) log(1 − F^C(z'^C)_g)] − log F^S(z'^S)_s.

Finally, we also adopt vector quantization by mapping each generated word's representation o_l to the word embeddings e^V ∈ R^{|V|×D}, so that the output of the decoder and the input of the text encoders lie in the same space. This is needed because the text-to-* encoders expect as input text embedded using word embeddings, but in this case we are providing o_L as input; without this vector quantization loss, o_L would not be in the same space as the embeddings. As a result, there is another VQ loss:

L_WVQ = Σ_l (‖sg(o_l) − e^V_{q_l}‖²_2 + β‖o_l − sg(e^V_{q_l})‖²_2), where q_l = argmin_j ‖o_l − e^V_j‖_2.

The total loss minimized during training is the sum of the losses for decoding the text, predicting the content and style, the VQ loss from the two codebooks, and the VQ loss for the word embeddings:

L = L_Dec + L_Pred + L_VQ + L_WVQ.

Figure 2: At prediction time we encode c and s with their encoders to obtain v^C and v^S, and we select e^C_k by sampling k ∼ P(K|C = c) and e^S_n by sampling n ∼ P(N|S = s). These four vectors are provided as input to the text decoder to generate text.

Prediction
The whole prediction process is depicted in Figure 2. The trained text decoder expects four inputs: v^C, v^S, e^C_k, and e^S_n. At prediction time, only content c and style s are given. We can obtain v^C and v^S by providing c and s to their respective encoders, but we also need to obtain e^C_k and e^S_n without an input text. At the end of the training phase, we map each content c ∈ C and style s ∈ S to indices in the e^C and e^S codebooks: we first obtain the z^C and z^S vectors from the training data associated with c and s, find the index of the closest codebook vectors by argmin_k ‖e^C_k − z^C‖_2 and argmin_n ‖e^S_n − z^S‖_2, and count how many times each index k ∈ K was the closest for each c ∈ C, and likewise for indices n ∈ N for each s ∈ S. By normalizing the counts, we obtain a distribution P(K|C) for content and a distribution P(N|S) for style. The construction of the two distributions is performed only once, at the end of the training phase.
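The construction of these frequency distributions and the sampling step can be sketched in a few lines of Python. This is a toy illustration with hypothetical (attribute, index) pairs; the real model derives the pairs from nearest-codebook lookups over the training data.

```python
import random
from collections import Counter, defaultdict

def build_index_distribution(training_pairs):
    """training_pairs: iterable of (attribute_value, codebook_index) pairs
    collected over the training set. Returns P(index | attribute) as
    normalized counts, e.g. P(K | C = c) or P(N | S = s)."""
    counts = defaultdict(Counter)
    for attr, idx in training_pairs:
        counts[attr][idx] += 1
    return {
        attr: {idx: n / sum(c.values()) for idx, n in c.items()}
        for attr, c in counts.items()
    }

def sample_index(dist, attr, rng=random):
    """Sample a codebook index k ~ P(K | attribute = attr)."""
    indices, probs = zip(*dist[attr].items())
    return rng.choices(indices, weights=probs, k=1)[0]

# toy example: the "extravert" style mostly used code 7 during training
dist = build_index_distribution([
    ("extravert", 7), ("extravert", 7), ("extravert", 2), ("agreeable", 4),
])
```

Sampling (rather than taking the argmax) is what preserves variation across generations for the same attribute value.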
To obtain e^C_k at prediction time, we select the k-th vector of the codebook by sampling k ∼ P(K|C = c), and likewise obtain e^S_n with n ∼ P(N|S = s). Sampling from those distributions allows the model to focus on specific content and style disjointly by conditioning on them, while at the same time allowing variability because of the sampling (we refer to this procedure as focused variation). v^C, v^S, e^C_k, and e^S_n are finally provided as inputs to the decoder to generate the text t'. The content and style decoders mentioned in the training section are not needed for prediction.

Experiments
To test the capability of FVN to generate diverse texts that convey the content while adopting a certain style, we use the PersonageNLG text generation dataset for dialogue systems, which contains CMR and style annotations. To test if FVN can convey the content (both slots and values) correctly on an open vocabulary, with complex syntactic structures and diverse discourse phenomena, we use the End-to-End Challenge dataset (E2E), a text generation dataset for dialogue systems annotated with CMR. Following Dušek et al. (2018), we delexicalized only 'name' and 'near', keeping the remaining slots' values. Since the E2E dataset does not have style annotations but has lexicalized texts, we model the CMR in the same way we did for PersonageNLG, but we replace the style codebook with a slot-value codebook that helps the text decoder generate the slot values in the CMR. We build the focused-variation distribution for every slot value independently over the codebook indices, e.g. P(N|s = PriceRange[high]), P(N|s = FoodType[French]), and so on. During prediction we sample codes for each slot value in the CMR and use their average to condition text decoding. This is particularly useful when the surface forms in the output text are not the slot values themselves, e.g. when "PriceRange[high]" should be generated as "expensive" rather than "high".
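The per-slot-value sampling and averaging used for E2E could look like the following sketch. Everything here is a hypothetical stand-in: `slot_dists` represents the learned per-slot-value index distributions and `codebook` is a toy NumPy array.

```python
import random
import numpy as np

def condition_vector(cmr, slot_dists, codebook, rng=random):
    """Sample one codebook index per slot-value string in the CMR and
    average the corresponding codebook vectors to condition the decoder."""
    vectors = []
    for slot_value in cmr:  # e.g. "PriceRange[high]", "FoodType[French]"
        dist = slot_dists[slot_value]
        indices, probs = zip(*dist.items())
        idx = rng.choices(indices, weights=probs, k=1)[0]
        vectors.append(codebook[idx])
    return np.mean(vectors, axis=0)
```

Averaging per-slot codes, rather than indexing the distribution by the whole CMR, is what lets the model handle CMR combinations never seen during training.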

Datasets and Baselines
We use NLTK (Bird et al., 2009) to tokenize each sentence and delexicalize the text as described in Dušek and Jurcicek (2016a). We use 300-dimensional GloVe embeddings (Pennington et al., 2014) trained on 840B words. Words not in GloVe are initialized as the average of all other embeddings plus a small amount of random noise to make them different from each other. The details of each module in FVN are listed in Table 2. We set D = 300, K = 512, N = |V|. The encoders are three-layer stacked Bi-LSTMs and the text decoder is a one-layer LSTM. The style/slot-value codebook is initialized with pre-trained word embeddings. The content codebook is uniformly initialized in the range [−1/K, 1/K]. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001 for minimizing the total loss. More dataset details are shown in Appendix A, Tables 13 and 14.
We compare our proposed model against the best performing models on both datasets. All of them are sequence-to-sequence based models. For PersonageNLG, TOKEN and CONTEXT (from Oraby et al. (2018)) are variants of the TGEN model (Novikova et al., 2017). The results of the baselines (Oraby et al., 2018; Harrison et al., 2019) are taken from their original papers. It is unclear whether they were evaluated using a single or multiple references (for this reason they are marked with †), but since these models do not depend on sampling from a latent space, we would not expect that to change performance.
We also compare to conditional VAEs: CVAE implements the conditional VAE (Sohn et al., 2015) framework. Controlled CVAE implements the controlled text generation (Hu et al., 2017) framework. The architecture and hyper-parameters of CVAE and controlled CVAE are the same as FVN.
The FVN ablations used in our evaluation are: (1) FVN-ED does not use the codebooks; it only uses the content and style encoders and decoders, and is equivalent to an attention-augmented sequence-to-sequence model; (2) FVN-VQ does not use the content and style encoders and decoders; it directly uses the sampled latent vector for text decoding (3.3); (3) FVN-EVQ does not use the content and style decoders; (4) FVN is the full network. Refer to Table 2 for architecture details. All VAE and FVN variants are evaluated using multiple references, because sampling from the latent space may generate a valid and fluent text that n-gram overlap metrics would not score highly when evaluated against a single reference.

Automatic Evaluation
We evaluate the quality and diversity of the generated text on both datasets. PersonageNLG is style-annotated and delexicalized, so we also report style and content correctness for it.
To evaluate the quality of the generated text, we use the automatic evaluation from the E2E generation challenge, which reports BLEU (n-gram precision) (Papineni et al., 2002), NIST (weighted n-gram precision), METEOR (Banerjee and Lavie, 2005), and ROUGE (n-gram recall) (Lin, 2004) scores using up to 9-grams. To evaluate content correctness, we report micro precision, recall, and F1 score of slot special tokens in the generated text with respect to the slots in the given CMR c. To evaluate diversity, we report the distinct n-grams of ground-truth and baselines' examples. For style evaluation, we separately train a personality classifier (with GloVe embeddings, 3 bi-directional LSTM layers, and 2 feed-forward linear layers) on the PersonageNLG training data. The macro precision, recall, and F1 score of the personality classifier on the test set is 0.996. We use this classifier to evaluate the style of the generated text and report our results in Table 5.
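The distinct n-grams used in the diversity evaluation can be computed in a few lines. This follows the standard definition (unique n-grams, optionally normalized by the total n-gram count); the exact normalization used in the paper's tables may differ.

```python
def distinct_ngrams(texts, n):
    """Return (number of unique n-grams, ratio of unique to total n-grams)
    over a list of whitespace-tokenized texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0, 0.0
    unique = len(set(ngrams))
    return unique, unique / len(ngrams)
```

A low ratio signals repetitive, dull generations; a ratio close to the ground truth's indicates the model preserves the diversity present in the training data.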

PersonageNLG Human Evaluation
In addition to the automatic evaluation, we conducted a crowdsourced evaluation to compare our model against the ground truth on the entire test set. We did not compare our model with the baselines, since a pilot evaluation on a random sample of 100 data points from the test set suggested that the baselines did not produce fluent enough text to compare with FVN. We considered the ground truth to be a performance upper bound and compared against it to find how close FVN is to it. Crowdworkers were presented with a personality and two sentences (one ground truth and one generated by FVN) in random order, and were asked to evaluate (A) the fluency of the sentences on a scale from 1 to 3 and (B) which of the two sentences was most likely to be uttered by a person with the given personality (more details in Appendix C). This evaluation was conducted on the entire test set consisting of 1,390 data points, 278 per personality, and each data point was judged by three different crowdworkers. We report the results for Question A in Table 9. For each sentence, we averaged the scores across the three judges. The overall performance of FVN is very close to the ground truth (2.81 vs. 2.9), which suggests that FVN can generate text of comparable fluency to ground-truth texts.
We evaluated Question B using a majority vote of the three crowdworkers. Considering the overall performance, 50.29% of the time human evaluators considered FVN-generated text equal to or better than the ground truth at conveying personality. This suggests that FVN can generate text with conveyance comparable to the ground truth.
More details and a full breakdown on the human evaluation are available in Appendix C.

Results and Analysis
Tables 3, 4 and 5 show the results on text quality, content correctness, and style. As shown in Table 3, FVN significantly outperforms the state-of-the-art methods (context-m), especially on BLEU and NIST, which evaluate the precision of generated text, with the caveat regarding single or multiple references explained above. We believe this is because FVN explicitly models CMR and style, while context-m depends on human-engineered features. Compared with CVAE and controlled CVAE, similar methods that also sample from the latent space, FVN performs better on all metrics. The human evaluation results in Section 4.3 show that FVN is close to the ground truth in fluency and style.
Regarding the content correctness evaluation in Table 4, FVN overall performs much better than the other baselines, especially on recall. Methods with explicit control decoders (controlled CVAE and FVN) perform better than CVAE and FVN-EVQ, which suggests that the controlling module helps content conveyance. Regarding the style evaluation in Table 5, all methods perform well. Style is likely easy to convey in the text (the markers are quite specific) and easy to identify for the separately trained personality classifier. Nevertheless, FVN is the best performing model. The text diversity comparison in Table 8 shows that FVN and its ablations generate texts whose diversity is comparable to the ground-truth texts, as do the VAE-based methods. The combination of these findings suggests that FVN can produce text with comparable or better diversity than VAEs and the ground truth, while conveying content and style more accurately.
Comparing with the ablations, the full FVN always performs better than FVN-ED and FVN-VQ, especially on the recall of slot tokens. FVN-VQ is able to precisely generate slot tokens from the CMR, but it cannot generate all the required slot tokens, while FVN can generate them with high precision and substantially higher recall. An explanation is that the latent vectors in the content codebook only memorize the representations of texts without generalizing properly to new CMRs. Since FVN is able to generate text containing most of the required slots, its text is usually longer than FVN-VQ's, which also explains why FVN performs better than FVN-VQ on METEOR and ROUGE-L, which evaluate the recall of n-grams, and suggests that all encoders and codebooks are indeed needed for obtaining high performance.
The comparison between FVN and FVN-EVQ shows how in some cases FVN-EVQ has higher quality, but FVN obtains better scores on correctness and style, suggesting the additional decoder improves conveyance sacrificing some fluency.
In Table 7, we compare our proposed model and variants against the best performing models in the E2E challenge: TGEN (Novikova et al., 2017), SLUG (Juraska et al., 2018), and Thomson Reuters NLG (Davoodi et al., 2018; Smiley et al., 2018).

Table 11: Diversity in FVN-generated PersonageNLG examples (extravert style; CMR slots: Name, EatType, Food, PriceRange, CustomerRating, Area, FamilyFriendly, Near). Given the CMR and style, the generated text varies depending on the vector sampled from the codebook.

Same e^C_k, different e^S_n:
- "let 's see what we can find on Name SLOT . yeah, it is FamilyFriendly SLOT with a CustomerRating SLOT rating , it is a EatType SLOT , it is a Food SLOT place in Area SLOT , it is pricerange SLOT near Near SLOT ."
- "i do n't know . Name SLOT is a EatType SLOT with a CustomerRating SLOT rating , also it is a FamilyFriendly SLOT , Area SLOT , and it is a Food SLOT place near Near SLOT , also it has a price range of pricerange SLOT ."
- "Name SLOT is a EatType SLOT , it is a FamilyFriendly SLOT , it 's a Food SLOT place , it is near Near SLOT , it has a CustomerRating SLOT rating , you know pal! it is in Area SLOT and has a price range of pricerange SLOT ."

Different e^C_k, same e^S_n:
- "Name SLOT is a EatType SLOT with a CustomerRating SLOT rating , also it is a Food SLOT place , you know ! and it is Area SLOT , also it is FamilyFriendly SLOT near Near SLOT , also it has a price range of pricerange SLOT ."
- "Name SLOT is a EatType SLOT , it is a FamilyFriendly SLOT , it 's a Food SLOT place , it is near Near SLOT , it has a CustomerRating SLOT rating , you know and it is in Area SLOT and pricerange SLOT ."
- "Name SLOT is a EatType SLOT , it is a Food SLOT place , it is FamilyFriendly SLOT , it 's in Area SLOT , it is near Near SLOT , it has a CustomerRating SLOT rating and a price range of pricerange SLOT, you know! ."
Table 12: Linguistic patterns associated with the top codes of each style.
- agreeable: "let 's see what we can find on", "well , i see", "did you say ?", "i suppose", "right", "okay ?", "you see ?", "it is somewhat"
- disagreeable: "oh god i mean , everybody knows", "oh god", "i do n't know .", "i am not sure ."
- conscientious: "let 's see what we can find on", "well , i see", "did you say", "sort of", "you see ?", "let 's see ,", "..."
- unconscientious: "oh god i , i do n't know .", "darn", "i mean .", "i ... i , i do n't know .", "i mean , i am not sure .", "damn", "!", "it has like a"
- extravert: "oh god i am not sure .", "let 's see ,", "...", "alright ?", "yeah", "i do n't know", "did you say ?", "you know !", "you know and", "pal", "!"

We can see from the results that FVN performs better than all these state-of-the-art models. The reason for the low performance of the CVAE-based methods on the E2E dataset is that the CMRs are disjoint in the train and test sets (while in PersonageNLG they overlap), and CVAEs struggle to handle unseen CMRs. FVN performs well because it builds focused variations for each attribute independently instead of for the entire CMR. Table 11 shows texts generated by FVN under the same CMR (with 8 attributes, rare in the training data) and the extravert style. The first three samples have the same CMR latent vector but different sampled style latent vectors. The remaining three examples have different sampled CMR latent vectors but the same style latent vector. In the first three examples, the generated texts and the words representing the extravert style are different ("let 's see what we can", "I don't know", "you know"). In the latter three examples, the words representing style are similar ("you know"), but the aggregation of attributes is different. These examples suggest that the two codebooks learn disjoint information and that the sampling mechanism introduces the desired variation in the generated texts.
Table 11 shows that FVN learns disjoint content and style codebooks and that the vectors in the codebook can be explicitly interpreted by sampling multiple texts and observing the generated patterns. This is useful because, beyond sampling correct style vectors, we can select the realization of a style we prefer (Table 12 shows linguistic patterns associated with the top codes of each style). These patterns are automatically learnt and suggest that there is no need to encode them with manual features. Conditional VAEs do not provide this capability.
Samples obtained providing the same CMR and style to different models and examples of the linguistic patterns learned by FVN's style codebook are provided in Appendix D. Diverse samples obtained from FVN by sampling different latent codes are shown in Table 15.

Conclusion
In this paper, we studied the task of controlling language generation, with a specific emphasis on content conveyance and style variation. We introduced FVN, a novel model that overcomes the limitations of previous models, namely the lack of conveyance and the lack of diversity of the generated text, by adopting disjoint discrete latent spaces for each of the desired attributes. Our experimental results show that FVN achieves state-of-the-art performance on the PersonageNLG and E2E datasets and that the generated texts are comparable to ground-truth ones according to human evaluators.

https://github.com/neural-dialogue-metrics

A Dataset details

Table 13 shows details of the PersonageNLG dataset, while Table 14 shows details of the E2E dataset.

B Baselines details
The first three baselines are taken from (Oraby et al., 2018) and adopt the TGen architecture (Dušek and Jurcicek, 2016b), an encoder-decoder network, with different kinds of input.
TOKEN adds a token of additional supervision to encode personality. Unlike other works that use a single token to control the generator's output (Hu et al., 2017), the personality token encodes several different parameters that define style.
CONTEXT introduces a context vector that explicitly encodes a set of 36 manually defined style parameters as a vector of binary values. We then apply these style encoding approaches to three state-of-the-art models taken from (Harrison et al., 2019), which extend (Oraby et al., 2018) by changing the basic encoder-decoder network to OpenNMT-py (Klein et al., 2017) in the following ways.
m1 inserts style information into the sequence of tokens that constitutes the content c; m2 incorporates style information into the content encoding process by concatenating the style representation with the content representation before passing it to the content encoder; m3 incorporates style information into the generation process through additional inputs to the decoder: at each decoding step, the style representation is concatenated with each word's embedding and passed as input to the decoder.
token-m means that style (personality here) is encoded with a single token; context-m means that style is encoded via the 36 parameters.
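The three injection points can be made concrete with a minimal sketch. This is a pure-Python illustration under our own assumptions (dimensions, example tokens, and function names are ours, not Harrison et al.'s code), where list concatenation stands in for vector concatenation:

```python
STYLE_DIM, WORD_DIM = 4, 8  # hypothetical dimensions

def m1_inject(style_token, content_tokens):
    """m1: insert style as an extra token in the content token sequence."""
    return [style_token] + content_tokens

def m2_inject(style_vec, content_vec):
    """m2: concatenate the style representation with the content
    representation before it is passed to the content encoder."""
    return style_vec + content_vec  # list concatenation = vector concat here

def m3_inject(style_vec, word_embeddings):
    """m3: at each decoding step, concatenate the style representation
    with the current word embedding fed to the decoder."""
    return [style_vec + emb for emb in word_embeddings]

tokens = ["name[The_Phoenix]", "food[French]"]
style = [1.0, 0.0, 1.0, 0.0]  # e.g. a (truncated) binary style vector
embeddings = [[0.0] * WORD_DIM for _ in range(len(tokens))]
```

Under this view, token-m feeds a single style token through `m1_inject`-style insertion, while context-m passes the 36-parameter binary vector through the `m2_inject`/`m3_inject`-style concatenations.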
TGEN (Novikova et al., 2017) adopts a seq2seq model with attention (Bahdanau et al., 2015), with added beam search and a reranker penalizing outputs that stray from the input CMR.
SLUG (Juraska et al., 2018) adopts a seq2seq-based ensemble that uses LSTMs/CNNs as encoders and an LSTM as the decoder, with heuristic slot-aligner reranking and data augmentation. Both TGEN and SLUG use partially de-lexicalized texts ('name' and 'near' slots).
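As an illustration of how such heuristic slot-aligner reranking works, consider the toy approximation below. It is our own sketch, not SLUG's actual aligner: `slot_errors` simply counts CMR slot values whose surface form is missing from a candidate, and candidates are sorted by fewer errors first, then by model log-probability.

```python
def slot_errors(cmr, text):
    """Count CMR slot values that never appear in the candidate text."""
    text = text.lower()
    return sum(1 for value in cmr.values() if value.lower() not in text)

def rerank(cmr, candidates):
    """candidates: list of (log_prob, text). Prefer fewer slot errors,
    breaking ties by higher model log-probability."""
    return sorted(candidates, key=lambda c: (slot_errors(cmr, c[1]), -c[0]))

cmr = {"name": "The Phoenix", "food": "French", "area": "city centre"}
cands = [
    (-1.0, "The Phoenix is a pub in the city centre."),            # misses food
    (-2.5, "The Phoenix serves French food in the city centre."),  # covers all
]
best = rerank(cmr, cands)[0][1]  # the full-coverage candidate wins
```

A real aligner would match slot values through de-lexicalization and paraphrase lists rather than raw substring lookup, but the reranking principle is the same.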

C Human evaluation details
Crowdworkers were presented with a personality and two sentences (one ground truth, the other generated by our model) in random order, and were asked to answer the following two questions:
• Question A: On a scale of 1-3, how grammatical or natural is this sentence? (Please answer for both sentences.)
• Question B: Which of these two sentences do you think would be more likely to be said by a(n) ___ person? (The blank was filled with the given personality, e.g. agreeable.) Answers: Sentence 1, Sentence 2, equally.

Question A asked the crowdworkers to assess the degree of grammaticality / naturalness of a sentence, while Question B was designed to evaluate which of the two sentences exhibits a specific personality.

Table 15: texts generated by FVN trained on E2E for the same CMR, varying which slot-value code is kept fixed.
- Top frequent code of each value: "The Phoenix is a french pub near Crowne Plaza Hotel in the city centre . It is not children friendly and has a price range of more than £30 and has a customer rating of 5 out of 5 ."
- Same area[city centre] code, other values' codes sampled: "The Phoenix is a pub in the city centre . It is a french food . It is located in the city centre ." / "The Phoenix is a pub in the city centre . It is a french food . It is a high price range and is not child friendly ."
- Same customer rating[5 out of 5] code, other values' codes sampled: "The Phoenix is a french pub located in the city centre . It is a high customer rating and is not children friendly ." / "The Phoenix is a pub in the city centre near Crowne Plaza Hotel . It is a high customer rating and is not children friendly ."
- Same eatType[pub] code, other values' codes sampled: "The Phoenix is a french pub near Crowne Plaza Hotel in the city centre . It is not children friendly and has a price range of more than £30 ." / "The Phoenix is a french pub in the city centre near Crowne Plaza Hotel . It is not child friendly and has a high price range and a customer rating of 5 out of 5 ."
- Same familyFriendly[no] code, other values' codes sampled: "The Phoenix is a french pub located in the city centre near Crowne Plaza Hotel . It is not family-friendly and has a customer rating of 5 out of 5 ." / "The Phoenix is a french pub located in the city centre . It is not family-friendly and has a customer rating of 5 out of 5 ."
- Same food[French] code, other values' codes sampled: "The Phoenix is a french pub in the city centre . It is a high customer rating and is not children friendly ." / "The Phoenix is a french pub located in the city centre . It is not family-friendly ."
- Same priceRange[more than £30] code, other values' codes sampled: "The Phoenix is a french pub in the city centre . It is not children friendly and has a price range of more than £30 ." / "The Phoenix is a french pub near Crowne Plaza Hotel in the city centre . It is not children friendly and has a price range of more than £30 ."

We report the results of Question A in Table 9. For each sentence, we averaged the scores across the three judges and conducted a paired t-test between the ground truth and our model for each personality. The results show that the FVN sentences were considered significantly more grammatical / natural for conscientiousness and disagreeableness, that the ground truth sentences were better for agreeableness and unconscientiousness, and that no difference was found for extravert. The overall performance of FVN is very close to the ground truth (2.81 vs. 2.9), which suggests that FVN can generate text of comparable fluency to the ground truth texts.
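The significance testing above can be reproduced in outline with a dependency-free sketch (the scores below are invented for illustration; the actual judge scores are not reported here): average each sentence's judge scores, then compute the paired t statistic on the matched model / ground-truth pairs.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic over matched samples (model vs. ground truth)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical judge-averaged naturalness scores for four test items.
fvn_scores = [3.0, 2.0, 3.0, 2.0]
truth_scores = [2.0, 2.0, 2.0, 1.0]
t = paired_t(fvn_scores, truth_scores)  # positive t favors FVN
```

The statistic is then compared against the t distribution with n-1 degrees of freedom to obtain the p-value.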
We evaluated Question B using a majority vote of the three crowdworkers. Table 10 shows the percentage frequency distribution for each personality and for the entire test set. We found that our FVN model performs better than the ground truth on agreeableness and conscientiousness, while the ground truth is better for the other three personalities. Specifically, 53% and 67% of the time, the crowdworkers judged the agreeable and conscientious sentences generated by our model to be better than the ground truth sentences. This finding is surprising, since we consider the ground truth to be an upper bound in this task, yet our model outperforms it on two of the five personalities. One possible explanation for why FVN only performs better on agreeableness and conscientiousness is that their language patterns are more systematic and thus easier for the model to learn. In Table 10 we also report a column showing the percentage frequency of texts where the judgment was equal or in favor of FVN; underlined rows mark the cases where the number of equal or FVN-favorable judgments exceeds the number of judgments preferring the ground truth text. Overall, 50.29% of the time human evaluators considered FVN-generated text equal to or better than the ground truth at conveying personality, which suggests that FVN can generate text with conveyance comparable to ground truth texts.

Table 15 shows generated examples from FVN trained on E2E. Given a CMR, we sample a code for each slot value. The first part shows the text generated using the most frequent code for each slot value; the text is fluent and conveys the CMR precisely. In the remaining part, we keep one slot-value's code fixed while the remaining slot codes are sampled. The fixed slot-value is always present, but some of the other slot-values are missing from the generated text.
One explanation is that, in the training data, the text associated with a CMR can also omit some values, and the codebook memorizes this behavior.