Sparse Text Generation

Current state-of-the-art text generators build on powerful language models such as GPT-2, which have impressive performance. However, to avoid degenerate text, they require sampling from a modified softmax, via temperature parameters or ad-hoc truncation techniques, as in top-$k$ or nucleus sampling. This creates a mismatch between training and testing conditions. In this paper, we use the recently introduced entmax transformation to train and sample from a natively sparse language model, avoiding this mismatch. The result is a text generator with favorable performance in terms of fluency and consistency, fewer repetitions, and n-gram diversity closer to human text. In order to evaluate our model, we propose three new metrics that are tailored for comparing sparse or truncated distributions: $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence. Human-evaluated experiments in story completion and dialogue generation show that entmax sampling leads to more engaging and coherent stories and conversations.


Introduction
The last few years have seen tremendous progress in neural language modeling, with a strong impact on key NLP tasks such as machine translation (Sutskever et al., 2014; Vaswani et al., 2017), speech recognition (Chorowski et al., 2015), summarization (Rush et al., 2015), and open-ended text generation (Radford et al., 2019). Benefiting from large amounts of data, models such as Transformer-XL (Dai et al., 2019) and GPT-2 (Radford et al., 2019) have achieved impressive perplexity scores on language modeling. However, the generated text is still often repetitive and incoherent (Table 1).
A conundrum with current approaches is the mismatch between training and testing conditions: models are usually trained to maximize the likelihood of observed text. However, when generating, state-of-the-art models sample from a truncated and renormalized softmax distribution (Fan et al., 2018; Holtzman et al., 2019). They do so as a compromise to avoid two extremes: a deterministic search for the most probable sentence (via greedy decoding or beam search) usually results in dull and repetitive "degenerate text" (Li et al., 2016a, 2017; Holtzman et al., 2019). Stochastically sampling from the full softmax distribution, on the other hand, often generates many implausible words from the tail of the distribution (Fan et al., 2018). The recently proposed nucleus sampling approach (Holtzman et al., 2019) sets the truncation point based on the cumulative distribution function, i.e., considers the top words with a cumulative probability ≤ P. This approach permits better control of the number of words being generated than previous work, such as top-k sampling (Fan et al., 2018), which samples from the k most probable words. However, the "sparsity" introduced by both methods is artificially imposed at run time, not learned during training.

Table 1
Context: "Are they ready to go live?" Zuko asked. "Yup, in just a minute" Tomi replied as he leaned back in his reclining chair. This was a moment that the entire galaxy would see. After 10 years...
Greedy: ... of study, the scientists had finally come to the conclusion that the universe was not a place for living. The scientists had been studying the universe for over a century. They had been studying the universe for over a century.
Top-k: ... of studying this phenomenon, the team had finally come up with a plan to study what the aliens were doing. The team would be looking for a way to create a space station. "And what would you call it?" "What would you call the space station?"
Nucleus: ... of diligent testing, and some test run after that, the results were well received. "they're waiting for you, because there's no way they would." Stowedly at a local wedding place, the two archaeologists spoke together.
Entmax: ... of hibernation, Tomi was back. These last days had been a significant step forward in his mission. This time, Tomi was not alone. All the empires had aligned together and the world's leadership began to openly support his mission.
A second problem is that it is hard to compare different truncation strategies. For example, we cannot easily evaluate how the resulting truncated distributions behave as language models, because the most widely used metric for language modeling, perplexity, does not handle sparse distributions. For this reason, previous works (Welleck et al., 2019; Li et al., 2019) generate from a truncated softmax, but report the perplexity of the full softmax.
In this paper, we propose a new approach, entmax sampling (§3), that eliminates the mismatch between training and test conditions, sparsity-wise. Key to our approach is the recently proposed entmax transformation (Peters et al., 2019). Entmax transforms a vector of scores into a sparse probability distribution, preventing implausible words from receiving any probability mass. Moreover, it does so natively: it comes with a well-defined loss function that allows it to learn its sparsity automatically from the data, during training. This results in a new stochastic text generator where the number of possible word types varies with the context (like nucleus sampling), but that generates by sampling directly from its output distribution (like softmax), and where the sparsity of this distribution is present during training (unlike any existing method).
As a second contribution, we propose three new metrics to support the evaluation of sparse language models (§4): $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence. We show that these metrics are well supported theoretically and can be used to compare our method with various truncation and temperature techniques.
Experiments in language modeling, story completion, and dialogue generation (§5) show that entmax sampling generates more diverse text and fewer repetitions than nucleus and top-k sampling.

Related work
Decoding methods. While greedy decoding and beam search are popular strategies for sequence-to-sequence tasks, such as machine translation, Knowles et al. (2016) and Stahlberg and Byrne (2019) showed that searching for the most probable sentence in a model trained with likelihood maximization has a bias for short sentences. In open-ended generation, Fan et al. (2018) and Holtzman et al. (2018, 2019) have shown that these methods lead to repetitions and dull text. To overcome this, several authors proposed ways of altering the beam search method in order to promote word diversity (Li et al., 2016b; Vijayakumar et al., 2018; Kulikov et al., 2018).
An alternative to deterministic text generation is to sample directly from the softmax distribution. However, softmax is a strictly positive (dense) transformation. Since the probability mass tends to accumulate in a long tail, this procedure tends to generate very unlikely words too often, leading to degenerate text (Fan et al., 2018;Holtzman et al., 2019). This can be mitigated by lowering the softmax temperature (Ficler and Goldberg, 2017), by sampling from the top-k most probable words only (Fan et al., 2018;Radford et al., 2019), and through nucleus sampling (Holtzman et al., 2019).
Diversity-promoting models. In addition to new decoding methods, models that aim to increase word diversity and diminish repetition have also been introduced. Xu et al. (2018) proposed a diversity-promoting generative adversarial network, which rewards novel and fluent text. Holtzman et al. (2018) proposed augmenting the language model with several discriminators. More recently, Welleck et al. (2019) proposed augmenting log-likelihood loss with an unlikelihood loss that penalizes the generation of tokens that are present in the context. These alternatives can be applied jointly with entmax sampling.
Sparse transformations and losses. At the core of our work are sparse alternatives to the softmax transformation. Martins and Astudillo (2016) proposed sparsemax and applied it to multi-label classification. This was generalized by Peters et al. (2019) via their α-entmax transformation, which was applied to sequence-to-sequence models for morphological inflection and machine translation. In contrast to our work, they performed deterministic decoding with beam search.
Evaluation metrics. The most common metrics to evaluate text generation models are perplexity (Jelinek et al., 1977) and BLEU (Papineni et al., 2002). For open generation, Zhu et al. (2018) observed that "no single metric is comprehensive enough." Other evaluations include corpus n-gram overlap (Yu et al., 2017; Press et al., 2017), Fréchet distance (Cífka et al., 2018), and sweeping the softmax temperature to assess model robustness (Caccia et al., 2018). These approaches are aimed at the (harder) problem of evaluating the quality of generated text. By contrast, our paper proposes new metrics for evaluating language models in the task of predicting the next word conditioned on ground truth context (like perplexity does), but supporting sparse probability distributions (which perplexity does not).

Language Modeling
Language models assign probability to word sequences $x = \langle \text{START}, x_1, \dots, x_T, \text{STOP} \rangle$, where each $x_t$ is in a vocabulary $\mathcal{V}$, and $T \in \mathbb{N}$. This probability can be written as

$$p_\theta(x) = \prod_{t=1}^{T+1} p_\theta(x_t \mid x_{<t}). \quad (1)$$

We would like the model $\theta$ to assign high probability to real sentences, i.e., each distribution $p_\theta(\cdot \mid x_{<t})$ should assign a large probability value to the ground truth $x_t$.
Given a set $\mathcal{S}$ of training sentences, the usual strategy for learning the language model parameters $\theta$ is to minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{|\mathcal{S}|} \sum_{t=1}^{T_i+1} \log p_\theta(x_t^i \mid x_{<t}^i),$$

where $x^i$ denotes the $i$-th sentence in $\mathcal{S}$ and $T_i$ its length. The standard choice to model $p_\theta(\cdot \mid x_{<t})$ in Eq. 1 is to compute a score vector $z_t$ by conditioning on the context $x_{<t}$, and then applying a softmax transformation, $p_\theta(\cdot \mid x_{<t}) = \mathrm{softmax}(z_t)$, where

$$\mathrm{softmax}(z_t) = \frac{\exp(z_t)}{\sum_j \exp(z_{tj})}.$$

At decoding time, the language model generates sentences one word at a time, by sampling from the learned probability distribution. However, softmax yields a dense distribution, i.e., some probability mass (even if small) is assigned to all the words in the vocabulary. Holtzman et al. (2019, §4) have shown that, if we sample from this distribution directly, the resulting text becomes degenerate, with common incoherences arising due to the unreliability of the tail of the distribution. This motivated a line of work proposing "ad-hoc" modifications to the softmax distribution, to reduce the effect of the tail. Two of the most successful techniques, top-k and nucleus sampling (Fan et al., 2018; Holtzman et al., 2019), do so by truncating and renormalizing the distribution $p_\theta(\cdot \mid x_{<t})$. Note that these techniques are applied only at decoding time: during training the original softmax distribution is left untouched, being used as part of the optimization of the cross-entropy loss.
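The two truncation schemes just described can be sketched in a few lines. The snippet below is only an illustration, not the implementation used in the experiments: the function names are hypothetical, and the nucleus variant assumes the common convention of keeping the smallest set of top words whose cumulative mass reaches P.

```python
import math

def softmax(scores):
    """Dense softmax over a list of scores (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_truncate(probs, k):
    """Keep the k most probable entries, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out

def nucleus_truncate(probs, p):
    """Keep the smallest set of top entries whose cumulative mass reaches p
    (assumed convention), zero out the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / cum
    return out
```

Both functions act on a distribution that was trained as a dense softmax, which is exactly the training/decoding mismatch discussed above: the zeros appear only at run time.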
Our alternative to these ad-hoc modifications builds on sparse transformations, such as sparsemax (Martins and Astudillo, 2016) and, more generally, α-entmax (Peters et al., 2019). These transformations have the ability to inherently produce sparse probability distributions (i.e., their tails are short). Therefore, sampling from these distributions directly is a natural way to prevent degenerate text.
With these transformations, the model is trained by minimizing the sum, over the training words, of $\ell_\alpha(z_t, x_t)$, where $\ell_\alpha(z_t, x)$ is the α-entmax loss:

$$\ell_\alpha(z_t, x) := (p_\theta - e_x)^\top z_t + H_\alpha(p_\theta), \quad (6)$$

where $p_\theta = \alpha\text{-entmax}(z_t)$, $H_\alpha$ is the Tsallis α-entropy that defines α-entmax (Peters et al., 2019), and $e_x$ is the one-hot indicator vector that corresponds to the ground truth word $x$. When $\alpha = 1$, we recover the negative log-likelihood, $\ell_\alpha(z_t, x) = -\log p_\theta(x)$, and, when $\alpha = 2$, this corresponds to the sparsemax loss (Martins and Astudillo, 2016), which we will revisit in §4. When using the α-entmax loss with $\alpha > 1$, we can eliminate the mismatch between training and run time conditions, since the probability distributions evaluated by the loss are also sparse. The entmax loss belongs to the wider class of Fenchel-Young losses (Blondel et al., 2019) and, consequently, is convex in $z$, differentiable (with gradient $\nabla_z \ell_\alpha(z, x) = -e_x + p_\theta$), and for $\alpha > 1$ has a separation margin: the loss becomes zero when the score of the correct class is separated from the rest by a margin of $\frac{1}{\alpha - 1}$. Separation is achieved if and only if $p_\theta = e_x$, i.e., when the model puts all its probability mass in the correct word. This allows the model to be adaptive to the degree of uncertainty present: in some cases there are few plausible words, so most words should have probability zero, while in other cases a higher number of words are plausible and should be given probability mass.
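To make the separation margin concrete, the following sketch evaluates the loss for the special case α = 2 (sparsemax), where the margin 1/(α − 1) equals 1. It is an illustration under stated assumptions rather than the training code: sparsemax is computed with the sorting procedure of Martins and Astudillo (2016), the Gini entropy is assumed to take the form H_2(p) = ½ Σ_j p_j (1 − p_j), and all function names are hypothetical.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    zs = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        if 1 + j * v > cumsum:        # index j is still inside the support
            tau = (cumsum - 1) / j    # threshold for a support of size j
    return [max(v - tau, 0.0) for v in z]

def gini_entropy(p):
    """Assumed form of the Gini entropy: 0.5 * sum_j p_j * (1 - p_j)."""
    return 0.5 * sum(v * (1 - v) for v in p)

def sparsemax_loss(z, x):
    """Loss for alpha = 2: (p - e_x)^T z + H_2(p), with p = sparsemax(z)."""
    p = sparsemax(z)
    dot = sum((p[j] - (1.0 if j == x else 0.0)) * z[j] for j in range(len(z)))
    return dot + gini_entropy(p)

# The correct word (index 0) leads by at least 1: all mass goes to it, loss is zero.
separated = sparsemax_loss([2.0, 0.5, 0.0], x=0)
# The lead is only 0.5: two words stay in the support and the loss is positive.
not_separated = sparsemax_loss([1.0, 0.5, 0.0], x=0)
```

In the second call, the third word still receives exactly zero probability, which illustrates how the support size adapts to the scores.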

Entmax Sampling
We propose entmax sampling as an alternative to top-k sampling and nucleus sampling (Holtzman et al., 2019). We sample from the categorical distribution obtained by applying the entmax transformation to the scores $z_t$ given by the model:

$$x_t \sim p_\theta(\cdot \mid x_{<t}) = \alpha\text{-entmax}(z_t).$$

We sample directly from the learned sparse probability distribution over the words, without any ad-hoc modification. Therefore, this sparsity is not artificially imposed at run time; rather, it is native to the entmax transformation and learned during training. As in nucleus sampling and in contrast to top-k sampling, entmax sampling considers a varying number of tokens depending on the context. Moreover, as we show in Table 3, with entmax sampling this variability is higher.
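As a rough illustration of the sampling step, the sketch below computes α-entmax by bisection on the threshold τ in the closed form p_j = [(α − 1) z_j − τ]_+^{1/(α − 1)} given by Peters et al. (2019), and then draws a word index from the resulting sparse distribution. The function names, the bisection routine, and the default α = 1.5 are assumptions made for this example; they are not taken from the experimental setup.

```python
import random

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a score list z (alpha > 1), via bisection on the threshold tau."""
    assert alpha > 1
    s = [(alpha - 1) * v for v in z]
    # At tau = max(s) - 1 the total mass is >= 1; at tau = max(s) it is 0.
    lo, hi = max(s) - 1.0, max(s)
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        total = sum(max(v - tau, 0.0) ** (1 / (alpha - 1)) for v in s)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    p = [max(v - lo, 0.0) ** (1 / (alpha - 1)) for v in s]
    norm = sum(p)  # very close to 1; renormalize to remove bisection error
    return [v / norm for v in p]

def entmax_sample(z, alpha=1.5, rng=random):
    """Draw one word index directly from the sparse entmax distribution."""
    p = entmax_bisect(z, alpha)
    return rng.choices(range(len(z)), weights=p, k=1)[0]
```

Words whose probability is exactly zero can never be drawn, so no truncation step is needed after the transformation.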

Evaluation Metrics
Language models are commonly evaluated by computing their perplexity (ppl) on held-out data. Perplexity assesses the ability of a language model to predict the next word given the context:

$$\mathrm{ppl} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right).$$

However, its computation involves the logarithm of the predicted probability distribution over the words. This poses a problem when we are using sparse or truncated probability distributions, since we have $\lim_{p \to 0} \log p = -\infty$. Usually, authors report the values for perplexity computed on the original probability distribution, before truncation. However, this metric does not allow different sparse decoding strategies to be compared. As an alternative, we propose three different metrics (to facilitate understanding of these metrics, comparative plots are shown in Fig. 2, App. B).
$\epsilon$-perplexity. To be able to compute the perplexity for sparse distributions, the simplest approach is to smooth it by adding a small value $\epsilon$ to all terms followed by renormalization, as in additive (Laplace) smoothing (Chen and Goodman, 1999):

$$\epsilon\text{-ppl} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log \frac{p_\theta(x_t \mid x_{<t}) + \epsilon}{1 + \epsilon |\mathcal{V}|}\right). \quad (9)$$

The value of $\epsilon$ can be tuned for each method. A disadvantage of $\epsilon$-ppl is that it still does not evaluate the original sparse distribution, but rather a modified version of it. However, when applied to variants of truncated softmax, by collapsing all the truncated probabilities to the same value $\epsilon$, it is useful to measure how much truncation deteriorates its ability to rank words, compared to softmax.
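A minimal sketch of this smoothed perplexity is given below, assuming the additive smoothing takes the form (p + ε) / (1 + ε|V|), where |V| is the vocabulary size. The function name and argument layout are hypothetical; the input is simply the list of probabilities that a model assigned to each ground truth word.

```python
import math

def epsilon_perplexity(gold_probs, vocab_size, eps):
    """Perplexity after adding eps to every vocabulary entry and renormalizing.

    gold_probs: probability given to each ground truth word (may contain zeros).
    """
    log_sum = sum(
        math.log((p + eps) / (1 + eps * vocab_size)) for p in gold_probs
    )
    return math.exp(-log_sum / len(gold_probs))
```

With eps = 0 this reduces to ordinary perplexity and fails on zero probabilities (math.log raises an error), whereas any eps > 0 yields a finite value for sparse or truncated distributions.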
Sparsemax score. We can derive a more interesting metric that handles sparse distributions directly. By setting $\alpha = 2$ in Eq. 6, we obtain the sparsemax loss proposed by Martins and Astudillo (2016), $\ell_2(z_t, x) = (p_\theta - e_x)^\top z_t + H_2(p_\theta)$. We define the sparsemax score (sp) as:

$$\mathrm{sp} = 1 - \min\{\ell_2(z, x) \mid \mathrm{sparsemax}(z) = p_\theta\} = p_\theta(x) + H_2(p_\theta), \quad (10)$$

where $H_2$ is the Gini entropy (see footnote 2). Unlike perplexity, this score is bounded. In fact, it is always between 0 (when $p_\theta = e_{x'}$ with $x' \neq x$) and 1 (when $p_\theta = e_x$). We prove this fact in App. A. Interestingly, when the model $p_\theta$ is deterministic (e.g., when it comes from greedy search), we have $H_2(p_\theta) = 0$, and the sparsemax score simply becomes the word accuracy. In the opposite case, when $p_\theta$ is uniform, we obtain $\mathrm{sp} = \frac{1}{|\mathcal{V}|} + \frac{1}{2}\left(1 - \frac{1}{|\mathcal{V}|}\right)$, which tends to $\frac{1}{2}$ as $|\mathcal{V}| \to \infty$.

Jensen-Shannon Divergence. Given two discrete probability distributions $p_\theta$ and $q$, and denoting their mixture (arithmetic mean) as $m := \frac{p_\theta + q}{2}$, and the Kullback-Leibler divergence as KL, the Jensen-Shannon divergence is defined as:

$$\mathrm{JS}(p_\theta, q) = \frac{1}{2} \mathrm{KL}(p_\theta \| m) + \frac{1}{2} \mathrm{KL}(q \| m).$$

The Jensen-Shannon divergence can be interpreted as a mutual information as follows (Grosse et al., 2002; Banerjee et al., 2005): consider a two-step process where we first toss a fair coin $B \sim \mathrm{Bernoulli}(\frac{1}{2})$. If the outcome is heads, we sample the next word $X$ according to the model $p_\theta(\cdot)$; if it is tails, we sample $X \sim q(\cdot)$. A word generated according to this process is governed by the mixture $m(\cdot)$, $X \sim m(\cdot)$. The Jensen-Shannon divergence between $p_\theta$ and $q$ is the mutual information between the random variables $B$ and $X$, which equals $H(B) - H(B \mid X)$, where $H(B)$ is the entropy of $B$ and $H(B \mid X)$ is the conditional entropy. Hence, the Jensen-Shannon divergence can be seen as the reduction of uncertainty about the source $B$ when we observe a sample $x$ from the mixture $m(\cdot)$. The more similar the two distributions $p_\theta$ and $q$ are, the smaller this reduction is.
In our experiments, we report the JS as an evaluation metric for language models, setting $q = e_x$ (i.e., a one-hot distribution placed on the ground truth word $x$) and averaging the JS over the words. Like the sparsemax score described above, the JS is bounded: it is zero if $p_\theta = e_x$, and maximal ($\log 2$) when $p_\theta$ is a one-hot distribution placed on a different word.
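The two bounded metrics can be checked on small distributions with the sketch below. It assumes the sparsemax score has the closed form p(x) + H_2(p) with Gini entropy H_2(p) = ½ Σ_j p_j (1 − p_j), and it computes the JS against a one-hot target directly from the KL-based definition; the function names are hypothetical.

```python
import math

def sparsemax_score(p, x):
    """Assumed closed form: probability of the gold word plus the Gini entropy."""
    return p[x] + 0.5 * sum(v * (1 - v) for v in p)

def kl(p, q):
    """KL(p || q), skipping zero entries of p (0 * log 0 is taken as 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same words."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_to_gold(p, x):
    """JS between the model distribution p and a one-hot target on word x."""
    one_hot = [1.0 if j == x else 0.0 for j in range(len(p))]
    return js_divergence(p, one_hot)
```

Because the mixture m is positive wherever p or q is, both KL terms stay finite even when p contains exact zeros, which is what makes the JS usable for sparse and truncated distributions.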
Comparing multiple models. The generalized JS allows us to compare two or more trained models:

$$\mathrm{JS}(p_1, \dots, p_K) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{KL}(p_k \| m),$$

where $p_1, \dots, p_K$ are the probability distributions of the different models and $m = \frac{1}{K} \sum_{k=1}^{K} p_k$ is their mixture. This property can be useful for measuring the diversity between multiple models (e.g., when used in an ensemble system). We use this metric in App. D to rank the sentences in which the different models we compare disagree the most.
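A short sketch of the multi-model case follows, assuming the generalized JS is the average KL divergence from each model's distribution to the uniform mixture of all of them. The function names are hypothetical, and each distribution is a plain list over the same vocabulary.

```python
import math

def kl(p, q):
    """KL(p || q), skipping zero entries of p (0 * log 0 is taken as 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_js(dists):
    """Average KL from each of K distributions to their arithmetic mean."""
    k = len(dists)
    mixture = [sum(col) / k for col in zip(*dists)]
    return sum(kl(p, mixture) for p in dists) / k
```

The value is zero when all models agree and reaches log K when the K models place all their mass on K different words, so sorting positions by this value surfaces the contexts where the models disagree the most.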

Experiments
We compare the different decoding methods in three NLP tasks: language modeling (§5.1), story completion (§5.2), and dialogue generation (§5.3). In language modeling, we evaluate the model's fluency, while in story completion we also evaluate if the methods generate coherent and "interesting" text. In dialogue generation, we evaluate the methods' performance in an interactive task.

Language Modeling
Datasets and metrics. We performed experiments on three widely used language modeling datasets: WikiText-2 and WikiText-103 (Merity et al., 2016), and BookCorpus (Zhu et al., 2015). WikiText-2 and WikiText-103 are composed of Wikipedia articles, comprising around 2 and 100 million tokens for training, respectively. Their validation and test sets have 217,000 and 245,000 tokens, respectively. BookCorpus is composed of 11,038 freely available books. We used the standard split: 800 million tokens for training, 260,000 for validation, and 280,000 for testing.
We report the sparsemax score, Jensen-Shannon divergence, and $\epsilon$-perplexity (§4) to evaluate the methods' fluency, and the REP and WREP metrics to measure repetitions. We treat $\epsilon$ as a hyperparameter for each model, tuned on the validation sets (we report the $\epsilon$ for each model in Table 8, App. C). For softmax, with and without decreased temperature, we set $\epsilon = 0$.
Automatic Metrics Results. As can be seen in Table 2, entmax sampling consistently outperforms every other decoding method in sparsemax score and number of repetitions. Additionally, entmax sampling leads to better $\epsilon$-perplexity scores than greedy decoding, top-k sampling, nucleus sampling, and softmax-τ, while having scores similar to softmax sampling. Concerning the Jensen-Shannon divergence scores, top-k sampling has the best scores. All this is achieved with a model that copes with sparsity inherently, rather than through truncation at run time.
6 We use the PyTorch re-implementation at https://github.com/huggingface/transformers.

In order to understand why entmax leads to better sparsemax scores and fewer repetitions, Table 3 shows the mean, median, standard deviation, minimum, and maximum number of tokens each decoding strategy considers when predicting the next word given a context on the WikiText-103 test set. We can see that entmax sampling and nucleus sampling consider many more tokens than greedy decoding and top-k sampling, which may explain the smaller number of repetitions. This is possible since the number of tokens is variable, going from 1 to 12,305 for nucleus sampling and from 1 to 18,399 for entmax sampling. Moreover, a possible explanation for entmax sampling outperforming nucleus sampling is that entmax's support size has higher variance: it is able to consider a higher number of tokens, while still being able to consider only one word. It is more adaptive to the context.

Story completion
Next, we analyze the model's ability to generate long sequences of text using different sampling methods. 8 We performed completion of stories from the WritingPrompts dataset (Fan et al., 2018), using the models fine-tuned on the BookCorpus dataset. WritingPrompts consists of a collection of human-written stories paired with writing prompts. We randomly selected 1,000 stories which were at least 200 words long and used the first 50 words as context for the models. Examples of stories generated with each method (Table 1 and Table 11 of App. E) show that entmax sampling leads to more interesting stories while preventing degenerate text. To measure the stories' word diversity, we show in Figure 1 the distinct-n metric 9 (Li et al., 2016a) for the stories generated by each model. It can be seen that entmax sampling leads to more diverse unique n-grams for n ∈ {1, 2, 3, 4}, closer to human-generated text. We also measured the number of unique words in the stories generated: entmax sampling generated 12,767 different words, while softmax with decreased temperature, greedy decoding, top-k, and nucleus sampling generated 9,973, 3,464, 7,852 and 11,929 words, respectively. As expected, entmax leads to higher word diversity, closer to that of human stories, which contain 15,166 different words.
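For reference, the distinct-n computation can be sketched as follows, using the definition given in footnote 9 (number of distinct n-grams divided by the total number of generated words). Whitespace tokenization and the function name are assumptions made for this illustration.

```python
def distinct_n(tokens, n):
    """Number of distinct n-grams divided by the total number of words."""
    if not tokens:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(tokens)

# Example on a repetitive fragment, split on whitespace (assumed tokenization).
tokens = "the fog was beginning to swirl again and the fog was beginning to swirl again".split()
scores = {n: distinct_n(tokens, n) for n in (1, 2, 3, 4)}
```

Repeated phrases add words to the denominator without adding new n-grams to the numerator, so degenerate, looping text receives a low score.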
Human evaluation. We performed human evaluation of greedy decoding, top-k, nucleus, and entmax sampling on completion of stories from the WritingPrompts dataset. To perform the human evaluation, we divided 100 stories into 2 sets of 50 stories. For each set of stories, 2 judges evaluated each story in 3 metrics: fluency (whether the text is syntactically and semantically correct), coherence (whether the story continuation is related to the context and is consistent), and engagement (whether the annotator felt interested in the story). Ratings were given on a 5-point scale, and the mean for each metric is reported in Table 4. Entmax sampling outperforms greedy decoding, top-k, and nucleus sampling on all three metrics. This confirms the better generation quality of entmax sampling.
8 Softmax sampling is not considered since it has been shown to generate degenerate text (Holtzman et al., 2019).
9 Distinct-n corresponds to the number of distinct n-grams divided by the total number of generated words.

Dialogue Generation
To evaluate the sampling methods in an interactive setting, we experiment with dialogue generation. Its goal is to generate an utterance, given a context consisting of the previous utterances in the dialogue and, in some cases, initial context sentences with related information describing personas, knowledge, or scenarios.
Datasets and metrics. We performed experiments with the PersonaChat dataset (Zhang et al., 2018). It is a crowd-sourced dialogue dataset in which speakers were asked to condition their utterances on predefined personas. It contains 164,356 utterances over 10,981 dialogues. As there is no public test set, we report results on the validation set. We evaluate the word $F_1$-score, $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence. As for the language modeling experiments, $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence are computed at the BPE level. We also report the distinct-n metric for n ∈ {1, 2} (Li et al., 2016a) and analyze how the models behave in dialogue simulations between two agents (Li et al., 2016c).
Automatic Metrics Results. We report the results in Table 5. Entmax again outperforms all the other sampling methods in sparsemax score and $\epsilon$-perplexity, having the lowest JS (same as top-k). Entmax also leads to fewer repetitions, having higher distinct-1 and distinct-2 scores. However, its $F_1$ score is lower (similar findings have been reported in Li et al. (2019)). This can be due to dialogue generation being an open-ended generation task that can have multiple correct answers.
Additionally, we simulated a conversation between two agents of the same model (Li et al., 2016c). We chose different personas randomly for the two agents. Then a first utterance from the PersonaChat dataset was given as context. We measured the average length of conversations, considering that the conversation is finished when utterances have an overlap of 80% or higher, when there is no response by an agent, or when it reaches 20 utterances (procedure similar to the one proposed by Li et al. (2016c)). We also measured the number of unique words used, and the distinct-n metric for n ∈ {1, 2}. As can be seen in Table 6, entmax sampling leads to longer conversations with higher word diversity and a higher number of distinct 1-grams and 2-grams.
Human evaluation. Finally, we performed human evaluation following the ConvAI2 challenge: 6 volunteers had 30 conversations each with models using the different sampling methods. The volunteers scored the conversations from 1 to 5 in terms of fluency, consistency (whether the model's utterances are coherent with its persona and the model does not contradict itself), and engagement. The models' personas were randomly selected from the PersonaChat validation set. Results are reported in Table 7. Entmax sampling outperforms the other methods in consistency and engagement, having similar scores in fluency. This means entmax sampling not only generates the most interesting conversation utterances, but also improves conversation consistency.

Conclusions
We proposed entmax sampling as a new strategy for generating text from a sparse probability distribution. It provides three main advantages: (i) it offers a natural way of sampling directly from the output probability distribution; (ii) the distribution sparsity is present during training, and, consequently, there is no distribution sparsity mismatch between training and run time; (iii) when sampling with entmax, the number of words to be considered varies with the context, as in nucleus sampling and in contrast to top-k sampling. Additionally, we proposed new metrics for evaluating language models that produce sparse and truncated probability distributions: $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence. Experiments show that entmax sampling leads to higher word diversity, fewer repetitions, and similar or improved results in automated metrics. Human evaluation confirms that entmax outperforms greedy decoding, top-k, and nucleus sampling in coherence/consistency and engagement, and is similar or better in terms of fluency.

Supplementary Material
A Proof of boundedness of the sparsemax score
We show here that the sparsemax score in Eq. 10 is always bounded between 0 and 1. The fact that $\mathrm{sp} \le 1$ simply follows from the fact (Blondel et al., 2019, Prop. 2) that any Fenchel-Young loss (which includes $\ell_2(z, x)$) is non-negative. Since $\mathrm{sp} = 1 - \min\{\ell_2(z, x) \mid \mathrm{sparsemax}(z) = p_\theta\}$, it follows that $\mathrm{sp} \le 1$. Let us see when the maximal value 1 is attained. We have:

$$\max_{p_\theta} \mathrm{sp} = \max_{p_\theta} \; p_\theta(x) + H_2(p_\theta). \quad (13)$$

Since the Gini entropy is maximized by the uniform distribution, the maximizing distribution in Eq. 13 is of the form $p_\theta = \left(1 - t, \frac{t}{|\mathcal{V}|-1}, \dots, \frac{t}{|\mathcal{V}|-1}\right)$ for $t \in [0, 1]$. Replacing in Eq. 13, with the Gini entropy written as $H_2(p) = \frac{1}{2} \sum_j p_j (1 - p_j)$, we obtain

$$\mathrm{sp} = 1 - t + \frac{1}{2}\left(1 - (1 - t)^2 - \frac{t^2}{|\mathcal{V}|-1}\right) = 1 - \frac{|\mathcal{V}|}{2(|\mathcal{V}|-1)}\, t^2.$$

This is maximized by $t = 0$, which corresponds to $p_\theta = e_x$. To see that we always have $\mathrm{sp} \ge 0$, we use the fact that the Gini entropy $H_2(p_\theta)$ is always non-negative (zero if and only if $p_\theta$ is a one-hot distribution), which is clear from the definition in footnote 2, and that $p_\theta(x) \ge 0$; therefore, the sum of these two terms is also non-negative, and zero if and only if $p_\theta = e_{x'}$ with $x' \neq x$.

Figure 2 shows comparative plots of the $\epsilon$-perplexity, sparsemax score, and Jensen-Shannon divergence, for a distribution of the form $p_\theta = \left(1 - t, \frac{t}{|\mathcal{V}|-1}, \dots, \frac{t}{|\mathcal{V}|-1}\right)$, varying $t$, with a vocabulary of 100 words.
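The behavior of the sparsemax score on this one-parameter family can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the Gini entropy to be H_2(p) = ½ Σ_j p_j (1 − p_j), places the ground truth word at index 0, and compares the directly computed score with the closed form 1 − |V| t² / (2(|V| − 1)) obtained by substituting the family into p(x) + H_2(p). The function names are hypothetical.

```python
VOCAB = 100  # vocabulary size used for the comparative plots

def family(t, vocab=VOCAB):
    """Distribution (1 - t, t/(V-1), ..., t/(V-1)); the gold word has index 0."""
    return [1 - t] + [t / (vocab - 1)] * (vocab - 1)

def sparsemax_score(p, x=0):
    """Assumed closed form: probability of the gold word plus the Gini entropy."""
    return p[x] + 0.5 * sum(v * (1 - v) for v in p)

def closed_form(t, vocab=VOCAB):
    """Result of substituting the family into p(x) + H_2(p)."""
    return 1 - vocab * t * t / (2 * (vocab - 1))

grid = [i / 10 for i in range(11)]
direct = [sparsemax_score(family(t)) for t in grid]
formula = [closed_form(t) for t in grid]
```

On this grid the score decreases monotonically from 1 at t = 0, in line with the argument that the maximum is attained only when all the mass is on the ground truth word.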

C Values of $\epsilon$ used for $\epsilon$-perplexity
The values of $\epsilon$ used for the computation of the $\epsilon$-perplexity are reported in Table 8.

D Comparison of models with the Jensen-Shannon divergence
We compared the distributions given by the different decoding methods when generating the sentences of the BookCorpus validation set with the Jensen-Shannon divergence, as described in §4. In Tables 9 and 10 we show some of the sentences with higher Jensen-Shannon divergence, as well as the probability given by each model to the ground truth word.

E Story completion examples
Examples of story completion with context from the WritingPrompts dataset using the different models fine-tuned on BookCorpus are presented in Table 11.

F Dialogue simulation examples
In order to evaluate the different decoding methods in dialogue generation, we simulated a conversation between two agents using the same decoding method. For that, we assigned each agent a randomly selected persona from the PersonaChat dataset and gave a first utterance of the dataset as context. Examples of dialogue simulations are presented in Tables 12-13 and Tables 14-15.

Context: I wake up, battered and scarred from the beat I was given behind the burger joint. I think the name was Ronnys? My memory of yesterday, better yet the last week has been completely faded. It was looking into the fog of a ...
Greedy: ... memory, and was beginning to wonder if the man was still alive. the fog was beginning to swirl, and the fog was beginning to swirl again. the fog was beginning to swirl again, and the fog was beginning to swirl again.
Top-k: ... moment of silence. "how 'd you know?" "i'm a professor, but i can't find it." "and you're just a professor, i guess." "and you 're just a professor, too." "i 'm a professor?" "yes. and you are." "i didn't mean to scare you."
Nucleus: ... match to find out how far its web is. ain't a pretty good option, huh? paul shrugged. wow, that's how much access i've got, i'm sure. ill check with them if i need to go outside in the cold. i'm sure they'll do it. fine by me.
Entmax: ... memory. my brain has become trapped in a cycle of confusion that can't be understood with common knowledge. things become warped and fake, the knowledge getting heavier and more complex. move. do something. i can't.

Context: "We are gathered here, in the sight of the Gods and in the presence of friends and family, to observe the passing of Mr. John Carlyle." The soft woosh of the beginning flames punctuates the Speaker's introduction. John was ...
Greedy: ... a former professor of physics at the university of texas. he was a graduate student at the university of texas. he was a member of the faculty of the university of texas. he was a member of the faculty of the university of texas.
Top-k: ... a very nice guy, who always liked to make fun of herself, even after her husband had died and she found him dead. she never wanted to be his wife, but he did have a strong reputation as a good friend, so she liked to make
Nucleus: ... a good man. he'd built a home with plans for much larger quantities and quickly noticed that consuming supplies did not always lead to immediate success. on each occasion he filled his stomach with an island of straw he had
Entmax: ... very interesting. he was quite quintessential romantic: blond, warm, driven; amazingly handsome; amazing with his mother; and calm; always entertaining. his childhood spent in the neighborhood of narrow north kennedy

Context: greetings from the big apple, kim here, how are you?
Top-k:
-i'm very good, thank you, i hope you have a great day
-thanks. have a good day too.
-same to you, have a good day too!
-you do the same. have a great day too.
-the same to you and have a great day also!
-i do have a very nice day, i love running.
-that's nice to enjoy, it s a great day here today
-i am a political junkie.
-you must be very busy!
-what do you do for a living?
-i'm actually a business major.
-that must be hard to do, do you like politics?
-not a big fan of politics, i like to go skiing and listen to rock
-do you play sports or something?
-i am in school for business
-oh that is awesome. i used to play football in college, but i hated it!
-oh that's why i play football!!
-wow. what is your name?
-my name is jed. i am from russia. how about you?
-i played in ny and played at boston.

Nucleus:
-hi, i'm fantastic, how are you?
-great. making some barbies for our barbie girls.
-awesome, do you have any hobbies?
-i really enjoy barbies. i live a long time away.
-i am looking forward to going to school in the fall.
-school is cool. do you sing? i love barbie songs.
-i've a pet pean i sing while eating tacos.
-oh yea, i have a small heart lol
-are you looking forward to winter? i would like to live in the big apple.
-definitely
-winter is fun! no thanks, me neither.
-you must really like summer, too lol
-thanks. it is too cold to sing in the winter. its just not for me.
-do you like tacos?
-oh my gosh. they're good. my favorite is barbie. lol
-i know, but spaghetti is my fave lol
-haha. lol
-oh , i do enjoy spaghetti , for my college graduation i was in last year of high school
-how are you ?

Entmax:
-good. i just rang someone on the other side.
-good, what did you get excited about today?
-i love paris. how was your day?
-its been crazy, i'm always feeling excited! lol
-what is your favorite thing to do?
-oh i love going hiking in the mountains, myself.
-that sound amazing. i like travelling.
-love trips, but i cannot stand staying in one place all day. lol
-do you have any hobbies ? i always want to find cool new things.
-i really like going out and nature itself, i prefer hiking
-yes, exploring parks and all that jazz when i can.
-that is awesome fun, whats your fav color?
-i love grey. roses and the mountains signify my youth.
-mine is blue, it makes me think of blueberries though
-grey denotes youth well or openness and transparency. love the kale chips.
-mmm i love chocolate . lol
-oh i am sold on chocolate. eating it off the cob
-haha
-i miss the crazy curly hair fries and crackers . haha

Table 15: Example of dialogue simulation between two agents using the different decoding methods.