Steering Output Style and Topic in Neural Response Generation

We propose simple and flexible training and decoding methods for influencing output style and topic in neural encoder-decoder based language generation. This capability is desirable in a variety of applications, including conversational systems, where successful agents need to produce language in a specific style and generate responses steered by a human puppeteer or external knowledge. We decompose the neural generation process into empirically easier sub-problems: a faithfulness model and a decoding method based on selective-sampling. We also describe training and sampling algorithms that bias the generation process with a specific language style restriction, or a topic restriction. Human evaluation results show that our proposed methods are able to to restrict style and topic without degrading output quality in conversational tasks.


Introduction
Neural encoder-decoder models have demonstrated great promise in many sequence generation tasks, including neural machine translation (Sutskever et al., 2014;Cho et al., 2014;Bahdanau et al., 2014;Luong et al., 2015;Wu et al., 2016), image captioning (Xu et al., 2015), summarization (Rush et al., 2015;Gu et al., 2016;Kikuchi et al., 2016), and conversation generation (Vinyals and Le, 2015;Sordoni et al., 2015;Shang et al., 2015;Shao et al., 2017;Li et al., 2015).These encouraging early successes have motivated research interest in training more natural-sounding conversational systems based on large volumes of open-domain human-to-human interactions.In order to create more human-like patterns of conversation, the agents need to have recognizable (and tunable) style, just as individual humans do, and also need to accept guidance from separate information processing modules in order to increase quality of responses.In an extreme case, an agent may be micro-managed by a human user who uses the neural model to enforce grammar and style (e.g., a level of politeness, or a type of humor), while driving the content directly (e.g., by expressing In this manner, the neural model becomes an authoring tool, rather than an independent chat-bot.On the other hand, in fully automated agent systems, the agent may be influenced by a knowledge database, or some other artificial information system, while running in a pre-set style or a style deemed best based on the course of the conversation. One obstacle to achieving this with neural language generation models is that the sentence representation is distributed across all coordinates of arXiv:1709.03010v1[cs.CL] 9 Sep 2017 the embedding vector in a way that is hard to disentangle, and thus control.In order to gain insight into the full distribution of what a decoder might produce given the prompt sentence as input, the model has to be heavily (and sometimes cleverly) sampled.The second problem is that neural models only become highly functional after training with very large amounts of data, while the strongly recognizable style usually must be defined by a relatively tiny corpus of examples (e.g., all Seinfeld episodes, or all popular song lyrics).
In this paper, we address the challenge of how to enforce the decoder models to mimic a specific language style with only thousands of target sentences, as well as generating specific content in that style.We developed and experimented with several training and decoding procedures to allow the model to adapt to target language style and follow additional content guidance.Our experiments, conducted on an open-domain corpus of Twitter conversations and small persona corpora, show that our methods are capable of responding to queries in a transferred style without significant loss of relevance, and can respond within a specific topic as restricted by a human.Some examples of 'scenting' the base conversation model with particular styles are shown in Table 1.More can be found in the Supplementary Material.

Related Work
Recurrent neural network based encoder-decoder models have been applied to machine translation and quickly achieved state-of-the-art results (Bahdanau et al., 2014;Luong et al., 2015).As an extension, the attention mechanism enables the decoder to revisit the input sequence's hidden states and dynamically collects information needed for each decoding step.Specifically, our conversation model is established based on a combination of the models of (Bahdanau et al., 2014) and (Luong et al., 2015) that we found to be effective.In section 3, we describe the attention-based neural encoder-decoder model we used in detail.
This work follows the line of research initiated by (Ritter et al., 2011) and (Vinyals and Le, 2015) who treat generation of conversational dialog as a data-drive statistical machine translation (SMT) problem.Sordoni et al. (2015) extended (Ritter et al., 2011) by re-scoring SMT outputs using a neural encoder-decoder model conditioned on conversation history.Recently, researchers have used neural encoder-decoder models to directly generate responses in an end-to-end fashion without relying on SMT phrase tables (Vinyals and Le, 2015;Sordoni et al., 2015;Shang et al., 2015;Shao et al., 2017;Li et al., 2015).Li et al. (2016) defined a "persona" as the character that an artificial agent, as actor, plays or performs during conversational interactions.Their dataset requires user identification for all speakers in the training set, while our methods treat the base data (millions of twitter conversations) as unlabeled, and the target persona is defined simply by a relatively small sample of their speech.In this sense, the persona can be any set of text data.In our experiments, for example, we used a generic Star Wars character that was based on the entire set of Star Wars scripts (in addition to 46 million base conversations from Twitter).This provides us with a system that can talk about almost anything, being able to respond to most prompts, but in a recognizable Star Wars style.Other possibilities include training (styling) on famous personalities, or certain types of poetry, or song lyrics, or even mixing styles by providing two or more datasets for styling.Thus our targets are highly recognizable styles, and use of these for emphasis (or caricature) by human puppeteers who can choose from multiple options and guide neural models in a direction they like.We expect that these tools might not only be useful in conversational systems, but could also be popular in social media for text authoring that goes well beyond spelling/grammar auto correction.

Neural Encoder-Decoder Background
In general, neural encoder-decoder models aim at generating a target sequence Y = y 1 , . . ., y Ty given a source sequence X = (x 1 , . . ., x Tx ).Each word in both source and target sentences, x t or y t , belongs to the source vocabulary V x , and the target vocabulary V y respectively.
First, an encoder converts the source sequence X into a set of context vectors C = {h 1 , h 2 , . . ., h Tx }, whose size varies with regard to the length of the source passage.This context representation is generated using a multi-layered recurrent neural network (RNN).The encoder RNN reads the source passage from the first token until the last one, where Here E x ∈ R |Vx|×d is an embedding matrix containing vector representations of words, and Ψ is a recurrent activation unit that we employ in the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997).
The decoder, which is also implemented as an RNN, generates one word at a time, based on the context vector set returned by the encoder.The decoder's hidden state ht is a fixed-length continuous vector that is updated in the same way as encoder.At each time step t in the decoder, a time-dependent attentional context vector c t is computed based on the current hidden state of the decoder ht and the whole context set C.
Decoding starts by computing the contentbased score of each context vector as: e t,i = h t W a h i .This relevance score measures how helpful the i-th context vector of the source sequence is in predicting next word based on the decoder's current hidden state h t .Relevance scores are further normalized by the softmax function: , and we call α t,i the attention weight.The time-dependent context vector c t is then the weighted sum of the context vectors with their attention weights from above: c t = Tx i=1 α t,i h i .With the context vector c t and the hidden state h t , we then combine the information from both vectors to produce an attentional hidden state as follow: z t = tanh(W c [c t ; h t ]).The probability distribution for the next target symbol is computed by p(y t = k|ỹ <t , X) ∝ exp(W s z t + b t ).

Decoding with Selective Sampling
The standard objective function for neural encoder-decoder models is the log-likelihood of target T given source S, which at test time yields the statistical decision problem: (1) However, as discussed in (Li et al., 2015;Shao et al., 2017), simply conducting beam search over the above objective will tend to generate generic and safe responses that lack diversity, such as "I am not sure".In section 7.3, we present a ranking experiment in which we verify that an RNN-based neural decoder provides a poor approximation of the above conditional probability, and instead biases towards the target language model p(T ).Fortunately, the backward model p(S|T ) empirically perform much better than p(T |S) on the relevance ranking task.Therefore, we directly apply Bayes' rule to Equation 1, as in statistical machine translation (Brown et al., 1993), and use: Since p(T |S) is empirically biased towards p(T ), in practice, this objective also resembles the Maximum Mutual Information (MMI) objective function in (Li et al., 2015).
The challenge now is to develop an effective search algorithm for a target words sequence that maximize the product in Equation 2. Here, we follow a similar process as in (Wen et al., 2015) which generates multiple target hypotheses with stochastic sampling based on p(T |S), and then ranks them with the objective function 2 above.However, as also observed by (Shao et al., 2017), step-by-step naive sampling can accumulate errors as the sequence gets longer.
To reduce language errors of stochastic sampling, we introduce a sample selector to choose the next token among N stochastically sampled tokens based on the predicted output word distributions.The sample selector, which is a multilayer perceptron in our experiments, takes the following features: 1) the log-probability of current sample word in p(w t |S); 2) the entropy of current predicted word distribution, wt P (w t |S) log P (w t |S) for all w t in the vocabulary; 3) the log-probability of current sample word in p(w t |∅), which we found effective in ranking task.The selector outputs a binary variable that indicates whether the current sample should be accepted or rejected.
At test time, if none of the N sampled tokens are above the classification threshold, we choose the highest scored token.If there are more than 1 acceptable samples among N stochastically sampled tokens, we randomly choose one among them.Ideally, this permits us to safely inject diversity while maintaining language fluency.We also use the sample acceptor's probabilities as the language model score P (T ) for objective in equation 2.
As regards directly integrating beam-search, we found (a) that beam-search often produces a set of similar top-N candidates, and (b) that decoding with only the objective p(Y |X) can easily lead to irrelevant candidates.(See section 7.3) Therefore, we use the selective-sampling method to generate candidates for all our experiments; this (a) samples stochastically then (b) selects using a learned objective from data.The sample-then-select ap-proach encourages more diversity (v.s.MMI's beam-search) while still maintain language fluency (v.s.naive-sampling).
5 Output style restriction using a small 'scenting' dataset In this section, we propose three simple yet effective methods of influencing the language style of the output in the neural encoder-decoder framework.Our language style restricting setup assumes that there is a large open-domain parallel corpus that provides training for context-response relevance, and a smaller monologue speaker corpus that reflects the language characteristics of the target speaker.We will refer to this smaller set as a 'scenting' dataset, since it hints at, or insinuates, the characteristics of the target speaker.

Rank: Search in the Target Corpus
Our first approach to scenting is to simply use the all sentences in the target speaker's corpus as generation candidates, ranked by the objective (2) for a given prompt.Since these sentences are naturally-occurring instead of generated wordby-word, we can safely assume p(T ) is constant (and high), and so the objective only requires sorting the sentences based on the backward model p(S|T ).RNN-based ranking methods are among the most effective methods for retrieving relevant responses (Wang andNyberg, 2015, 2016).Thus this approach is a very strong baseline.Its limitation is also obvious: by limiting all possible responses to a fixed finite set of sentences, this method cannot provide a good response if such a response is not already in the scenting dataset.

M ultiply: Mixing the base model and the target language model during generation
In our second method we use both the vanilla encoder-decoder model trained on open-domain corpus and the target domain language model trained on the corpus while decoding output sentence.The idea is to use a speaker's language model, which is also RNN-based in our experiments, to restrict the open-domain encoderdecoder model's step-by-step word prediction.Similar ideas have been tested in domain adaptation for statistical machine translation (Koehn and Schroeder, 2007), where both in-domain and open-domain translation tables were used as can-didates for generating target sentence.Because open-domain encoder-decoder models are trained with various kinds of language patterns and topics, choosing a sequence that satisfies both models may produce relevant responses that are also in the target language style.We found that a straightforward way of achieving this is to multiply the two models' distributions p 1 (t|S) λ 1 p 2 (t) λ 2 at each point and then re-normalize before sampling.The weights can be tuned either by the perplexity on the validation set, or through manually controlling the trade-off between style restriction and answer accuracy.

F inetune: Over-training on Target Corpus with Pseudo Context
Fine-tuning is a widely used in the neural network community to achieve transfer learning.This strategy permits us to train the neural encoder-decoder on a larger general parallel corpus, and then use the learned parameters to initialize the training of a styled model.Most of the time, however, the target speaker's corpus will lack training data in parallel form.For example, if we train on song lyrics or movie scripts, or political speeches, the data will not be in a question-answer form.To make encoder-decoder overtraining possible, we treat every sentence in the scenting corpus as a target sentence T generated a pseudo context from the backward model p(S|T ) trained on the opendomain corpus.Over-training on such pairs imparts the scenting dataset's language characteristics, while retaining the generality of the original model.We also found that the previous sentence in the styled corpus (i.e., previous sentence in the speech) provides helpful context for the current sentence, analogous with a question-answer link.Thus we use both pseudo context and the previous sentence as possible sources S to fine-tune the in-domain decoder.To avoid overfitting, we stop overtraining when the perplexity on the in-domain validation set starts to increase.A corresponding sample acceptor is also trained for the fine-tuned model: we found it helpful to initialize this from the open-domain model's sample acceptor.

Restricting the Output Topic
We further introduce a topic restricting method for neural decoders based on the Counting Grid (Jojic and Perina, 2011) model, by treating language guidance as a topic embedding.Our model exten-sion provides information about the output topic in the form of an additional topic embedding vector to the neural net at each time step.

CG: Counting Grids
The basic counting grid π k is a set of distributions on the d-dimensional toroidal discrete grid E indexed by k.The grids in this paper are bidimensional and typically from (E x = 32) × (E y = 32) to (E x = 64) × (E y = 64) in size.
The index z indexes a particular word in the vocabulary z = [1 . . .Z]. Thus, π i (z) is the probability of the word z at the d-dimensional discrete location i, and z π i (z) = 1 at every location on the grid.The model generates bags of words, each represented by a list of words w = {w n } N n=1 with each word w n taking an integer value between 1 and Z.The modeling assumption in the basic CG model is that each bag is generated from the distributions in a single window W of a preset size, e.g., (W x = 5) × (W y = 5).A bag can be generated by first picking a window at a d-dimensional location , denoted as W , then generating each of the N words by sampling a location k n for a particular micro-topic π kn uniformly within the window, and sampling from that micro-topic.
Because the conditional distribution p(k n | ) is a preset uniform distribution over the grid locations inside the window placed at location , the variable k n can be summed out (Jojic and Perina, 2011), and the generation can directly use the grouped histograms where |W| is the area of the window, e.g. 25 when 5×5 windows are used.In other words, the position of the window in the grid is a latent variable given which we can write the probability of the bag as As the grid is toroidal, a window can start at any position and there is as many h distributions as there are π distributions.The former will have a considerably higher entropy as they are averages of many π distributions.Although the basic CG model is essentially a simple mixture assuming the existence of a single source (one window) for all the features in one bag, it can have a very large number of (highly related) choices h to choose from.Topic models (Blei et al., 2003;Lafferty and Blei, 2006), on the other hand, are admixtures that capture word co-occurrence statistics by using a much smaller number of topics that can be more freely combined to explain a single document (and this makes it harder to visualize the topics and pinpoint the right combination of topics to use in influencing the output).
In a well-fit CG model, each data point tends to have a rather peaky posterior location distribution because the model is a mixture.The CG model can be learned efficiently using the EM algorithm because the inference of the hidden variables, as well as updates of π and h can be performed using summed area tables (Crow, 1984), and are thus considerably faster than most of the sophisticated sampling procedures used to train other topic models.The use of overlapping windows helps both in controlling the capacity of the model and in organizing topics on the grid automatically: Two overlapping windows have only slightly different h distributions, making CGs especially useful in visualization applications where the grid is shown in terms of the most likely words in the component distributions π (Perina et al., 2014). 1aving trained the grid on some corpus (in our case a sample of the base model's corpus), the mapping of either a source S and/or target T sentence can be obtained by treating the sentences as bags of words.By appending one or both of these mappings to the decoder's embedding of the target T , the end-to-end encoder-decoder learning can be performed in a scenario where the decoder is expected to get an additional hint through a CG mapping.In our experiments, we only used the embedding of the target T as the decoder hint, and we appended the full posterior distribution over CG locations to the encoder's embedding.At test time, we only have the S and need to generate T without knowing where it may map in the counting grid.We considered two ways of providing a mapping: • The user provides a hint sentence H (could be just a few words in any order), and the CG mapping of the user's hint, i.e. the full posterior distribution p( |H), is used in the decoding.The posterior probabilities over 32 × 32 grid locations are unwrapped into a vector As an example, Figure 1 shows a portion of a CG trained on randomly chosen 800k tweets from the twitter corpus.In each cell of the grid, we show the top words in the distribution π j (z) over words (z) in that location (j).(Each cell has a distribution over the entire vocabulary).As a response to "I am hungry," using two highlighted areas as hints, we can generate either a set of empathic responses, such as 'Me too,' or food suggestions, such as 'Let's have cake.'It will also be evident that some areas of the grid may produce less sensical answers.These can later be pruned by likelihood criteria or by user selection.

Datasets
Yahoo! Answer Dataset.We use the Comprehensive Questions and Answers dataset2 to train and validate the performances of different decoding setups with ranking experiments described in section 7.3.This dataset contains 4.4 million Yahoo!Answers questions and the user-selected best answers.Unlike the conversational datasets, such as the Twitter dataset described below, it contains more relevant and specific responses for each question, which leads to less ambiguity in ranking.
Twitter Conversation Dataset.We trained our base encoder-decoder models on the Twitter Conversation Triple Dataset described in (Sordoni et al., 2015), which consists of 23 million conversational snippets randomly selected from a collection of 129M context-message-response triples extracted from the Twitter Firehose over the 3-month period from June through August 2012.For the purposes of our experiments, we split the triples into context-message and message-response pairs yielding 46M source-target pairs.For tuning and evaluation, we used the development dataset of size 200K conversation pairs and the test dataset of 5K examples.The corpus is preprocessed using a Twitter specific tokenizer (O'Connor et al., 2010).
The vocabulary size is limited to 50,000 excluding the special boundary symbol and the unknown word tag.
Scenting datasets.A variety of persona characters have been trained and tested, including Hillary Clinton, Donald Trump, John F. Kennedy, Richard Nixon, singer-songwriters, stand-up comedians, and a generic Star Wars character.In experiments, we evaluated on a diverse set of representative target speakers: JFK.We mainly tested our models on John F. Kennedy's speeches collected from American Presidency Project3 , which contains 6474 training and 719 validation sentences.
Star Wars.Movie subtitles of three Star Wars movies are also tested4 .They are extracted from Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011), and have 495 training and 54 validation sentences.
Singer-Songwriter.We also evaluated our approach on a lyric corpus from a collective of singers: Coldplay, Linkin Park, and Green Day.The lyric dataset is collected from mldb.org and has 9182 training and 1020 validation lines.
Debate Chat Contexts.We designed testing questionnaires with 64 chat contexts spanning a range of topics in politic, science, and technology: the sort of questions we might ask in an entertaining political debate. 5To test the model's ability to control output topic in section 7.4.3,we also created one hint per question.

Network Setup and Implementation
Our encoder and decoder RNNs contains twolayer stacked LSTMs.Each LSTM layer has a memory size of 500.The network weights are randomly initialized using a uniform distribution (−0.08, 0.08), and are trained with the ADAM optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.002.Gradients were clipped so their norm does not exceed 5.Each mini-batch contains 200 answers and their questions.The words of input sentences were first converted to 300-dimensional vector representations learned from the RNN based language modeling tool word2vec (Mikolov et al., 2013).The beginning and end of each passage are also padded with a special boundary symbol.During decoding, our model generates 500 candidate samples in parallel, then ranks them.As these are processed in batches on GPU, generation is very efficient.We also experimented incorporating an information retrieval (IR) module to automatically collect topic hints for CG-based decoder.Specifically, a full-text index of twitter corpus is built using solr6 , and the top 10 searched results based on the source sentence are be used to generate posterior CG distributions as hints.

Validating the Decoding Setup with Ranking
We performed a ranking evaluation applying different decoding setups on the Yahoo!Answers dataset.Here we wanted to test the relevance judgment capacities of different setups, and validate the necessity of the new decoding method discussed in section 4. Yahoo!Answers question is used as source S, and its answer is treated as target T .Each test question is associated with one true answer and 19 random answers from the test set.MRR (Mean Reciprocal Rank) and P@1 (precision of top1) were then used as evaluation metrics.
Table 2 shows the answer ranking evaluation results: the forward model P (T |S), by itself is close to the performance of random selection in distinguishing true answer from wrong answers.This implies that a naive beam search over only the forward model may generate irrelevant outputs.One hypothesis was that P (T |S) is biased toward P (T ), and performance indeed improves after normalizing by P (T ).However, it is difficult to directly decode with objective P (T |S)/P (T |∅), because this objective removes the influence of the target-side language model.Decoding only according to this function will thus result in only low-frequency words and ungrammatical sentences, behavior also noted by (Li et al., 2015;Shao et al., 2017).We tested 10 different system configurations to evaluate the overall output quality, and their abilities of influencing output language style and topic: • vanilla-sampling each word in the target.
• selective-sampling as described in section 4; all the following systems are using it as well.• cg-ir uses IR results to create counting grid topic hints (sections 6.1 and 7.2).• rank uses proposals from the full JFK corpus as in section 5.1.• multiply with a JFK language model as in section 5.2.• finetune with JFK dataset as in section 5.3.
• finetune-cg-ir uses IR results as topic hints for fine-tuned JFK.• finetune-cg-topic forced to use the given topic hint for fine-tuned JFK.• singer-songwriter fine-tuned cg-topic.

Evaluation Setup
Owing to the low consistency between automatic metrics and human perception on conversational tasks (Liu et al., 2016;Stent et al., 2005) and the lack of true reference responses from persona models, we evaluated the quality of our generated text with a set of judges recruited from Amazon Mechanical Turk (AMT).Workers were selected based on their AMT prior approval rate (>95%).Each questionnaire was presented to 3 different workers.We evaluated our proposed models on the 64 debate chat contexts.Each of the evaluated methods generated 3 samples for every chat context.To ensure calibrated ratings between systems, we show the human judges all system outputs (randomly ordered) for each particular test case at the same time.For each chat context, we conducted three kinds of assessments: Quality Assessment Workers were provided with the following guidelines: "Given the chat In this test, the outputs of all 10 systems evaluated are then provided to worker together for a total of 30 responses.In total, we gathered 64 • 30 • 3 = 5760 ratings for quality assessments, and 47 different workers participated.
Style Assessment.We provided following instructions: "Which candidate responses are likely to have come from or are related to [Persona Name]?".Checkboxes were provided for the responses from style-influenced systems and from selective-sampling as a baseline.
Topic Assessment.The instruction was: "Which candidate answers to the chat context above are similar or related to the following answer: '[a hint topic provided by us]'?".This was also a checkbox questionnaire.Candidates are from both style-and topic-influenced systems (fine-tuned cg-topic), and from selective-sampling as a baseline.4: The style and topic assessments (both binary) of three models with different personas and with restriction of specific target topic for each chat context.All style and topic results are statistically significant compared to the Base (selectivesampling) by paired t-tests with p < 0.5%.

Results
Overall Quality.We conducted mean opinion score (MOS) tests for overall quality assessment of generated responses with questionnaires described above.Table 3 shows the MOS results with standard error.It can be seen that all the systems based on selective sampling are significantly better than vanilla sampling baseline.When restricting output's style and/or topic, the MOS score results of most systems do not decline significantly except singer-songwriter, which attempts to generate lyrics-like outputs in response to to political debate questions, resulting in uninterpretable strings.
Our rank method uses p(S|T ) to pick the answer from the original persona corpus, and is thus as good at styling as the person themselves.Because most of our testing questionnaire is political, the rank was indeed often able to find related answers in the dataset (JFK).Also, unlike generation-based approaches, rank has oraclelevel language fluency and it is expected to have quality score of at least 2 ("Interpretable, but not related").Overall, however, the quality score of rank is still lower than other approaches.Note that a hybrid system can actually chose between rank and the decoder's outputs based on likelihood, as shown in the example of bJFk-bNixon debate in the supplemental material.
Influencing the Style.Table 3 also shows the likelihood of being labeled as JFK for different methods.It is encouraging that finetune based approaches have similar chances as the rank system which retrieves sentences directly from JFK corpus, and are significantly better than the selectivesampling baseline.
Influencing both Style and Topic.Table 4 summarizes the results in terms of style (the fraction of answers labeled as in-style for the target persona), and topic (the percentage of answers picked as related to the human-provided topic hint text).We used the last three of the ten listed systems, which are both styled and use specific topic hints to generate answers.These results demonstrate that it is indeed possible to provide simple prompts to a styled model and drive their answers in a desired direction while picking up the style of the persona.It also shows that the style of some characters is harder to recreate than others.For example, workers are more likely to label baseline results as lyrics from a singer-songwriter than lines from Star Wars movies, which might be because lyrics often take significant freedom with structure and grammar.We also found that it is harder for Star Wars and Singer-Songwriter bots to follow topic hints than it is for the John F. Kennedy model, largely because the political debate questions we used overlap less with the topics found in the scenting datasets for those two personas.

Conclusions
In this study we investigated the possibility of steering the style and content in the output of a neural encoder-decoder model7 .We showed that acquisition of highly recognizable styles of famous personalities, characters, or professionals, is achievable, and that it is even possible to allow users to influence the topic direction of conversations.The tools described in the paper are not only useful in conversational systems (e.g., chatbots), but can also be useful as authoring tools in social media.In the latter case, the social media users might use neural models as consultants to help with crafting responses to any post the user is reading.The AMT tests show that these models do indeed provide increased recognizability of the style, without sacrificing quality or relevance.

Figure 1 :
Figure 1: A part of a Counting Grid trained on Twitter data and its use in providing topical hints in decoding.For the source sentence at the top, the decoder may produce the two target samples on the right, if the circled locations are used as a hint, or the two sentences at the bottom if the locations in the lower right are picked.

Table 1 :
Example responses from our Star Wars, Hillary, Trump, and Kennedy bots with scented conversation models.

Table 2 :
Ranking the true target answer among random answers on Yahoo!Answers test set.