Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations

Quotations are crucial for successful explanation and persuasion in interpersonal communication. However, finding what to quote in a conversation is challenging for both humans and machines. This work studies automatic quotation generation in online conversations and explores how language consistency affects whether a quotation fits the given context. Here, we capture the contextual consistency of a quotation in terms of latent topics, interactions with the dialogue history, and coherence to the query turn's existing content. Further, an encoder-decoder neural framework is employed to continue the context with a quotation via language generation. Experimental results on two large-scale datasets in English and Chinese demonstrate that our quotation generation model outperforms state-of-the-art models. Further analysis shows that topic, interaction, and query consistency are all helpful for learning how to quote in online conversations.


Introduction
Quotations, or quotes, are memorable phrases or sentences widely echoed to spread patterns of wisdom (Booten and Hearst, 2016). They are derived from the ancient art of rhetoric and now appear in various daily activities, ranging from formal writing (Tan et al., 2015) to everyday conversations (Lee et al., 2016), all helping us present clear, beautiful, and persuasive language. However, for many individuals, writing a suitable quotation that fits the ongoing context is a daunting task. The issue becomes more pressing for quoting in online conversations, where quick responses are usually needed on mobile devices (Lee et al., 2016).
To help online users find what to quote in the discussions they are involved in, our work studies how to recommend a quote for an ongoing conversation and ensure its continuity of senses with the existing context. For task illustration, Figure 1 displays a Reddit conversation snippet centered around the worthiness of buying a Scuf controller. To argue against T4's viewpoint, the query turn quotes Tusser's old saying to show that buying such a controller is a waste of money. As can be observed, it is important for a quotation recommendation model to capture the key points being discussed (reflected by words like "money" and "dumb" here) and align them with words in the quotation to be predicted (such as "fool" and "money"), which allows the model to quote something relevant and consistent with the previous concern.
(* Xingshan Zeng is the corresponding author.)
To predict quotations, our work explores the semantic consistency between what will be quoted and what was given in the context. In context modeling, we distinguish the query turn (henceforth query) from the other turns in the earlier history (henceforth history), where topic, interaction, and query consistency work together to determine whether a quote fits the context. Here topic consistency ensures that the words in the quotation reflect the discussion topic (such as "fool" and "money" in Figure 1). Interaction consistency is to identify the turns in history to which the query responds (e.g., T1 and T4 in Figure 1) and guide the quote to follow such interactions. Query consistency measures the language coherence of the quote in continuing the story started by the query. For example, the quote in Figure 1 supports the query's argument.
In previous work on quotation recommendation, many methods are designed for formal writing (Tan et al., 2015), whereas much fewer efforts are made for online conversations, whose contexts exhibit informal language and complex interactions. Lee et al. (2016) use a ranking model to recommend quotes for Twitter conversations. Different from them, we attempt to generate quotations in a word-by-word manner, which allows the semantic consistency of quotes and contexts to be explored.
Concretely, we propose a neural encoder-decoder framework to predict a quotation that continues the given conversation contexts. We capture topic consistency with latent topics (i.e., word distributions), which are learned by a neural topic model (Zeng et al., 2018a) and inferred jointly with the other components. Interaction consistency is modeled with a turn-based attention over the history turns, and the query is additionally encoded to initialize the decoder's states for query consistency. To the best of our knowledge, we are the first to explore quotation generation in conversations and extensively study the effects of topic, interaction, and query consistency on this task.
Our empirical study is conducted on two large-scale datasets, one in Chinese from Weibo and the other in English from Reddit, both constructed as part of this work. Experimental results show that our model significantly outperforms both the state-of-the-art model based on quote ranking (Lee et al., 2016) and the recent topic-aware encoder-decoder model for social media language generation (Wang et al., 2019a). For example, we achieve 27.2 precision@1 on Weibo compared with 24.0 by Wang et al. (2019a). Further discussions show that topic, interaction, and query consistency can all usefully indicate what to quote in online conversations. We also study how the lengths of the history and the quotation affect the quoting results and find that our model performs consistently better than the comparison models in varying scenarios.

Related Work
Our work is in line with content-based recommendation or cloze-style reading comprehension (Zheng et al., 2019), which learns to put suitable text fragments (e.g., words, phrases, sentences) into given contexts. Most prior studies explore the task in formal writing, such as citing previous work in scientific papers (He et al., 2010), quoting famous sayings in books (Tan et al., 2015, 2016), and using idioms in news articles (Zheng et al., 2019). The language they face is mostly formal and well-edited, while we tackle online conversations exhibiting noisy contexts, which requires modeling quote consistency with turn interactions. Lee et al. (2016) also recommend quotations for conversations. However, they consider quotations as discrete attributes (for learning to rank) and hence largely ignore the rich information reflected by a quotation's internal word patterns. Compared with them, our model learns to quote with language generation, which can usefully exploit how words appear in both contexts and quotations.
For methodology, we are inspired by encoder-decoder neural language generation models (Bahdanau et al., 2014). In dialogue domains, such models have achieved huge success in digesting contexts and generating microblog hashtags (Wang et al., 2019b), meeting summaries, dialogue responses (Hu et al., 2019), etc. Here we explore how the encoder-decoder architecture works for generating quotations in conversations, which has never been studied in existing work. Our study is also related to previous research on understanding conversation contexts (Ma et al., 2018; Liu and Chen, 2019), where it is shown to be useful to capture interaction structures (Liu and Chen, 2019) and latent topics (Zeng et al., 2019). For latent topics, we benefit from the recent advances in neural topic models (Miao et al., 2017; Wang et al., 2019a), which allow end-to-end topic inference in neural architectures. Nevertheless, none of the above work attempts to study the semantic consistency of quotes with conversation contexts, which is the gap our work fills.

Our Quotation Generation Model
This section describes our neural encoder-decoder framework for generating quotations in conversations, whose architecture is shown in Figure 2. The encoding process performs context modeling of turn interactions (described in Section 3.1) and latent topics (presented in Section 3.2). In the decoding process, discussed in Section 3.3, we predict the words in quotes taking topic, interaction, and query consistency into consideration. The learning objective of the entire framework is given last, in Section 3.4.

Interaction Modeling
To describe turn interactions, we first assume that there are $m$ chronologically ordered turns given as contexts and that each turn $T_i$ is formulated as a word sequence $w_{i,1}, w_{i,2}, \ldots, w_{i,n_i}$ ($n_i$ denotes the number of words). We consider the $m$-th turn as the query and the others as the history ($T_{history} = (T_1, T_2, \ldots, T_{m-1})$). Here we distinguish the query from its earlier history to separately explore the quote's language coherence to the query (for query consistency) and its interaction consistency with the earlier posted turns. In the following, we describe how to encode the history and query turns, and how the learned representations work together to explore the conversation structure.
History Encoder. Here we describe how to encode the turns in history. We first feed each word $w_{i,j}$ (the $j$-th word in the $i$-th turn) in history into an embedding layer to obtain its word vector $c_{i,j}$. Then the word vectors of the $i$-th turn, $C_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,n_i})$, are further processed with a bidirectional gated recurrent unit (Bi-GRU) (Cho et al., 2014b), whose hidden states are defined as:

$\overrightarrow{h}_{i,j} = \overrightarrow{\mathrm{GRU}}(c_{i,j}, \overrightarrow{h}_{i,j-1}), \quad \overleftarrow{h}_{i,j} = \overleftarrow{\mathrm{GRU}}(c_{i,j}, \overleftarrow{h}_{i,j+1})$

The turn-level representation $h^c_i$ is hence captured by concatenating the last hidden states of the two directions, i.e., $h^c_i = [\overrightarrow{h}_{i,n_i}; \overleftarrow{h}_{i,1}]$. Further, we define the history representations as $h^c = (h^c_1, h^c_2, \ldots, h^c_{m-1})$, which will be further used to encode the interaction structure (described later).
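To make the Bi-GRU encoding concrete, here is a minimal pure-Python sketch of a single-unit GRU cell run in both directions over a toy scalar sequence. This is an illustrative toy version only (scalar states, no biases, hand-picked weight names); real implementations use vector states and a deep-learning framework.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, W):
    # Single-unit GRU update (toy scalar version of Cho et al., 2014b).
    # W holds six scalar weights; biases are omitted for brevity.
    z = sigmoid(W["wz"] * x + W["uz"] * h)                 # update gate
    r = sigmoid(W["wr"] * x + W["ur"] * h)                 # reset gate
    h_tilde = math.tanh(W["wh"] * x + W["uh"] * (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde

def bi_gru_last_states(xs, W):
    # Run the same cell forward and backward over the sequence;
    # concatenating the two final states gives the turn-level
    # representation h^c_i described above.
    hf = 0.0
    for x in xs:
        hf = gru_cell(x, hf, W)
    hb = 0.0
    for x in reversed(xs):
        hb = gru_cell(x, hb, W)
    return (hf, hb)
```

In practice both directions have separate parameters; sharing `W` here just keeps the sketch short.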
Query Encoder. Similar to the way we encode each turn in history, a Bi-GRU is first employed to learn the query representation $q = h^c_m$. Then, we identify which turns in history the query responds to for learning interaction consistency. To this end, we put a query-aware attention over the history turns, resulting in a context vector:

$c_q = \sum_{i=1}^{m-1} \alpha_i h^c_i, \quad \alpha_i = \mathrm{softmax}_i(q \cdot h^c_i)$

Afterwards, we enrich the query representation with the features from history and obtain the history-aware query representation:

$\tilde{q} = \tanh(W_q [q; c_q] + b_q)$

where $W_q$ and $b_q$ are learnable parameters.
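The query-aware attention over history turns can be sketched in a few lines of pure Python. This is a simplified stand-in: a plain dot-product scorer replaces the learned scorer, and `query_aware_context` is our illustrative name, not a function from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def query_aware_context(query, history):
    # Score each history turn vector against the query vector,
    # normalize the scores, and average turn vectors into a
    # context vector (the c_q above).
    scores = [sum(q * h for q, h in zip(query, turn)) for turn in history]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * turn[d] for w, turn in zip(weights, history))
               for d in range(dim)]
    return context, weights
```

A turn similar to the query receives a higher weight and so contributes more to the context vector.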
Structure Encoder. With the representations learned above for the query ($\tilde{q}$) and the history ($h^c$), we can further explore how turns interact with their neighbors (henceforth conversation structure) with another Bi-GRU. It is fed with the sequence $(h^c_1, h^c_2, \ldots, h^c_{m-1}, \tilde{q})$, and the resulting hidden state sequence $(h_1, h_2, \ldots, h_{m-1}, h_m)$ is put into a memory bank $M$ for the decoder's attentive retrieval in quotation generation (see Section 3.3).

Topic Modeling
Following common practice (Blei et al., 2003; Miao et al., 2017), we model topics under the bag-of-words (BoW) assumption. Hence, we form a BoW vector $x_{bow}$ (over vocabulary $V$) of the words in context to learn its discussion topic. The topic inference process is inspired by neural topic models (NTM) (Miao et al., 2017). It is based on a variational auto-encoder (VAE) (Kingma and Welling, 2013) involving an encoding and a decoding step to reconstruct the BoW of the contexts.

BoW Encoding Step. This step is designed to learn a latent topic variable $z$ from $x_{bow}$. Here the words in conversation contexts are assumed to follow a Gaussian prior with mean $\mu$ and standard deviation $\sigma$ (Miao et al., 2017), which are estimated as:

$\mu = f_\mu(f_e(x_{bow})), \quad \log \sigma = f_\sigma(f_e(x_{bow}))$

where each $f_*(\cdot)$ is a neural perceptron performing a linear transformation activated with a ReLU function (Nair and Hinton, 2010).

BoW Decoding Step. Conditioned on the latent topic $z$, we further generate words to form the BoW of each conversation $x_{bow}$. Here we assume each word $w_n \in x_{bow}$ is drawn from the conversation's topic mixture $\theta$, a distribution vector over the topics. In the following, we show the generative story to decode $x_{bow}$:

• Draw the latent topic $z \sim N(\mu, \sigma^2)$ and compute the topic mixture $\theta = \mathrm{softmax}(f_\theta(z))$.
• For the $n$-th word in the conversation:
  - Draw the word $w_n \sim \mathrm{softmax}(f_\phi(\theta))$.

Here $f_*(\cdot)$ denotes a ReLU-activated neural perceptron as defined above. The topic mixture $\theta$ will later be applied to capture topic consistency when predicting the quotation.
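The NTM's generative story can be simulated end to end with Python's standard library. This is a toy sketch under our own assumptions: `sample_bow` is an illustrative name, and the two ReLU perceptrons $f_\theta$ and $f_\phi$ are reduced to fixed toy weight matrices `W_theta` and `W_phi`.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def sample_bow(mu, sigma, W_theta, W_phi, n_words, seed=0):
    # Sketch of the generative story: draw z ~ N(mu, sigma^2),
    # map it to a topic mixture theta, then draw BoW word indices
    # from softmax(f_phi(theta)).
    rng = random.Random(seed)
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    relu = lambda x: max(0.0, x)
    theta = softmax([relu(sum(w * zi for w, zi in zip(row, z)))
                     for row in W_theta])
    word_dist = softmax([relu(sum(w * t for w, t in zip(row, theta)))
                         for row in W_phi])
    words = rng.choices(range(len(word_dist)), weights=word_dist, k=n_words)
    return theta, word_dist, words
```

During training, the sampling of `z` is made differentiable via the usual VAE reparameterization; the sketch only mirrors the probabilistic story.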

Quotation Generation
To predict the quotation $y$, we first define the probability of its words as:

$P(y \mid M, \theta) = \prod_{i=1}^{|y|} \Pr(y_i \mid y_{<i}, M, \theta)$

where $y_{<i} = (y_1, y_2, \ldots, y_{i-1})$ and $|y|$ denotes the quotation's word count. In prediction, the $i$-th word is generated with likelihood $p_i = \Pr(y_i \mid y_{<i}, M, \theta)$, which is jointly determined by the words appearing before it ($y_{<i}$) and the context features delivered by $M$ (the turn interactions described in Section 3.1) and $\theta$ (the discussion topic described in Section 3.2). Below we give more details of how we follow the semantic consistency of contexts to generate quotations.
Query Consistency. To carry on the query's senses, the quotation is decoded with a unidirectional GRU initialized from the encoded query. The initialization and later recursion of the decoder's hidden states are given as:

$h^d_0 = \tanh(W_0 \tilde{q} + b_0), \quad h^d_i = \mathrm{GRU}(v_i, h^d_{i-1})$

where $W_0$ and $b_0$ are parameters to be learned and $v_i$ is the embedded decoder input for predicting the $i$-th word in the quotation. In decoding, word prediction is conducted sequentially with beam search. It results in a ranked list of outputs, from which we take the top K for the quotation matching described later.
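Beam search as used here can be sketched generically: at each step, every surviving prefix is extended with all candidate next tokens and only the highest-scoring prefixes are kept. This is a generic textbook sketch, not the authors' implementation; `step_fn` is an assumed interface returning log-probabilities for the next token.

```python
def beam_search(step_fn, beam_size=5, max_len=4, eos="</s>"):
    # step_fn(prefix) -> {token: log_prob} for the next position.
    # Each beam is a (token_list, cumulative_log_prob) pair.
    beams = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished hypothesis
                continue
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams  # ranked list; the top K feed quotation matching
```

With beam size 5 (as in Section 4), the returned list directly serves as the ranking evaluated by P@1 and MAP.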
Topic and Interaction Consistency. To model the quote's consistency with the discussion topics (via $\theta$) and turn interactions (via $M$), we design a turn-based attention over the conversation contexts to decode the quotation. Its attention weights are computed from the structure-encoded turn representations $h_j$ in $M$ and the topic distribution $\theta$:

$\alpha_{i,j} = \frac{\exp(f_d(h^d_i, h_j, \theta))}{\sum_{j'=1}^{m} \exp(f_d(h^d_i, h_{j'}, \theta))}$

where $f_d(h^d_i, h_j, \theta)$ captures the topic-aware semantic dependency of the $i$-th word in the quotation on the $j$-th turn in the contexts and is defined as:

$f_d(h^d_i, h_j, \theta) = \tanh(W_d [h^d_i; h^\theta_j] + b_d)$

where $h^\theta_j = W_\theta [h_j; \theta] + d_\theta$, and the parameters $W_d$, $b_d$, $W_\theta$, and $d_\theta$ are all trainable. Then we compute the context vector $t_i$, conveying both topic and interaction features, for the $i$-th word to be generated:

$t_i = \sum_{j=1}^{m} \alpha_{i,j} h^\theta_j$

Finally, we predict the $i$-th word in the quotation following the distribution $p_i$, defined to combine topic, interaction, and query consistency:

$p_i = \mathrm{softmax}(W_p [h^d_i; t_i] + b_p)$

where $W_p$ and $b_p$ are trainable parameters.
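The topic-aware turn attention can be illustrated with a small pure-Python sketch: each turn vector is first concatenated with the topic mixture and projected, then scored against the decoder state. The scorer here is a plain dot product and `topic_aware_attention` is our illustrative name; the paper's scorer $f_d$ uses additional learned parameters.

```python
import math

def topic_aware_attention(h_dec, turns, theta, W_theta):
    # Enrich each turn vector h_j with the topic mixture theta
    # (h^theta_j = W_theta [h_j; theta]), score it against the decoder
    # state h_dec, and build the context vector t_i from the weights.
    def proj(turn):
        joint = turn + theta  # concatenation [h_j; theta]
        return [sum(w * x for w, x in zip(row, joint)) for row in W_theta]
    scores = [sum(a * b for a, b in zip(h_dec, proj(t))) for t in turns]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    t_i = [sum(a * t[d] for a, t in zip(alphas, turns))
           for d in range(len(turns[0]))]
    return alphas, t_i
```

The weights `alphas` are what Figure 4 visualizes over turn positions.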
Quotation Matching. Occasionally language generation will "create" a non-existing quotation. To avoid that, we take a post-processing step for outputs absent from our quotation list. Following previous practice, we select the quote from the list with the minimum edit distance (measured by tokens) and consider it as the final output.
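The matching step above amounts to token-level Levenshtein distance plus an argmin over the quotation list. A minimal self-contained sketch (function names are ours; whitespace tokenization stands in for the real tokenizer):

```python
def edit_distance(a, b):
    # Token-level Levenshtein distance with a rolling 1-D DP table.
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,           # deletion
                        dp[j - 1] + 1,       # insertion
                        prev + (ta != tb))   # substitution / match
            prev = cur
    return dp[-1]

def match_quotation(generated, quotation_list):
    # Map a generated token sequence to the closest real quotation.
    return min(quotation_list,
               key=lambda q: edit_distance(generated, q.split()))
```

Since the quotation lists here are modest in size, a linear scan over candidates is cheap enough for post-processing.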

Learning Objective
For the entire framework, we design the learning objective to allow joint learning of latent topics and conversation quotations:

$\mathcal{L} = \mathcal{L}_{NTM} + \lambda \cdot \mathcal{L}_{QGM}$

where $\lambda$ balances the two parts. Here $\mathcal{L}_{NTM}$ is the objective function of the neural topic model (NTM), defined as:

$\mathcal{L}_{NTM} = D_{KL}(q(z \mid x_{bow}) \parallel p(z)) - \mathbb{E}_{q(z \mid x_{bow})}[\log p(x_{bow} \mid z)]$

where $D_{KL}(\cdot)$ is the Kullback-Leibler divergence loss and the expectation term reflects the reconstruction loss. As for $\mathcal{L}_{QGM}$, it is defined as the cross-entropy loss over all training instances to train the quotation generation model (QGM):

$\mathcal{L}_{QGM} = -\sum_{n=1}^{N} \log P(y_n \mid C_n, \theta_n)$

where $N$ is the number of training instances, $C_n = \{T_{history}, T_{query}\}_n$ represents the contexts of the $n$-th conversation, and $\theta_n$ is $C_n$'s topic composition induced by the NTM.
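Numerically, the joint objective combines a closed-form Gaussian KL term with two negative log-likelihoods. The sketch below assumes a standard-normal prior and treats the reconstruction and quotation losses as precomputed scalars; the balancing weight `lam` is our assumed hyperparameter name.

```python
import math

def kl_gaussian(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
    return sum(0.5 * (m * m + s * s - 2.0 * math.log(s) - 1.0)
               for m, s in zip(mu, sigma))

def joint_loss(mu, sigma, recon_nll, quote_nll, lam=1.0):
    # L = L_NTM + lam * L_QGM, where L_NTM couples the KL term with
    # the BoW reconstruction negative log-likelihood and L_QGM is the
    # decoder's cross-entropy loss.
    l_ntm = kl_gaussian(mu, sigma) + recon_nll
    return l_ntm + lam * quote_nll
```

When the posterior matches the prior exactly, the KL term vanishes and only the two reconstruction terms remain.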

Experimental Setup
Datasets. For experiments, we construct two new datasets: one in Chinese from Weibo (a popular microblog platform in China; henceforth Weibo) and the other in English from Reddit (henceforth Reddit). The raw Weibo data is released by Wang et al. (2019a) and the Reddit data is obtained from a publicly available corpus. For both Weibo and Reddit, we follow common practice to form conversations with posts and their comments (Li et al., 2015; Zeng et al., 2018b), where a post or comment is considered as a conversation turn.
To gather conversations with quotations, we maintain a quotation list and remove conversations containing no quotation from the list. For the remaining ones, if a conversation has multiple quotes, we construct multiple instances, each corresponding to the prediction of one quotation therein. On Weibo, we explore the quoting of Chinese Chengyu. For Reddit, we obtain the quotation list from Wikiquote. Afterwards, we remove conversation instances whose quotations appear fewer than 5 times to avoid sparsity (Tan et al., 2015). Finally, the datasets are randomly split into 80%, 10%, and 10% for training, development, and test.
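The instance-construction pipeline described above can be sketched as follows. This is a simplified illustration under our own assumptions (`build_instances` is our name, substring matching stands in for the real quote detection, and `<quote>` is an assumed placeholder token), not the authors' exact preprocessing code.

```python
from collections import Counter

def build_instances(conversations, quotation_list, min_freq=5):
    # One training instance per quoted turn: (history turns, query turn
    # with the quote masked, gold quotation). Conversations without any
    # listed quotation contribute nothing, and quotations occurring
    # fewer than min_freq times are filtered out to avoid sparsity.
    quotes = set(quotation_list)
    raw = []
    for turns in conversations:
        for i, turn in enumerate(turns):
            for q in quotes:
                if q in turn:
                    raw.append((turns[:i], turn.replace(q, " <quote> "), q))
    counts = Counter(q for _, _, q in raw)
    return [inst for inst in raw if counts[inst[2]] >= min_freq]
```

A conversation quoting twice thus yields two instances, each with its own prediction target.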
The statistics of the two datasets are shown in Table 1. We observe that the two datasets exhibit different statistics. For example, from the average turn number in contexts, we find that Reddit conversations tend to involve more turns than Weibo ones. We also show the distribution of quotations over frequency in Figure 3(a) and over position in Figure 3(b). Figure 3(a) shows that only a few quotations are commonly used in online conversations, probably because of their informal writing style. For Figure 3(b), we find that only a few Weibo conversations quote 5 or more turns into the conversation, while the distribution on Reddit is much flatter. (The datasets are available at https://github.com/Lingzhi-WANG/Datasets-for-Quotation-Recommendation; the raw Reddit comments come from https://files.pushshift.io/reddit/comments/. More on Chengyu can be found at https://en.wikipedia.org/wiki/Chengyu: a Chengyu can be seen as a quotable phrase (Wang and Wang, 2013), i.e., a memorable rhetorical figure to convey wit and striking statements (Bendersky and Smith, 2012).)

Preprocessing. To preprocess the Weibo data, we adopt the open-source Jieba toolkit for Chinese word segmentation. For the Reddit dataset, we employ the Natural Language Toolkit (NLTK) for tokenization. In BoW preparation, all stop words and punctuation are removed, following common practice for training topic models (Blei et al., 2003).
Parameter Setting. Here we describe how we set up our model. In the model architecture, the hidden size of all GRUs is set to 300 (bidirectional: 150 for each direction). For the encoder, we adopt two layers of bidirectional GRUs, and a unidirectional GRU for the decoder. The parameters of the NTM are set up following Zeng et al. (2018a). For input, we set the maximum turn length to 150 for Reddit and 200 for Weibo, and the maximum quotation length to 20. Word embeddings are randomly initialized as 150-dimensional vectors. In model training, we employ the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 and early stopping (Caruana et al., 2001). Dropout (Srivastava et al., 2014) is also used to avoid overfitting. We adopt beam search (beam size = 5) to generate a ranked list for quote recommendation.
Evaluation Metrics. We first adopt popular information retrieval metrics for recommendation: precision at K (P@K) and mean average precision (MAP) (Schütze et al., 2008). For P@K, we set K=1 to measure the top prediction, while for MAP we consider the top 5 outputs. Here we measure the generation models with their predictions after quotation matching (Section 3.3). Then, generation metrics are employed to evaluate word-level predictions. Here we consider both ROUGE (Lin, 2004) from summarization (F1 scores of ROUGE-1 and ROUGE-L are adopted) and BLEU (Papineni et al., 2002) from translation. To allow comparable results, generation models are measured with their original outputs (without quotation matching), while for the ranking competitors we take their top-1 ranked quotes.
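The two ranking metrics are simple to compute when each instance has a single gold quotation; a minimal sketch (function names are ours):

```python
def precision_at_1(ranked_lists, gold):
    # Fraction of instances whose top-ranked quote equals the gold one.
    hits = sum(r[0] == g for r, g in zip(ranked_lists, gold))
    return hits / len(gold)

def map_at_k(ranked_lists, gold, k=5):
    # With one gold quotation per instance, average precision reduces
    # to the reciprocal rank within the top k (0 if absent).
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        for rank, q in enumerate(ranked[:k], 1):
            if q == g:
                total += 1.0 / rank
                break
    return total / len(gold)
```

P@1 rewards only exact top predictions, while MAP gives partial credit to gold quotes ranked within the top 5.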
Comparisons. We first adopt two weak baselines that select quotations unaware of the target conversation: 1) RANDOM: selecting quotations randomly; 2) FREQUENCY: ranking quotations by frequency. Then, we compare two ranking baselines: 3) LTR: a non-neural learning-to-rank model with handcrafted features proposed by Tan et al. (2015); 4) CNN-LSTM (Lee et al., 2016): a previous quotation recommendation model (CNN for turn and quotation encoding and LSTM for conversation structure). Next, we consider encoder-decoder generation models without conversation structure modeling: 5) SEQ2SEQ (Cho et al., 2014a): using one RNN for encoding and another for decoding; 6) TAKG (Wang et al., 2019a): a Seq2Seq framework incorporating latent topics for decoding; 7) NCIR: the state-of-the-art (SOTA) model designed for Chinese idiom generation.
Finally, the following variants of our model are tested: 8) IE ONLY: using only the interaction modeling results for decoding (without topic and query consistency modeling); 9) IE+QE: coupling interaction and query consistency (without the NTM for topic consistency); 10) IE+QE+NTM: our full model.

Experimental Results
In this section, we first show the main comparison results in Section 5.1. Then Section 5.2 discusses what we learn to represent consistency. Finally, Section 5.3 presents more analysis to characterize quotations in online conversations.

Main Comparison Results

Table 2 reports the main comparison results on the two datasets, where our full model significantly outperforms all comparisons by a large margin. Several interesting observations can be drawn:
• Quotation is related to context. The poor performance of the weak baselines reveals the challenging nature of quoting in online conversations: it is not possible to learn what to quote without considering the context.
• Generation models outperform ranking models. Generation models in the encoder-decoder style perform much better than ranking ones. This may be attributed to the generation models' ability to learn word-level mappings from the source context to the quotation.
• Interaction, query, and topic consistency are all useful. We see that IE ONLY outperforms SEQ2SEQ, showing that interaction modeling helps encode indicative features from the context. Likewise, the results of IE+QE are better than IE ONLY, and IE+QE+NTM better than IE+QE, suggesting that learning query and topic consistency both contribute to yielding a better quotation.
• Quoting on Reddit is more challenging than quoting Weibo Chengyu. All models perform worse on Reddit than on Weibo. A possible reason is that Chinese Chengyu are shorter and render a smaller vocabulary than English quotes (see Table 1).

Quotation and Consistency
We have shown our model's effectiveness in the main results. Here we further examine the learned consistency and its effects on quoting. In the rest of this paper, unless otherwise specified, "our model" refers to our full model (IE+QE+NTM). For comparison, we select TAKG for its best performance in Table 2.

Interaction Consistency. To understand the positions of the turns a quote is likely to respond to, we display the turn-based attention weights (Eq. 7) over turn positions in Figure 4, together with the attention weights from TAKG (Wang et al., 2019a) for comparison. Here we use Reddit conversations for interpretation because they involve more turns (see Table 1). It is seen that TAKG can only attend to the first three turns, while we assign higher weights to turns closer to the query. In doing so, the quotes will continue senses from the later history, which fits our intuition that participants tend to interact with the latest information.
Query Consistency. We carry out a human evaluation to test the coherence between the query and the predicted quotations. 100 conversations are sampled from Weibo and two native Chinese speakers are invited to examine whether a quote carries on the query's senses ("yes") or not ("no"). Table 3 shows the count of "yes" for the ground-truth quote and for the outputs of IE ONLY and IE+QE. Interestingly, even the ground-truth quotations cannot attain over 85% "yes", probably because of the prominent misuse of quotations on social media. Nevertheless, the better performance of IE+QE compared with IE ONLY shows the usefulness of modeling query consistency for ensuring the quotation's language coherence to the query.
Topic Consistency. Here we use the example in Figure 1 to analyze the topics we learn for modeling consistency. Recall that the conversation centers around price and value and that the quote is used to argue that only fools will waste their money. We look into the top 3 latent topics (by topic mixture θ) and display their top 10 words (by likelihood) in Table 4. Words like "pay" and "stupid" appear, which might help to correctly predict "fool" and "money" in the quote.

Topic 1: game, property, child, rights, pay, guy, state, church, guys, paid
Topic 2: f**k, evidence, sh*t, guys, stupid, edit, nice, proof, dude, dumb
Topic 3: car, buy, cops, police, scrubs, gun, technology, shot, crime, energy

Table 4: The top 10 words of the 3 latent topics related to the conversation in Figure 1. Words suggesting the conversation's focus are in blue and italic.

Sensitivity to Context and Quotations
In this section, we study how varying contexts and quotations affect our performance.
The Effects of Context. Here we examine whether a longer context results in better predictions. In the following, we measure context length in terms of turn number and token number.
Turn Number. Figure 5 shows our MAP scores for quoting in Reddit conversations with varying turn numbers. Weibo results are not shown here because of the limited data with turn number > 4. Generally, more turns result in better MAP, owing to the richer information captured from turn interactions. The scores drop for turn number > 8, probably because of underfitting, and a more complex model might be needed for interaction modeling.
To further explore the model's sensitivity to turn number, we first rank the conversations by turn number and separate them into four quartiles (Q1, Q2, Q3, Q4, in order of increasing turn number). We then train and test within each quartile, and compare the results of our model and TAKG in Figure 6(a). As can be seen, our model presents a larger margin for the quartiles with larger turn numbers, indicating its ability to encode rich information from complex turn interactions.
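The quartile-based analysis can be sketched in a few lines of Python. This is our simplified illustration (`quartile_split` is an assumed name, and equal-sized cuts stand in for however the authors handled ties):

```python
def quartile_split(conversations, key=len):
    # Sort conversations by turn number and cut them into four
    # equal-sized buckets Q1..Q4; each bucket then gets its own
    # train/test split for the per-quartile comparison.
    ordered = sorted(conversations, key=key)
    n = len(ordered)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(4)]
```

Passing a different `key` (e.g., total token count) yields the token-number quartiles used in the next analysis.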
Token Number. For context length measured by token number, we follow the above steps to form train and test quartiles by token number. The results are shown in Figure 6(b), where our model consistently outperforms TAKG over conversation contexts with varying token numbers.
The Effects of Quotation. We further study our results on predicting quotations of varying frequency; the MAP scores are reported in Figure 7(a). In general, higher scores are observed for more frequent quotations, as better representations can be learned from more training data. We also notice a slower growth rate as the frequency increases. To go into more detail, we compare the growth rates with the ranking model CNN-LSTM and show the results in Figure 7(b) on Weibo (Reddit results show similar trends). In comparison, we are generally less sensitive to quotation frequency (except for very rare quotes). This likely benefits from exploiting the quotations' internal structure, whereas ranking models can be largely affected by label sparsity.

Further Discussions
Here we probe into our outputs to provide more insights to quoting in conversations.
Case Study. We first present a qualitative analysis of the example in Figure 1. To analyze what the model learns, we visualize our turn-based attention and TAKG's topic-aware attention over words in Figure 8. As can be seen, TAKG focuses more on the topic words "Scuf", "suggest", and "controller", all reflecting the global discussion focus while ignoring the query's intention. Thus, it mistakenly quotes "A penny saved is a penny earned." Instead, we attend to the query's interaction with T1 and T4, which results in the correct quotation.
Comparing with Humans. Finally, we discuss how humans perform on our task. 50 Weibo conversations were sampled and two human annotators (native Chinese speakers) were invited to quote a Chinese Chengyu in the given contexts. The two annotators gave 7 and 8 correct answers respectively, which shows the task is challenging even for humans. Our model made 13 correct predictions, exhibiting a better ability to quote in online conversations.

Conclusion
We present a novel quotation generation framework for online conversations via the modeling of topic, interaction, and query consistency. Experimental results on two newly constructed online conversation datasets, Weibo and Reddit, show that our model outperforms previous state-of-the-art models. Further discussions provide more insights on quoting in online conversations.