Thread Popularity Prediction and Tracking with a Permutation-invariant Model

The task of thread popularity prediction and tracking aims to recommend a few popular comments to subscribed users when a batch of new comments arrives in a discussion thread. This task has been formulated as a reinforcement learning problem, in which the reward of the agent is the sum of positive responses received by the recommended comments. In this work, we propose a novel approach to tackle this problem. First, we propose a deep neural network architecture to model the expected cumulative reward (Q-value) of a recommendation (action). Unlike the state-of-the-art approach, which treats an action as a sequence, our model uses an attention mechanism to integrate information from a set of comments. Thus, the prediction of the Q-value is invariant to the permutation of the comments, which leads to more consistent agent behavior. Second, we employ a greedy procedure to approximate the action that maximizes the predicted Q-value over a combinatorial action space. Unlike the state-of-the-art approach, this procedure does not require an additional pre-trained model to generate candidate actions. Experiments on five real-world datasets show that our approach outperforms the state-of-the-art.


Introduction
Online discussion forums allow people to join in-depth conversations on different topics in the form of threads. Each thread corresponds to one conversation: it is initiated by a post, and users respond to it with comments. In addition, a comment can be further replied to by another comment, forming a discussion tree. Users who are interested in a particular thread can subscribe to it. After subscribing, they receive a notification whenever a new comment arrives in that thread. However, content is generated at a breakneck pace in well-known discussion forums. For instance, more than 900 million comments were posted on Reddit in 2017 (Reddit, 2017). Hence, merely pushing every new comment to the subscribers leads to a poor user experience. Motivated by this issue, He et al. (2016c) proposed the task of thread popularity prediction and tracking. When N new comments arrive in a thread, the system performs one step of recommendation by pushing K comments to the subscribers. We want to maximize the sum of the popularities of the recommended comments over all recommendation steps. The popularity of a comment is measured by the number of positive reactions it receives, e.g., its rating. Under the assumption that a user needs to know the prior context in order to understand a comment, the system can only recommend new comments that are in the subtrees of previously recommended comments. Thus, the selection of comments at the current recommendation step affects the comments that can be chosen in future recommendation steps.
To incorporate the long-term consequences of recommendations, the task of thread popularity prediction and tracking has been formulated as a reinforcement learning problem, in which an agent selects an action (a set of K comments) according to its current state (the previously recommended comments), with the goal of maximizing the cumulative reward (the total popularity of the recommended comments over all recommendation steps). The optimal action of the agent at each step is the action that maximizes the Q-function, Q(s, a), which denotes the long-term reward of choosing action a in state s. In practice, we learn this Q-function using a parametric function, Q(s, a; θ), where θ is the model parameter vector. Thus, the predicted optimal action of the agent is the action that maximizes Q(s, a; θ).
This reinforcement learning problem has two main challenges. First, we need to develop a parametric model, Q(s, a; θ), to approximate the Q-function. Second, finding the action that maximizes Q(s, a; θ) requires the prediction of all (N choose K) possible actions, which is intractable. Thus, we need a procedure to approximate the predicted optimal action from a combinatorial action space.
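To make the gap concrete, the following toy calculation (an illustration, not part of the authors' system) counts the Q-value evaluations needed to score every possible action exhaustively, versus the candidate evaluations used by a greedy procedure that builds the action one comment at a time, for the N = 10, K = 3 setting used later in the experiments.

```python
from math import comb

# Q-value evaluations needed to score every possible action (all size-K
# subsets of N comments) versus the candidate evaluations used by a
# greedy procedure that adds one comment per iteration.
N, K = 10, 3
exhaustive = comb(N, K)                 # all (N choose K) subsets
greedy = sum(N - i for i in range(K))   # N + (N-1) + ... + (N-K+1)

print(exhaustive)  # 120
print(greedy)      # 27
```

Even at this modest scale, the exhaustive count is more than four times the greedy count, and the gap grows combinatorially with N and K.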
To address the first challenge, He et al. (2016c) proposed a neural network model, DRRN-BiLSTM, to approximate the Q-function. In this model, a bi-directional long short-term memory (LSTM) (Graves and Schmidhuber, 2005) is used to encode the set of K comments in an action. To address the second challenge, they proposed the two-stage Q-learning procedure to approximate the predicted optimal action. In this procedure, the agent uses a pre-trained, less sophisticated model to rank all possible actions, and then uses the DRRN-BiLSTM to re-rank the top-M actions and select the best one. However, this approach has two limitations. First, a bi-directional LSTM is a sequence model, which treats the set of K comments in an action as a sequence. Although the authors tried to mitigate this problem by feeding randomly permuted comments to the model, different permutations of the same set of comments still lead to different Q-value predictions. Thus, the agent may not consistently select the predicted optimal action. Second, the two-stage Q-learning procedure requires an additional pre-trained model to generate candidate actions.
Our work addresses these two limitations as follows. We propose a novel neural network model, DRRN-Attention, to approximate the Q-function. In our model, we use an attention mechanism (Bahdanau et al., 2014) to integrate the information from a set of comments into an action embedding vector. In a nutshell, the attention mechanism outputs a weighted sum of the comment representations, where the weights are learned by a subnetwork to indicate the importance of each comment. Thus, the action embedding is invariant to the permutation of the comments, which leads to a permutation-invariant Q-value prediction. Next, we employ a greedy procedure to approximate the action that maximizes Q(s, a; θ). This procedure only requires the prediction of O(NK) actions, which is significantly fewer than (N choose K). Moreover, it does not require an additional pre-trained model to generate candidate actions.
In our experiments, we evaluate the performance of our DRRN-Attention model and the greedy approximation procedure against the baselines on five real-world datasets. Experimental results demonstrate that our approach beats the baselines on four of the datasets and achieves a competitive performance on one of the datasets. Furthermore, we analyze the performance of our approach across four action sizes (K = 2, 3, 4, 5). Our approach consistently achieves a higher cumulative reward than the baselines across all these action sizes.
We summarize our contributions as follows: (1) a new neural network architecture to model the Q-value of the agent that is invariant to the permutation of sub-actions; (2) a greedy procedure for the agent to select an action from the combinatorial action space without an additional pre-trained model; and (3) new state-of-the-art performance on five real-world datasets.

Reinforcement Learning in Text-based Tasks
Reinforcement learning has been widely applied in various text-based tasks. There are several articles in the literature studying the task of mapping instruction manuals to sequences of commands, such as game commands (Branavan et al., 2011), software commands (Branavan et al., 2010), and navigation directions (Vogel and Jurafsky, 2010).
In the task of text-based games, an agent selects a textual command from a set of feasible commands at every time step. Narasimhan et al. (2016) considered the special case in which all textual commands have a fixed structure, while He et al. (2016b) and later work considered the case in which all commands are free text.
In the task of thread popularity prediction and tracking, the agent selects a set of K comments from N available comments at every time step, where each comment is free text. He et al. (2016c) proposed two different approaches to tackle this task. In their first approach, the agent uses the Deep Reinforcement Relevance Network (DRRN) (He et al., 2016b) to model the Q-function of selecting a comment. In their second approach, the agent uses the DRRN-BiLSTM (He et al., 2016c) to model the Q-function of an action. To deal with the combinatorial action space, the agent uses uniform sampling to generate a set of M candidate actions. To improve on this random sampling scheme, they proposed the two-stage Q-learning procedure in later work, which uses a pre-trained model to generate the M candidate actions. Their experimental results showed that DRRN-BiLSTM with the two-stage Q-learning procedure outperforms all other existing methods. The difference between our model and DRRN-BiLSTM is that we use attention to encode a set of comments rather than a bi-directional LSTM. Besides, the greedy procedure in our approach does not require any extra pre-trained model. Later work also considered a special case in which the agent can access an external knowledge source to augment the state representation. That setting is orthogonal to this work, since we focus on the action encoding and the approximation of the predicted optimal action.
One line of research focuses on the integration of the sequence-to-sequence (SEQ2SEQ) model (Sutskever et al., 2014) and the reinforcement learning framework, with examples including dialogue generation (Dhingra et al., 2017; Su et al., 2016), question answering (Buck et al., 2017), and machine translation (He et al., 2016a). In these tasks, the agent selects an action by generating free text with a SEQ2SEQ model.

Deep Learning on Sets
Most deep learning models on sets employ attention to integrate information from a set of inputs. This idea was first introduced in the read-process-and-write network (Vinyals et al., 2016), which uses a process module to perform multiple steps of attention over a set of vectors to obtain a permutation-invariant embedding. Our work adapts this idea to aggregate a set of comment embedding vectors. In the domain of graph learning, several models (Sukhbaatar et al., 2015; Zhang et al., 2017) learn an embedding of a node by attending over its neighboring nodes. All of the above models can be interpreted as special cases of memory networks (Sukhbaatar et al., 2015; Zhang et al., 2017), if we view the set of feature vectors as an external memory. Max-pooling is another promising technique for learning on sets. Qi et al. (2017) used max-pooling to aggregate the feature vectors of a set of 3D geometry points. Recently, Zaheer et al. (2017) derived the necessary and sufficient conditions for a neural network layer to be permutation invariant.

Popularity Prediction
Another related line of research is the popularity prediction problem in a supervised learning setting. Yano and Smith (2010) used the LDA topic model (Blei et al., 2003) to predict the number of comments on a post in a political blog. Several studies have focused on predicting the number of reshares on Facebook (Cheng et al., 2014) and the number of retweets on Twitter based on the text content (Tan et al., 2014; Hong et al., 2011). Recently, Cheng et al. (2017) proposed a neural network model that learns comment embeddings for the task of community endorsement prediction in a supervised learning setting.

Discussion Tree
A discussion thread in an online forum can be represented as a tree. Each node in the tree stores a free text. The root node represents the post of the thread, and each non-root node represents a comment in the thread. There is a directed edge from node u to node v if and only if the comment (or post) u is replied to by comment v. This tree keeps growing as new comments are submitted to the thread.

Problem Definition
The task of thread popularity prediction and tracking is formally defined as a reinforcement learning problem. We use M_t to denote the set of comments that are being tracked at time t. Given a discussion thread, we start an episode as follows. First, we initialize M_1 to be the post of the thread. Then, at each time step t, the agent performs the following operations:
• Read the current state s_t, which consists of all the previously tracked comments {M_1, ..., M_t}.
• Read N new comments, c_t = {c_t^1, ..., c_t^N}, in the subtrees of the comments in M_t.
• Select a set of K comments, a_t ⊆ c_t, to recommend, and receive the reward r_{t+1} = Σ_{c_t^i ∈ a_t} η_{c_t^i}, where η_{c_t^i} is the number of positive reactions received by comment c_t^i.
• Track the set of recommended comments in the next time step, M_{t+1} = a_t.
The episode terminates when no more new comments appear in the subtrees of the comments in M_t. The goal of the agent is to maximize the cumulative reward.
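The interaction loop above can be sketched as follows. The Comment and RandomAgent classes and the run_episode function are illustrative stand-ins, not the authors' simulator; the agent here simply picks comments uniformly at random.

```python
import random
from dataclasses import dataclass, field

# Toy stand-ins for the thread and the agent (illustrative only).
@dataclass
class Comment:
    text: str
    karma: int                       # number of positive reactions, eta_c
    children: list = field(default_factory=list)

class RandomAgent:
    def select(self, tracked, candidates, K):
        # Pick K comments uniformly at random from the new comments.
        return random.sample(candidates, min(K, len(candidates)))

def run_episode(post, agent, K):
    tracked = [post]                 # M_1: start by tracking the post
    total_reward = 0
    while True:
        # New comments appearing in the subtrees of the tracked comments.
        candidates = [c for m in tracked for c in m.children]
        if not candidates:
            break                    # no more new comments: episode ends
        action = agent.select(tracked, candidates, K)
        total_reward += sum(c.karma for c in action)   # popularity reward
        tracked = action             # M_{t+1} = a_t
    return total_reward
```

Note how the last line constrains future choices: only comments under the currently tracked set are candidates at the next step, which is exactly why the selection has long-term consequences.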

Q-function
The state-action value function (or Q-function), Q(s, a), is defined as the expected cumulative reward starting from state s and taking action a.
More formally, Q(s, a) = E[ Σ_{l=0}^{∞} γ^l r_{t+1+l} | s_t = s, a_t = a ], where γ ∈ (0, 1] is a discount factor for future rewards. Since the goal of the agent is to maximize the cumulative reward, the optimal action in each state is the action that achieves the highest Q-value. Thus, the Q-function is associated with an optimal policy: in every state, the agent selects the action that maximizes the Q-function, i.e., a_t = argmax_a Q(s_t, a) for all t. Since this Q-function is unknown to the agent, we approximate the Q-function using a parametric model, Q(s, a; θ), and update the parameters θ using the received rewards.
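A minimal sketch of the discounted return inside the expectation, assuming a finite episode and the discount factor γ = 0.9 used later in the experiments:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum_{l=0}^{L-1} gamma^l * r_{t+1+l}: the quantity whose expectation
    defines Q(s_t, a_t) when starting from s_t and taking a_t."""
    return sum((gamma ** l) * r for l, r in enumerate(rewards))

# Rewards 10, 5, 2 over three steps: 10 + 0.9*5 + 0.81*2, i.e. about 16.12
print(discounted_return([10, 5, 2]))
```

Rewards received sooner weigh more than rewards received later, so the agent trades off immediate popularity against the comments it will be able to recommend in the future.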

Exploration-Exploitation Trade-off
The agent needs to balance the exploration-exploitation trade-off when selecting an action. On one hand, the agent can choose the action with the highest estimated Q-value to exploit its current knowledge of the Q-function. On the other hand, the agent can choose a non-greedy action to gather more information about the Q-values of other actions. The balance between exploration and exploitation can be achieved with the ε-greedy policy, in which the agent selects a random action with probability ε, and selects a greedy action with probability 1 − ε. Note that the term "greedy" in the ε-greedy policy means that the agent selects the action that is predicted to be optimal, i.e., selects a_t = argmax_a Q(s_t, a; θ). It does not refer to the greedy procedure, which is used to approximate the predicted optimal action in a combinatorial action space.
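The ε-greedy selection rule can be sketched as follows; q_values and actions are hypothetical placeholders for the model's predictions over a set of candidate actions.

```python
import random

def epsilon_greedy(q_values, actions, eps=0.1, rng=random):
    """With probability eps pick a uniformly random action (explore);
    otherwise pick the action with the highest predicted Q-value (exploit)."""
    if rng.random() < eps:
        return rng.choice(actions)
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]

# With eps = 0 the policy is purely greedy:
print(epsilon_greedy([1.0, 3.0, 2.0], ["a", "b", "c"], eps=0.0))  # b
```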

DRRN-Attention Model
In this work, we propose a new deep neural network model, named DRRN-Attention, to approximate the Q-function for the task of thread popularity prediction and tracking. The input to our model is a state, s_t, and an action, a_t = {c_t^1, ..., c_t^K}, as defined in Section 3.2. The output is the predicted Q-value, Q(s_t, a_t; θ) ∈ R. Figure 1 illustrates the overall architecture of DRRN-Attention. We divide our model into the following three modules.

Text Representation Module
The text representation module reads s_t and the comments {c_t^1, ..., c_t^K} in a_t, and converts each of them into a bag-of-words representation, b_{s_t} for the state and b_{c_t^i} for each comment. Then, we use a 2-layer feedforward neural network to embed b_{s_t} into a d-dimensional state embedding vector, m_{s_t} ∈ R^d. After that, we use another 2-layer feedforward neural network to embed b_{c_t^i} into a d-dimensional comment embedding vector, m_{c_t^i} ∈ R^d, for i = 1, ..., K. This module outputs m_{s_t} and {m_{c_t^1}, ..., m_{c_t^K}}.
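A minimal sketch of this module, assuming a toy 4-word dictionary and random (untrained) weights in place of the learned feedforward layers; the real model uses the 5,000-word dictionary and d = 16 described in the implementation details.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"what": 0, "is": 1, "dark": 2, "matter": 3}   # toy dictionary
d = 16                                                  # embedding size

def bag_of_words(text):
    # Count occurrences of in-vocabulary words; out-of-vocab words are dropped.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1
    return v

# 2-layer feedforward embedding; weights here are random placeholders,
# whereas in the model they are learned end-to-end by Q-learning.
W1 = rng.normal(size=(d, len(vocab)))
W2 = rng.normal(size=(d, d))

def embed(text):
    h = np.tanh(W1 @ bag_of_words(text))
    return np.tanh(W2 @ h)

m = embed("What is dark matter")
print(m.shape)  # (16,)
```

The same machinery produces both the state embedding m_{s_t} and the comment embeddings m_{c_t^i}, though with separate weight matrices for state and comments.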

Set Embedding Module
The input to this module is a set of d-dimensional comment embeddings, {m_{c_t^1}, ..., m_{c_t^K}}. The output is an action embedding vector, m_{a_t} ∈ R^{h+d}, which is invariant to the ordering of the comment embeddings. The module consists of a single-layer LSTM with hidden size h, and a shared attention mechanism, f : R^h × R^d → R. The initial hidden state of the LSTM, q_0 ∈ R^h, is a trainable vector. Inspired by Vinyals et al. (2016), we perform L steps of computation over the comment embedding vectors. More specifically, at each computation step l = 0, 1, ..., L − 1:
• The query vector, q_l ∈ R^h, is the current hidden state of the LSTM.
• Apply the attention mechanism to compute an attention coefficient, e_{i,l}, between the query, q_l, and each comment embedding, m_{c_t^i}, for i = 1, ..., K. In general, this framework is agnostic to the underlying attention mechanism. In this work, we closely follow the attentional setup of Bahdanau et al. (2014):

e_{i,l} = v^T tanh(W_e m_{c_t^i} + U_e q_l), (1)
where v ∈ R^{h'}, W_e ∈ R^{h'×d}, and U_e ∈ R^{h'×h}, with h' the hidden size of the attention mechanism.
• Apply the softmax function to normalize the attention coefficients,

α_{i,l} = exp(e_{i,l}) / Σ_{j=1}^{K} exp(e_{j,l}). (2)

• Use the normalized attention coefficients to compute a weighted sum of the comment embedding vectors as the readout of this computation step,

r_l = Σ_{i=1}^{K} α_{i,l} m_{c_t^i}. (3)

• The LSTM takes q_l and r_l as input and computes the next hidden state,

q_{l+1} = LSTM(q_l, r_l). (4)

Note that swapping any two comment embedding vectors, m_{c_t^i} and m_{c_t^j}, affects neither the query vector q_l nor the attention readout r_l. After L steps of computation, this module concatenates q_L and r_L to yield the final output action embedding, m_{a_t} = [q_L, r_L] ∈ R^{h+d}.
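The computation steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the LSTM state update of equation (4) is replaced by a single dense layer for brevity, and all weights are random rather than learned. The final check illustrates the permutation-invariance property.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 3                          # toy comment / query dimensions
v = rng.normal(size=h)               # attention vector (h' = h here)
W_e = rng.normal(size=(h, d))
U_e = rng.normal(size=(h, h))
W_q = rng.normal(size=(h, h + d))    # stand-in for the LSTM state update

def set_embed(comments, L=2):
    q = np.zeros(h)                  # q_0 (a trainable vector in the model)
    for _ in range(L):
        # (1) attention coefficients between query q and each comment
        e = np.array([v @ np.tanh(W_e @ m + U_e @ q) for m in comments])
        alpha = np.exp(e) / np.exp(e).sum()              # (2) softmax
        r = sum(a * m for a, m in zip(alpha, comments))  # (3) weighted readout
        q = np.tanh(W_q @ np.concatenate([q, r]))        # (4), simplified
    return np.concatenate([q, r])    # m_{a_t} = [q_L, r_L]

comments = [rng.normal(size=d) for _ in range(3)]
out1 = set_embed(comments)
out2 = set_embed(comments[::-1])     # same comments, reversed order
print(np.allclose(out1, out2))       # True: embedding is permutation invariant
```

Permuting the comments permutes the coefficients e and α identically, so the weighted sum r, and hence the query update, is unchanged.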

Output Module
The input to this module is the state embedding vector, m_{s_t} ∈ R^d, and the action embedding vector, m_{a_t} ∈ R^{h+d}. We simply concatenate m_{s_t} and m_{a_t} and pass the result through a fully-connected layer. The output is Q(s_t, a_t; θ) ∈ R, the prediction of Q(s_t, a_t).

Greedy Procedure
The next challenge that we need to address is to approximate the predicted optimal action in a combinatorial action space. Finding the predicted optimal action, argmax_a Q(s_t, a; θ), is intractable, since it requires the prediction of all (N choose K) actions. In this work, we use a greedy procedure to compute an approximation. The complete procedure is shown in Algorithm 1. We start from an empty action, a_t = ∅, and then iteratively add to a_t the comment that leads to the largest increase in Q(s_t, a_t; θ), until |a_t| = K. The procedure consists of K iterations. In each iteration i = 0, ..., K − 1, we need to predict the Q-value of N − i actions. In total, the procedure only requires the prediction of O(NK) actions, which is tractable. The advantage of this procedure over existing methods is that it does not require another pre-trained model to generate candidate actions.
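The greedy procedure can be sketched as follows; q_fn is a placeholder for the trained model Q(s, a; θ).

```python
def greedy_action(q_fn, state, comments, K):
    """Greedily build an action of size K: in each of K iterations, add the
    remaining comment that yields the highest predicted Q-value for the
    partially built action (q_fn stands in for Q(s, a; theta))."""
    action, remaining = [], list(comments)
    for _ in range(K):
        best = max(remaining, key=lambda c: q_fn(state, action + [c]))
        action.append(best)
        remaining.remove(best)
    return action
```

For example, with a toy Q-function that simply sums comment scores, greedy_action picks the K highest-scoring comments; with a learned, non-additive Q-function the chosen set can differ from the top-K individual comments, which is the point of scoring whole partial actions.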

Parameter Learning
We use the deep Q-learning algorithm (Mnih et al., 2015), a variant of the traditional Q-learning algorithm (Watkins and Dayan, 1992), to learn the model parameters of Q(s, a; θ) from the received rewards. The complete training procedure is shown in Algorithm 2 in the Appendix. The network parameters θ are first initialized arbitrarily. At each time step t, the agent selects an action a_t according to the ε-greedy policy, receives a reward r_{t+1}, and transits to the next state s_{t+1}. This yields a transition tuple, ζ_t = (s_t, a_t, r_{t+1}, s_{t+1}). Instead of using the current transition tuple, ζ_t, to update the parameters, we first store ζ_t in an experience memory, D. This experience memory has a limited capacity, |D|, and the stored transition tuples are overwritten in a first-in-first-out manner. Then, we sample mini-batches of transition tuples (s, a, r, s') from D uniformly at random. Using the sampled transition tuples, we perform a step of stochastic gradient descent to minimize the following loss function,

L(θ) = E_{(s, a, r, s') ~ D} [ (y − Q(s, a; θ))^2 ], (5)

where y = r + γ max_{a'} Q(s', a'; θ^−) is the Q-learning target and θ^− are the network parameters used to compute the Q-learning target. We update θ^− to match the network parameters θ after every F time steps, where F is a hyperparameter.
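A minimal sketch of the experience memory and the Q-learning target; q_target_fn and the state/action values are hypothetical placeholders, and in the actual algorithm the maximization over a' is itself approximated by the greedy procedure.

```python
import random
from collections import deque

class ReplayMemory:
    """Experience memory D with fixed capacity and first-in-first-out
    overwriting of old transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest tuples dropped first
    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))
    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def q_learning_target(r, s_next, next_actions, q_target_fn, gamma=0.9):
    """y = r + gamma * max_{a'} Q(s', a'; theta^-), where q_target_fn uses
    the frozen parameters theta^- that are synced to theta every F steps."""
    if not next_actions:                       # terminal state: no bootstrap
        return r
    return r + gamma * max(q_target_fn(s_next, a) for a in next_actions)
```

The frozen target parameters θ^− keep the regression target y from shifting at every gradient step, which stabilizes training.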

Experimental Setup
In the experiments, we analyze the performance of different neural network models and different approximation procedures. First, their performance is evaluated on five real-world datasets with a fixed action size K. Then, we evaluate their performance with different action sizes on one dataset. For each experimental setting, we perform the following comparative analysis:
• Compare the performance of our DRRN-Attention model with the baseline models under different approximation procedures.
• Compare the performance of the greedy procedure with the baseline approximation procedures using different neural network models.
• Find the approach (combination of neural network model and approximation procedure) that achieves the best performance.
Finally, we conduct a case study to better illustrate the difference between our DRRN-Attention model and the DRRN-BiLSTM baseline.

Datasets
All the experiments are conducted on discussion thread data from the Reddit discussion forum. On Reddit, threads are grouped into different categories, called subreddits, according to their discussion themes. Registered users can give up-votes or down-votes to a comment; these votes are then aggregated to compute a karma score for the comment. We use the karma score as the reward for recommending that comment. Using the post IDs provided by He et al. (2016c), we crawl five datasets from five different subreddits: askscience, askmen, todayilearned, worldnews, and nfl. These subreddits cover a wide range of discussion topics and language styles. The basic statistics of the datasets are presented in Table 1. Since some of the posts and comments have been deleted from Reddit, we remove all the deleted posts and comments from the datasets. Thus, the statistics of our datasets differ from those in He et al. (2016c). For each dataset, we use the simulator provided by He et al. (2016c) to partition 90% of the data into a training set and 10% into a testing set.

Evaluation
The evaluation metric is the cumulative reward per episode, averaged over 1,000 episodes (He et al., 2016c). For each setting, we evaluate an agent as follows. First, we train the agent on the training set using Algorithm 2 in the Appendix for 3,500 episodes. Then, we test the agent on the testing set for 1,000 episodes, choosing every action according to the ε-greedy policy; during testing, the agent cannot use the received rewards to update the model parameters. We repeat the testing five times and report the mean and the standard deviation of the evaluation metric. Throughout training and testing, we fix ε = 0.1.

Baselines
Our DRRN-Attention model is compared with two baselines. The first is DRRN-BiLSTM, the current state-of-the-art model for approximating the Q-value in this task (He et al., 2016c). The second is obtained by replacing the bi-directional LSTM in DRRN-BiLSTM with a mean operator; we call this new model DRRN-Mean. In addition, we compare the greedy procedure with two baseline approximation procedures. The first is the random sampling procedure of He et al. (2016c). The second is the two-stage Q-learning procedure proposed in later work, which is the state-of-the-art approximation procedure for this task.

Implementation Details
In preprocessing, we remove all punctuations and lowercase all alphabetic characters. To construct the bag-of-words representations, we use the dictionary provided by He et al. (2016c). This dictionary contains the most frequent 5,000 words in the data. All model parameters are initialized by a uniform distribution within the interval [−0.1, 0.1].
In our DRRN-Attention model, we set the comment embedding size d to 16, the hidden size h of the LSTM to 16, the hidden size h' of the attention mechanism to 16, and the number of attention steps L to 2. In the text representation module of DRRN-Attention, each fully-connected layer has a hidden size of 16. In the baseline models, we set the hidden size of the bi-directional LSTM to 20 and the comment embedding size to 20. The text representation module of the baselines has two layers, each with a hidden size of 20. For the deep Q-learning algorithm, we set F = 1000. We update the model parameters using stochastic gradient descent with RMSprop (Tieleman and Hinton, 2012). We set different initial learning rates for different datasets (askscience: 0.00001; askmen: 0.00008; todayilearned, worldnews, and nfl: 0.00002). The mini-batch size is 100. All the above hyperparameters are tuned by five-fold cross-validation. We also found that the model performance tuned by five-fold cross-validation is similar to that tuned on the testing set. We set the remaining hyperparameters according to previous work. The memory size |D| is set to 10,000. The discount factor γ is set to 0.9. The candidate size M of the baseline approximation procedures is set to 10.
Experimental Results

Agent Performances on Various Datasets
In this section, the performance of the different neural network models and approximation procedures is evaluated on five datasets with N = 10 and K = 3. The results are shown in Table 2. We first analyze the performance of our DRRN-Attention model under each approximation procedure. With the random sampling procedure, our model achieves a higher cumulative reward than the baseline models across all datasets. With the two-stage Q-learning procedure or the greedy procedure, our model outperforms the baselines on four of the datasets; its performance is competitive with the baselines on the remaining dataset. Next, we analyze the performance of the greedy procedure with each neural network model. When using the DRRN-BiLSTM model or the DRRN-Mean model to parameterize the Q-function, the greedy procedure achieves a higher cumulative reward than the other two baseline procedures across all datasets. When using the DRRN-Attention model, the greedy procedure outperforms the baseline procedures on four of the datasets. To sum up, using the DRRN-Attention model with the greedy procedure outperforms all the baselines on four of the datasets.

Agent Performances on Various Action Sizes
We evaluate all the neural network models and approximation procedures across various action sizes, K = 2, 3, 4, 5, with fixed N = 10 on the askscience dataset. The results are presented in Table 3. Our DRRN-Attention model outperforms the baselines across all the action sizes from K = 2 to K = 5 with the random sampling procedure or the greedy procedure. With the two-stage Q-learning procedure, our model achieves a higher cumulative reward than the baseline models when K = 2, 3, 4. Then, we analyze the performance of the greedy procedure with each neural network model. When using the DRRN-BiLSTM to parameterize the Q-function, the greedy procedure outperforms the baseline procedures when K = 3, 4, 5. When using our DRRN-Attention, the greedy procedure achieves a higher cumulative reward than the baselines when K = 2, 3, 5. Overall, using the DRRN-Attention model with the greedy procedure achieves the best performance across all the action sizes from K = 2 to K = 5.

Case Study

Table 4 presents an example of the Q-value predictions for one state and three sub-actions on the askscience dataset. In this study, we enumerate every permutation of these three sub-actions; e.g., (1, 3, 2) denotes the permutation that places comment (1) in the first position, comment (3) in the second position, and comment (2) in the third position. When we use the DRRN-BiLSTM model, different permutations of the same sub-actions lead to different Q-value predictions. The Q-value prediction for the permutation (1, 3, 2) is almost triple that for the permutation (2, 3, 1). On the other hand, no permutation of the comments changes the Q-value prediction when we use our DRRN-Attention model. This example demonstrates that the ordering of the comments can significantly affect the predicted Q-value when the DRRN-BiLSTM model is used.

Discussions
As mentioned in Section 7.1, the datasets that we use have fewer comments than the datasets used by previous work. Table 5 compares the number of comments in the datasets used by He et al. (2016c) and by us. The experiments in later work used three of the datasets (askscience, askmen, and todayilearned) from He et al. (2016c). We compare our results with the previously reported results for the DRRN-BiLSTM + two-stage Q-learning baseline on these three datasets in Table 6. On the askscience and todayilearned datasets, the results of our implementation are worse than the previously reported results.
Since the number of comments in our askscience dataset is only half of theirs, the results of our implementation on the askscience dataset are significantly worse than the reported results. On the askmen dataset, the number of comments that we use is slightly smaller than in the askmen dataset they used. However, the results of our implementation on askmen are better than the reported results. We suspect that the deleted comments may have low karma scores, which would cause the agent to achieve a higher cumulative reward.

Conclusion
In this work, we propose a new approach to the task of thread popularity prediction and tracking. In our approach, we propose a new neural network architecture, DRRN-Attention, to approximate the Q-function, which respects the permutation invariance of the comments in an action. Moreover, our approach employs a greedy procedure to approximate the predicted optimal action, which does not require an additional pre-trained model to generate candidate actions. Empirical studies on real data demonstrate that our approach beats the current state-of-the-art in most of the experimental settings.