A Factored Neural Network Model for Characterizing Online Discussions in Vector Space

We develop a novel factored neural model that learns comment embeddings in an unsupervised way, leveraging the structure of distributional context in online discussion forums. The model links different contexts with related language factors in the embedding space, providing a way to interpret the factored embeddings. Evaluated on a community endorsement prediction task using a large collection of topic-varying Reddit discussions, the factored embeddings consistently improve over other text representations. Qualitative analysis shows that the model captures community style and topic, as well as response trigger patterns.


Introduction
Massive user-generated content on social media has drawn interest in predicting community reactions in the form of virality (Guerini et al., 2011), popularity (Suh et al., 2010; Hong et al., 2011; Lakkaraju et al., 2013; Tan et al., 2014), community endorsement (Jaech et al., 2015), persuasive impact (Althoff et al., 2014; Tan et al., 2016; Wei et al., 2016), etc. Many of these studies have analyzed content-agnostic factors such as submission timing and author social status, as well as language factors that underlie the textual content, e.g., the topic and idiosyncrasies of the community. In particular, there is an increasing amount of work on online discussion forums such as Reddit that exploits the conversational and community-centric nature of the user-generated content (Lakkaraju et al., 2013; Althoff et al., 2014; Jaech et al., 2015; Tan et al., 2016; Wei et al., 2016; He et al., 2016a), which contrasts with Twitter, where the author's social status seems to play a larger role in popularity. This paper focuses on Reddit, using the karma score as a readily available measure of community endorsement.
Some of the prior work on Reddit investigates specific linguistic phenomena (e.g., politeness, topic relevance, community style matching) using feature engineering to understand their role in predicting community reactions (Althoff et al., 2014; Jaech et al., 2015). In contrast, this paper explores methods for unsupervised text embedding learning using a model structured so as to provide some interpretability of the results when used in comment endorsement prediction. The model aims to characterize the dependence of a comment on its global context and on the subsequent responses, which is characteristic of multi-party discussions. Specifically, we propose a factored neural model with separate mechanisms for representing global context, comment content and response generation. By factoring the model, we hope unsupervised learning will pick up different components of interactive language in the resulting embeddings, which will improve prediction of community reactions.
Distributed representations of text, or text embeddings, have achieved great success in many language processing applications, using both supervised and unsupervised methods. Unsupervised learning, in particular, has been successful at different levels, including words (Mikolov et al., 2013b), sentences (Kiros et al., 2015), and documents (Deerwester et al., 1990; Le and Mikolov, 2014). Studies have also shown that the learned embeddings capture both syntactic and semantic functions of words (Mikolov et al., 2013a; Pennington et al., 2014; Levy and Goldberg, 2014; Faruqui et al., 2015a). At the same time, embeddings are often viewed as uninterpretable: it is difficult to align embedding dimensions to existing semantic or syntactic classes. This concern has triggered attempts to develop more interpretable embedding models (Faruqui et al., 2015b), which is also a goal of our work. We leverage the fact that the structure of the distributional context impacts what is learned in an unsupervised way and include multiple objectives for separating different types of context.
Here, we are interested in linking two types of context with corresponding language factors learned in the embedding space that may impact comment reception. First, conformity to the topic and the language use of the community tends to make the content better accepted (Lakkaraju et al., 2013;Tan et al., 2014;Tran and Ostendorf, 2016). Those global modes typically influence the author's generation of local content. Second, characteristics of a comment can influence the responses it triggers. Clearly, questions and statements will elicit different responses, and comments directed at a particular discussion participant may prompt that individual to respond. Of more interest here are aspects of comments that might elicit minimal response or responses with different sentiments, which are relevant for eventual endorsement.
The primary contribution of this work is the development of a factored neural model to jointly learn these aspects of multi-party discussions from a large collection of Reddit comments in an unsupervised fashion. Extending the recent neural attention model (Bahdanau et al., 2015), the proposed model can interpret the learned latent global modes as community-related topic and style. A comment-response generation model component captures aspects of the comment that are response triggers. The multi-factored comment embedding is evaluated on the task of predicting the comment endorsement for three online communities different in topic trends and writing style. The representation of textual information using our approach consistently outperforms multiple document embedding baselines, and analyses of the global modes and response trigger subvectors show that the model learns common communication strategies in discussion forums.

Model Description
To characterize different aspects of language use in a comment, the proposed model factorizes a comment embedding into two sub-vectors, i.e. a local mode vector and a content vector. The local mode vector, computed as a mixture of global mode vectors, exploits the global context of a comment. In the Reddit discussions that we use, the global mode represents the topic and language idiosyncrasies (style) of a particular subreddit. More specific information communicated in the comment is captured in the content vector. The generation process of a comment is modeled through a recurrent neural network (RNN) language model (LM) conditioned on local mode and content vectors, while the global mode vectors are jointly learned during training. Moreover, a residual learning architecture (He et al., 2016b) is used to extend the RNN LM for separating the information flow of the mode and the content vectors.
Figure 1: The structure of the full model omitting output layers, illustrating the computation of attention weights for $b_2$ and $d_3$ in a comment $w_{1:4}$ with its response $r_{1:4}$. Purple circles $a_k$ and $a_j$ represent scalars computed in (1) and (6), respectively. ⊗ and ⊕ are scaling and element-wise addition operators, respectively. Black arrowed lines are connections carrying weight matrices.
In addition to the global context, the full model further exploits direct responses to the comment in order to learn better comment embeddings. This is achieved by modeling the generation of comment responses through another RNN LM conditioned on response trigger vectors. The response trigger vectors are computed as mixtures of content vectors, with the idea that they will characterize aspects of the comment that prompt others to respond, whether that be information or framing.
The full model is illustrated in Fig. 1. While the end goal is a joint framework, the model is described in the following two sub-sections in terms of two components: i) mode vectors for capturing global context, and ii) response trigger vectors for exploiting comment responses.

Mode Vectors
Using an RNN LM shown in the upper part of Fig. 1, we model the generation process of a word sequence by predicting the next word conditioned on the global context as well as the local content. The global context is encoded in the local mode vector, computed as a mixture of global mode vectors with mixture weights inferred based on content vectors. The local mode vector indicates where the comment fits in terms of what people in this subreddit generally say. It changes dynamically with the content vector as the comment generation progresses, considering the possibility of topic shifts or different broad categories of discussion participants.
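Suppose there is a set of $K$ latent global modes with distributed representations $m_{1:K} \in \mathbb{R}^n$. For the $t$-th word $w_t$ in a sequence, a local mode vector $b_t \in \mathbb{R}^n$ is computed as
$$b_t = \sum_{k=1}^{K} a(c_t, m_k) \otimes m_k,$$
where $c_t \in \mathbb{R}^n$ is the content vector for the current partial sequence $w_{1:t}$, $\otimes$ multiplies a vector by a scalar, and the function $a(c_t, m_k)$ outputs a scalar association probability for the current content vector $c_t$ and a mode vector $m_k$. The association function $a(c, m_k)$ is defined as
$$a(c, m_k) = \frac{\exp\left(v^\top \tanh(U [c; m_k])\right)}{\sum_{i=1}^{K} \exp\left(v^\top \tanh(U [c; m_i])\right)}, \qquad (1)$$
where $[c; m_k] \in \mathbb{R}^{2n}$ is the concatenation of $c$ and $m_k$, and $U \in \mathbb{R}^{n \times 2n}$ and $v \in \mathbb{R}^n$ are parameters characterizing the similarity between $m_k$ and $c$.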
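As an illustration of the mixture computation described above, the following is a minimal sketch in plain Python. It assumes the standard additive attention score of Bahdanau et al. (2015), i.e. a softmax over $v^\top \tanh(U [c; m_k])$; the helper names (`matvec`, `association`, `local_mode`) and all toy values are hypothetical and are not taken from the authors' implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def association(c, modes, U, v):
    """Association probabilities a(c, m_k) for all K modes (softmax over k).

    Assumed scoring form: v^T tanh(U [c; m_k]).
    """
    scores = []
    for m in modes:
        hidden = [math.tanh(z) for z in matvec(U, c + m)]  # c + m concatenates the lists
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_mode(c, modes, U, v):
    """Local mode vector: association-weighted mixture of the global modes."""
    weights = association(c, modes, U, v)
    return [sum(w * m[i] for w, m in zip(weights, modes)) for i in range(len(c))]


# Toy example with n = 2 and K = 3 (all values are arbitrary).
U = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2]]
v = [1.0, -1.0]
modes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
c = [0.2, -0.1]
weights = association(c, modes, U, v)
b = local_mode(c, modes, U, v)
```

Because the weights form a probability distribution over the $K$ modes, the local mode vector always lies in the convex hull of the global mode vectors, which is what makes the modes inspectable after training.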
The computation of the association probability is the well-known attention mechanism (Bahdanau et al., 2015). However, unlike the original attention RNN model where the attended vector is concatenated with the input vector to augment the input to the recurrent layer, we adopt a residual learning approach (He et al., 2016b) to learn content vectors. For the $t$-th word $w_t$ in a sequence, the content vector $c_t$ under the original attention RNN model is computed as
$$c_t = f(W x_t + G b_{t-1},\ c_{t-1}), \qquad (2)$$
where $x_t \in \mathbb{R}^d$ is the word embedding for $w_t$, $b_{t-1} \in \mathbb{R}^n$ and $c_{t-1} \in \mathbb{R}^n$ are the previous local mode and content vectors, respectively, $W \in \mathbb{R}^{n \times d}$ and $G \in \mathbb{R}^{n \times n}$ are weight matrices transforming the input to the recurrent layer, and $f(\cdot, \cdot)$ is the recurrent layer activation function. To address the vanishing gradient issue in RNNs, we use the gated recurrent unit (Cho et al., 2014) for the RNN layer, i.e.
$$f(x, h) = (1 - u) \odot h + u \odot \tanh\left(x + R (r \odot h)\right),$$
where $\odot$ is the element-wise multiplication, $R$ is the recurrent weight matrix, and $u$ and $r$ are the update and reset gates, respectively. In this paper, we compute the content vector $c_t$ as follows:
$$c_t = f(W x_t,\ G b_{t-1} + c_{t-1}). \qquad (3)$$
Comparing (2) and (3), it can be seen that we first aggregate the local mode vector $b_{t-1}$ and the content vector $c_{t-1}$ and treat the resulting vector $G b_{t-1} + c_{t-1}$ as the memory of the recurrent layer. The resulting hidden state vectors from the recurrent layer are the content vectors $c_t$.
The use of residual learning is motivated by the following considerations. The local mode vector $b_{t-1}$ can be seen as a non-linear transformation of $c_{t-1}$ into a global mode space parameterized by $m_{1:K}$. If the global information carried in $b_{t-1}$ is redundant for generating the following word in the comment, the model only needs to exploit the information in the local content $c_{t-1}$ and can learn to zero out the local mode vector $b_{t-1}$, i.e. $G = 0$. He et al. (2016b) show that residual learning usually leads to a better-conditioned model, which promises better generalization ability. Finally, the RNN LM estimates the probability of the $(t+1)$-th word $w_{t+1}$ based on the current local mode vector $b_t$ and content vector $c_t$.
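The contrast between the update rules (2) and (3) can be sketched as follows. This is a minimal illustration, not the authors' code: the GRU gate parameterization (gates computed from the projected input plus a recurrent term, with hypothetical matrices `R_u` and `R_r`) is an assumption, as are all function names and toy values. The point of the sketch is only where the term $G b_{t-1}$ enters the cell.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def gru(x, h, R, R_u, R_r):
    """GRU-style cell f(x, h); x is the already-projected input.

    Assumption: both gates reuse the projected input x (a simplification).
    """
    u = [sigmoid(xi + zi) for xi, zi in zip(x, matvec(R_u, h))]  # update gate
    r = [sigmoid(xi + zi) for xi, zi in zip(x, matvec(R_r, h))]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(z) for z in add(x, matvec(R, gated))]
    return [(1 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h, cand)]


def content_attention_rnn(x_emb, b_prev, c_prev, W, G, cell):
    """Rule (2): the mode vector augments the input of the recurrent layer."""
    return cell(add(matvec(W, x_emb), matvec(G, b_prev)), c_prev)


def content_residual(x_emb, b_prev, c_prev, W, G, cell):
    """Rule (3): the mode vector is added to the memory of the recurrent layer."""
    return cell(matvec(W, x_emb), add(matvec(G, b_prev), c_prev))


# Toy example with d = 3 and n = 2 (all values are arbitrary).
W = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.2]]
G = [[1.0, 0.0], [0.0, 1.0]]
G_zero = [[0.0, 0.0], [0.0, 0.0]]
R = [[0.2, 0.0], [0.0, 0.2]]
R_gate = [[0.1, 0.0], [0.0, 0.1]]
cell = lambda x, h: gru(x, h, R, R_gate, R_gate)
x_emb, b_prev, c_prev = [1.0, 0.0, -1.0], [0.5, -0.5], [0.1, 0.2]

c_rule2 = content_attention_rnn(x_emb, b_prev, c_prev, W, G, cell)
c_rule3 = content_residual(x_emb, b_prev, c_prev, W, G, cell)
c_plain2 = content_attention_rnn(x_emb, b_prev, c_prev, W, G_zero, cell)
c_plain3 = content_residual(x_emb, b_prev, c_prev, W, G_zero, cell)
```

With $G = 0$ both rules reduce to the same plain recurrent update that ignores the mode vector, which is the degenerate case the residual formulation allows the model to fall back on.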

Response Trigger Vectors
Another important aspect of comments in online discussions is how other participants react to the content. In order to exploit those characteristics, we use comment-reply pairs in online discussions and build this component upon the encoderdecoder framework with the attention mechanism (Bahdanau et al., 2015), which is illustrated in the lower part of Fig. 1. The decoder is essentially another RNN LM conditioned on response trigger vectors aiming at distilling relevant parts of the comment which other people are responding to.
Let $r_j$ denote the $j$-th word in a reply to a comment $w_1, \cdots, w_T$. The decoder RNN LM computes a hidden vector $h_j \in \mathbb{R}^n$ for $r_j$ as follows,
$$h_j = f(W^\dagger x_j,\ G^\dagger d_{j-1} + h_{j-1}),$$
where $W^\dagger \in \mathbb{R}^{n \times d}$ and $G^\dagger \in \mathbb{R}^{n \times n}$ are weight matrices, $x_j$ is the word embedding of $r_j$ from the embedding dictionary shared with the encoder RNN LM in Subsection 2.1, and $d_{j-1} \in \mathbb{R}^n$ and $h_{j-1} \in \mathbb{R}^n$ are the response trigger vector and hidden vector at the previous time step, respectively. The initial hidden vector $h_0$ is set to be the last content vector $c_T$. With the comment's content vectors $c_1, \cdots, c_T$ obtained from the encoder RNN LM in Subsection 2.1, a response trigger vector $d_j$ is computed as the mixture
$$d_j = \sum_{t=1}^{T} a'(h_j, c_t) \otimes c_t, \qquad (6)$$
where $a'(h_j, c_t)$ is a function similar to $a(c_t, m_k)$ defined in (1), with different parameters. Similar to the encoder RNN LM, the decoder RNN LM estimates the probability of the $(j+1)$-th word $r_{j+1}$ in the reply based on the hidden vector $h_j$ and the response trigger vector $d_j$.
Note that the decoder RNN only aims at providing additional supervision signals for training the encoder RNN through a response generation task. At test time, we do not use the responses and therefore do not need to run the decoder RNN LM.

Model Learning
The full model is trained by maximizing the log-likelihood of the data, i.e.
$$\sum_{(w, r)} \left[ \log p(w) + \alpha \log p(r \mid w) \right],$$
where the sum runs over the comment-reply pairs $(w, r)$ in the training data, the two terms correspond to the log-likelihood of the encoder RNN LM and the decoder RNN LM, respectively, and $\alpha$ is the hyperparameter that weights the importance of the second term. In our experiments, we let $\alpha = 0.1$. During training, each comment-reply pair is used as a training sample. Considering that comments may receive a huge number of replies, we keep up to 5 replies for each comment. Due to memory limitations associated with the RNN, we use only the first 50 words of comments and the first 20 words of replies. If a comment has no reply, a special token is used. All weights are randomly initialized according to $\mathcal{N}(0, 0.01)$. The model is optimized using Adam (Kingma and Ba, 2015) with an initial learning rate of 0.01. Once the validation log-likelihood decreases for the first time, we halve the learning rate at each epoch. The training process is terminated when the validation log-likelihood decreases for the second time. In our experiments, we learn word embeddings of dimension $d = 256$ from scratch. The number of modes $K$ is set to 16. A single-layer RNN is used, with the dimension $n$ of hidden layers set to 64.
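The data preparation and stopping rule described above can be sketched as follows. This is only one possible reading of the text: the token name `NO_REPLY`, the choice of keeping the first 5 replies (rather than a random subset), and the function names are all assumptions made for illustration.

```python
NO_REPLY = "<no_reply>"  # hypothetical name for the special no-reply token


def make_pairs(comment_tokens, replies, max_replies=5,
               max_comment_len=50, max_reply_len=20):
    """Build truncated (comment, reply) training pairs for one comment.

    Assumption: the first `max_replies` replies are kept.
    """
    comment = comment_tokens[:max_comment_len]
    if not replies:
        return [(comment, [NO_REPLY])]
    return [(comment, reply[:max_reply_len]) for reply in replies[:max_replies]]


def learning_rates(val_log_likelihoods, initial_lr=0.01):
    """Learning rate used in each epoch that is actually run.

    After the first drop in validation log-likelihood the rate is halved
    every epoch; training stops at the second drop.
    """
    rates, lr, drops, prev = [], initial_lr, 0, None
    for val_ll in val_log_likelihoods:
        rates.append(lr)  # this epoch is trained at `lr`, then validated
        if prev is not None and val_ll < prev:
            drops += 1
            if drops == 2:
                break  # second decrease: terminate training
        if drops >= 1:
            lr /= 2  # halve for every subsequent epoch
        prev = val_ll
    return rates


pairs = make_pairs(["w"] * 60, [["r"] * 30] * 7)
rates = learning_rates([-5.0, -4.0, -4.2, -4.1, -4.3, -4.0])
```

In this toy run the validation log-likelihood first drops at the third epoch, so the rate is halved from the fourth epoch on, and training ends after the fifth epoch, where the second drop occurs.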

Data and Task
In this paper, we work with Reddit discussion threads, taking advantage of their conversational and community-centric nature as well as the available karma scores. Each thread starts from a post and grows with comments to the post or other comments within the thread, presented as a tree structure. Posts and comments can be voted up or down by readers depending on whether they agree or disagree with the opinion, find it amusing vs. offensive, etc. A karma score is computed as the difference between up-votes and down-votes, which has been used as a proxy of community endorsement for a Reddit comment. Three popular subreddits with different topics and styles are studied: AskWomen (814K comments), AskMen (1,057K comments), and Politics (2,180K comments). For each subreddit, we randomly split comments by threads into training, validation, and test data, with a 3:1:1 ratio.
Task: Considering the heavy-tailed Zipfian distribution of karma scores, regression with a mean squared error objective may not be informative because low-karma comments dominate the overall objective. Following prior work, we quantize comment karma scores into 8 discrete levels and design a task consisting of 7 binary classification subtasks which individually predict whether a comment's karma is at least level-$l$ for each level $l = 1, \cdots, 7$. This task is sensitive to the order of quantized karma scores, e.g., for the level-6 subtask, predicting a comment as level-5 or level-7 would lead to different evaluation results such as recall, which is not the case for a standard multi-class classification task. Additionally, compared to a standard multi-class classification task, these subtasks alleviate the unbalanced data issue, although higher levels are still more skewed.
Evaluation metric: For each level-$l$ binary classification subtask, we compute the F1 score by treating comments at levels lower than $l$ as negative samples and others as positive samples. Note that we only compute F1 scores for $l \in \{1, \ldots, 7\}$ since no comment is at a level lower than 0. The averaged F1 score is used as an indicator of the overall prediction performance.
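The evaluation metric can be sketched as below. This is a minimal illustration under stated assumptions: levels are integers 0 to 7, the binary decision for each subtask is derived by thresholding a single predicted level (a simplification), and a subtask with no true positives is scored as 0. The function names are hypothetical.

```python
def f1_at_level(true_levels, pred_levels, level):
    """F1 for the binary subtask 'karma level is at least `level`'."""
    tp = fp = fn = 0
    for t, p in zip(true_levels, pred_levels):
        t_pos, p_pos = t >= level, p >= level
        if t_pos and p_pos:
            tp += 1
        elif p_pos:
            fp += 1
        elif t_pos:
            fn += 1
    if tp == 0:
        return 0.0  # assumption: undefined F1 is scored as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(true_levels, pred_levels, num_levels=8):
    """Average of the F1 scores over the subtasks for levels 1..num_levels-1."""
    scores = [f1_at_level(true_levels, pred_levels, l)
              for l in range(1, num_levels)]
    return sum(scores) / len(scores)


perfect = averaged_f1(list(range(8)), list(range(8)))
under = f1_at_level([6], [5], level=6)  # true level 6 predicted as level 5
over = f1_at_level([6], [7], level=6)   # true level 6 predicted as level 7
```

The last two calls illustrate the order sensitivity of the task: for the level-6 subtask, predicting level 5 is a miss while predicting level 7 is a hit, even though both predictions are off by one level.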

Experiments
In this section, we evaluate the effectiveness of the factored comment embeddings on the quantized karma prediction task. We use the concatenation of the local mode vector and the content vector at the last time step as the factored comment embedding. First, we study the overall prediction performance of four different classifiers under two settings, i.e., using factored comment embeddings or not. Then we compare the factored comment embeddings inferred from the full model and its two variants with other kinds of text features using the best type of classifiers. Finally, we carry out error analysis on prediction results of the best classifiers using the factored comment embeddings.

Classifiers
The following four types of classifiers are studied:
• ShallowLR: A standard multi-class logistic regression model;
• ShallowOR: An ordinal regression model (Rennie and Srebro, 2005), which can exploit the order information of the quantized karma labels;
• DeepLR: A feed-forward neural network using the logistic regression objective;
• DeepOR: A feed-forward neural network using the ordinal regression objective.
These classifiers have different objectives and model complexities, allowing us to study the robustness of the learned comment embeddings. The factored comment embeddings are inferred from the proposed models, which are trained on the same training data but independently of these classifiers.
As baselines, we train the classifiers using only content-agnostic features, as shown in Table 1, which have strong correlations with community endorsement (Jaech et al., 2015). In our pilot work, we experimented with several groups of features from Jaech et al. (2015) to find the content-agnostic features used in our paper. Since Jaech et al. (2015) work on a different task (ranking comments in a short time window), many of their useful content-agnostic features, including k-index, do not give additional improvement over the selected configuration for the karma prediction task.
All classifiers are trained on the training data for each subreddit independently, with hyperparameters tuned on the validation data. The penultimate weights are regularized using $L_2$, and the regularization parameters are selected on the validation data. We report the prediction performance on the test data, as shown in Fig. 2. We observe that using comment embeddings consistently improves the performance of these classifiers. While ShallowOR significantly outperforms ShallowLR, indicating the usefulness of exploiting the order information in quantized karma labels, the difference is much smaller for deep classifiers. Also, deep classifiers consistently outperform their shallow counterparts.

Text Features
We compare the factored comment embeddings with the following text features:
• BoW: A sparse bag-of-words representation;
• LDA: A vector of topic probabilities inferred from topic modeling (Blei et al., 2003);
• Doc2Vec: Embeddings inferred from the paragraph vector model (Le and Mikolov, 2014).
For these models, which do not use RNNs, all words in a comment are used. We use the gensim implementations (Řehůřek and Sojka, 2010) for both LDA and Doc2Vec. For LDA, the number of topics is selected in {16, 32, 64}, and 32 works the best on the validation set for all subreddits. For Doc2Vec, we select the embedding dimension from {32, 64, 128}, and 64 works the best on the validation set for all subreddits. We train the Doc2Vec model for 20 epochs, with the learning rate initialized as 0.025 and decreased by 0.001 at each epoch.
In addition to the factored comment embeddings obtained from our full model, we study two variants of the full model: 1) a model trained without the mode vector component (Factored\M), which is a normal sequence-to-sequence attention model (Bahdanau et al., 2015), and 2) a model trained without the response trigger vector component (Factored\R). All textual representations are used together with the baseline content-agnostic features described previously.
Since the DeepLR and the DeepOR perform best across all subreddits and they have similar trends, we report results of the DeepOR in Table 2. Among all text features, the BoW has the worst averaged F1 scores and even hurts the performance for AskWomen, probably due to the data sparsity problem. Both the LDA and the Doc2Vec outperform the BoW. The Doc2Vec performs slightly better on AskMen and Politics, which might be attributed to the relatively larger training data size. The factored comment embeddings derived from the full model consistently achieve better averaged F1 scores. It can be observed that the two variants of the full model mostly lead to similar performance as the Doc2Vec, though the Factored\R embeddings usually have higher averaged F1 scores than the Factored\M embeddings. These results suggest advantages of jointly modeling the two components, which may drive the model to discover more latent factors and patterns in the data that could be useful for downstream tasks.
Figure 4: The confusion matrices for the DeepOR classifier on Politics, (a) without comment embeddings and (b) with comment embeddings. The color of the cell on the i-th row and the j-th column indicates the percentage of comments with quantized karma level i that are classified as j, and each row is normalized.

Error Analysis
In this subsection, we focus on analyzing how factored comment embeddings improve the prediction results of the DeepOR classifiers. The F1 scores for individual subtasks are shown in Fig. 3. Note that the higher the level is, the more skewed the task is, i.e. a lower positive ratio. As expected, comments with the lowest endorsement level are easier to classify. Adding comment embeddings primarily boosts the performance of the classifier on the high-endorsement tasks (level > 5, 6) and the low-endorsement tasks (level > 0, 1).
Confusion matrices for the DeepOR classifier with and without factored comment embeddings are shown in Fig. 4 for Politics. Using the additional comment embeddings leads to a higher concentration of cell weights near the diagonals, corresponding to errors that mainly confuse neighboring levels. Without any text features, the classifier seems to only distinguish four levels. We observe similar trends on AskWomen and AskMen.
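The row normalization used for such confusion matrices, and one simple way to quantify "concentration near the diagonal", can be sketched as follows. The function names and the `width` parameter are hypothetical and serve only as an illustration; the near-diagonal summary is not a measure reported in this paper.

```python
def row_normalized_confusion(true_levels, pred_levels, num_levels=8):
    """Entry [i][j] is the fraction of level-i comments classified as level j."""
    counts = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(true_levels, pred_levels):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def near_diagonal_mass(matrix, width=1):
    """Average, over non-empty rows, of the mass within `width` of the diagonal."""
    masses = []
    for i, row in enumerate(matrix):
        if sum(row) > 0:
            masses.append(sum(v for j, v in enumerate(row) if abs(i - j) <= width))
    return sum(masses) / len(masses)


cm = row_normalized_confusion([0, 0, 1, 1], [0, 1, 1, 1], num_levels=2)
mass = near_diagonal_mass(
    row_normalized_confusion([0, 0, 2, 2], [0, 2, 2, 1], num_levels=3))
```

A value of `near_diagonal_mass` close to 1 means that most errors confuse only neighboring karma levels.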

Qualitative Analysis
In this section, we conduct analysis to better understand what the factored model is learning, again using the Politics subreddit. First, we analyze latent global modes learned from the full model. For each global mode, we extract comments with top association scores. Note that the model assumes a locally coherent mixture of global modes and updates the mixture for each observed word. Thus, each comment receives a sequence of association probabilities over the global modes. The association score $\beta_k$ between a comment $w_{1:T}$ and Mode-$k$ is then computed as $\beta_k = \max_{t \in \{1, \cdots, T\}} a(c_t, m_k)$ for $k \in \{1, \cdots, K\}$, where $a(c_t, m_k)$ is defined in (1). In Table 3, we show examples from the most coherent modes out of the 16 learned modes. Some modes seem to be capturing style (modes 2, 6, and 10), while others are related to topics (modes 7 and 16). Mode-2 captures the style of starting with a rhetorical question to express negative sentiment and disagreement. Many comments in Mode-6 begin with attention-drawing words such as "bull" and "psst". Mode-10 tends to be associated with comments telling a story about a closely related person. Many comments in Mode-7 discuss low salaries, whereas Mode-16 comments discuss Republican politicians or ideology.
The characteristics of examples in modes 2 and 6 suggested that modes might have a location dependency, so we looked at the word positions with the strongest association for each mode, i.e. $\arg\max_{t \in \{1, \cdots, T\}} a(c_t, m_k)$. For each Mode-$k$, we only keep comments with an association score higher than $\mathrm{mean}(\beta_k) + \mathrm{std}(\beta_k)$. Fig. 5 shows the box plot of locations where the strongest association happens. It can be seen that modes 2 and 6 usually have the strongest association at the beginning of a comment. For modes 3, 8, 15 and 16, the strongest associations occur over a wider span in comments. In addition to the interpretability of the learned modes, comparable to what one can get from LDA, these observations suggest that our model may further capture word location effects, which may help predict community endorsement.
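The association score, the position of the strongest association, and the selection threshold used in this analysis can be sketched as follows. The input format (a list of per-word association probability rows) and the function names are assumptions for illustration; the text does not say whether the standard deviation is the population or the sample version, so the population version is assumed here.

```python
import statistics


def association_scores(assoc):
    """Per-mode score (max over word positions) and the position of that max.

    `assoc[t][k]` is the association probability of word position t with mode k.
    Positions are 0-based; ties resolve to the earliest position.
    """
    num_modes = len(assoc[0])
    scores = [max(row[k] for row in assoc) for k in range(num_modes)]
    positions = [max(range(len(assoc)), key=lambda t: assoc[t][k])
                 for k in range(num_modes)]
    return scores, positions


def select_comments(scores_for_mode):
    """Indices of comments whose score exceeds mean + std for one mode."""
    threshold = (statistics.mean(scores_for_mode)
                 + statistics.pstdev(scores_for_mode))
    return [i for i, s in enumerate(scores_for_mode) if s > threshold]


# Toy comment with T = 3 words and K = 2 modes.
assoc = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
scores, positions = association_scores(assoc)
kept = select_comments([0.1, 0.1, 0.1, 0.9])
```

Collecting `positions` over the kept comments of each mode gives the per-mode location distributions summarized by a box plot.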
Next, we analyze the response characteristics by examining the response trigger vectors associated with the onset of comment responses, which is a special start-of-reply token. These response trigger vectors are clustered into 8 classes via k-means and visualized in Fig. 6 using t-SNE (van der Maaten and Hinton, 2008). For each cluster, we study the karma distribution, as well as comments together with the first reply. Related data statistics and examples are shown in Fig. A-4 and Tables A-2 and A-3 in the supplementary materials. The horizontal dimension seems to be associated with how many replies a comment elicits. The vertical dimension is less interpretable, but most clusters have identifiable traits. The far left classes (Class-1 and Class-4) both have few replies and low karma, often in two-party exchanges, with Class-4 having more negative sentiment. Class-2 comments tend to involve compliments, whereas comments in Class-3 usually trigger a reply with a but-clause expressing contrast and disagreement. Comments in Class-5 mostly receive responses expanding on the original comments. Class-6 has a lot of sarcastic and cynical comments and replies. Comments in Class-7 are mostly anomalous since their first responses were usually [deleted]. It seems there are multiple response trigger factors in the proposed embedding model, some of which may reflect dialog acts and others sentiment, and any of which may be helpful in predicting community endorsement.

Related Work
The skip-thought vector method (Kiros et al., 2015) is most closely related to our work in terms of utilizing context for unsupervised sequence modeling under the sequence-to-sequence framework (Sutskever et al., 2014). A key difference is the context being exploited. The skip-thought vector method uses surrounding sentences by abstracting the skip-gram structure (Mikolov et al., 2013a) from word to sequence. In our model, we exploit two types of context that are unique in online discussions: 1) the global context such as community topic and style which is learned in the mode vectors, and 2) the responses to a comment modeled as the response trigger vectors. Moreover, we augment our model with the attention mechanism (Bahdanau et al., 2015) to push the model to distill the relevant information from context.
The neural attention mechanism has been used for a variety of natural language processing tasks, e.g., machine translation (Bahdanau et al., 2015), question answering (Sukhbaatar et al., 2015), constituency parsing (Vinyals et al., 2015), social media opinion mining (Yang and Eisenstein, 2017), and dependency parsing. The attention mechanism developed in this paper for exploiting global modes differs from previous work in that the global modes being attended over are latent rather than explicitly observed, and in that they are learned jointly with the full model.
Predicting community endorsement has been studied using either hand-crafted features (Jaech et al., 2015) or neural models (Zayats and Ostendorf, 2017), but all of these studies focus on supervised learning. Unsupervised learning strategies have been explored for characterizing different factors in language. A hierarchical Dirichlet process model was originally proposed for topic variations but has been extended to characterize multiple factors in (Huang and Renals, 2008). While much of the Dirichlet modeling work uses multinomial distributions, a log-linear version is introduced in (Eisenstein et al., 2011). Multi-dimensional latent factors in text are modeled by extending the sparsity-promoting topic model in (Paul and Dredze, 2012). Our model instead uses a neural network to characterize latent language factors, where the learned latent language factors could have a dependency on word positions.

Conclusion
This paper introduces a new factored neural model for unsupervised learning of comment embeddings leveraging two different types of context in online discussions. By extending the attention mechanism and using residual learning, our method is able to jointly model global context, comment content and response generation. Quantitative experiments on three different subreddits show that the factored embeddings achieve consistent improvement in predicting quantized karma scores over other standard document embedding methods. Analyses on the learned global modes show community-related style and topic characteristics are captured in our model. Also, we observe that response trigger vectors characterize certain aspects of comments that elicit different response patterns.
A potential future direction is to explore whether the comment embeddings derived from the unsupervised factored neural model can be useful across multiple tasks. Recently, a dataset with dialogue act annotations on Reddit discussions has been published and can be used for a dialogue act prediction task (Zhang et al., 2017). Identifying or ranking persuasive arguments in the ChangeMyView subreddit (as studied in Tan et al., 2016; Wei et al., 2016) and asking for favors in the RandomActsOfPizza subreddit (used in Althoff et al., 2014) are also interesting for future work.