WikiTalkEdit: A Dataset for Modeling Editors’ Behaviors on Wikipedia

This study introduces and analyzes WikiTalkEdit, a dataset of conversations and edit histories from Wikipedia, for research in online cooperation and conversation modeling. The dataset comprises dialog triplets from Wikipedia Talk pages, paired with editing actions on the corresponding articles under discussion. We show how the data supports the classic understanding of style matching, whereby positive emotion and the use of first-person pronouns predict a positive emotional change in a Wikipedia contributor; however, these features do not predict editorial behavior. On the other hand, feedback invoking evidentiality and criticism, as well as references to Wikipedia’s community norms, is more likely to persuade the contributor to perform edits but less likely to lead to a positive emotion. We develop baseline classifiers trained on pre-trained RoBERTa features that predict editorial change with an F1 score of .54, compared to an F1 score of .66 for predicting emotional change. A diagnostic analysis of persisting errors is also provided. We conclude with possible applications and recommendations for future work. The dataset is publicly available to the research community at https://github.com/kj2013/WikiTalkEdit/.


Introduction
Dialogue is a language game of influence, action, and reaction that progresses in a turn-taking manner. Persuasion occurs through dialogue when a listener favorably evaluates the authority, claims, and evidentiality of the cues and arguments made by the speaker (Krippendorff, 1993; Schulte, 1980; Durik et al., 2008).
Discussions on Wikipedia Talk pages can be useful for identifying strategies that lead to an improvement of the article under discussion, and for examining whether those strategies also lead to an amicable dialogic exchange. Previous work (Yang et al., 2016a,b, 2017) has explored the role of editors and the types of edits made on Wikipedia, but has not related them to the ongoing conversation on the Wikipedia Talk pages.

We introduce the WikiTalkEdit dataset, a novel dataset for research in online collaboration. The dataset is a subset of the Wikipedia Talk Corpus available as of May 2018.¹ It contains 12,882 dialogue triplets with labels about editors' subsequent editorial (editing) behavior, and 19,632 triplets with labels corresponding to editors' emotion as manifested in their replies. Table 1 has examples from the dataset.²

This new dataset enables various language and behavior modeling tasks. In general, it is useful for understanding linguistic coordination, online cooperation, style matching, and teamwork in online contexts. More specifically, it offers linguistic insights into the norms on Wikipedia, such as (i) identifying the feedback associated with a positive emotion vs. a positive editing action, (ii) identifying and characterizing successful editorial coordination (Lerner and Lomi, 2019), (iii) generating constructive suggestions based on a given Wikipedia edit, and (iv) identifying and resolving disagreements on Wikipedia before they go awry (Zhang et al., 2018). In this study, we examine the first research problem; that is, we demonstrate how the dataset helps to compare and contrast the linguistic strategies that evoke favorable dialogic responses with those that evoke behavioral compliance.

Related Work
Conversational quality is largely the focus of a body of work modeling the formal (Pavlick and Tetreault, 2016), polite (Niculae et al., 2015), and toxic (Zhang et al., 2018) features of comments on Wikipedia, Reddit, and other online public forums. The labels in such a task are often subjective, as they depend mostly on annotated or crowdsourced judgments. On the other hand, gauging the impact of a conversation in terms of a reader's subsequent behavior is a rather different problem. A few studies have modeled the language of arguments to predict their upvotes (Wei et al., 2016a,b; Habernal and Gurevych, 2016; Tan et al., 2016). The best result reported by Habernal and Gurevych (2016) was an F1 score of .35 for the task of predicting which of two arguments was better, using SVMs and bidirectional LSTMs. The study by Tan et al. (2016) reported an accuracy of 60% for predicting which argument was most likely to change the original poster's (OP's) point of view. Althoff et al. (2014) report an AUC of .67 on predicting the success of ∼5,700 requests. Studies predicting users' stance (Lin and Utz, 2015; Sridhar et al., 2015) have done better, but do not usually factor in the feedback from a turn-taking partner during a dialogic exchange. Furthermore, to the best of our knowledge, no study has measured the actual subsequent behavior of a conversation partner after a dialogic exchange on social media platforms, forums, or Wikipedia.
In recent years, computational linguists have developed models of dialogic text that predict the emotional responses associated with an utterance. The findings suggest that interacting speakers generally reinforce each other's point of view (Kramer et al., 2014; Rimé, 2007), use emotions to signal agreement, and mirror each other's textual cues (Niculae et al., 2015). On the other hand, predicting behavioral responses is potentially a more challenging task for text modeling and prediction, and it is also less explored in the literature.
The existing research on online turn-taking behavior has focused on modeling emotional reactions, with little interest in predicting actual behavioral change. This research is discussed in more detail in the Supplementary Materials.³ For now, we contextualize the contributions of this dataset by demonstrating how it addresses the following gaps in the scholarship:
• How well do language models trained on editorial feedback predict subsequent emotional and editorial change?
• What are the linguistic features of editorial feedback which predict emotional change in the person who initiates the discussion (henceforth, OP, for original poster)?
• What are the linguistic features of editorial feedback which predict subsequent editorial behavior by the OP?
First, we report performance at predicting emotional and editorial change from the linguistic features of the comments, using regression baselines and state-of-the-art deep learning models. Performance is evaluated as the F1 score of the predicted labels against the ground-truth labels, as implemented in scikit-learn. Then, we compare the linguistic features associated with emotional change with those associated with subsequent edits. Finally, we offer a diagnostic analysis of the prediction errors observed.
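As a minimal illustration of the evaluation metric, the binary F1 score can be computed from scratch in a few lines; this matches scikit-learn's f1_score under default binary averaging. The toy labels below are invented for the example:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# toy example: 3 true positives, 1 false positive, 1 false negative
truth = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 1, 0]
print(round(f1_score(truth, preds), 2))  # 0.75
```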

The WikiTalkEdit dataset
In this section, we describe how we collected our data from the Wikipedia Talk dataset and formulated a task around the emotional and behavioral reactions of an article's editors as they take turns in a conversation.

Data generation process
After contributing to a Wikipedia article, the OP usually updates the Talk page with a summary of the edit. At this point, the OP may get zero or more responses, and they may respond to all, some, or none of them. To study the effect of editorial feedback, we defined a complete interaction between an OP and another Editor as a dialog triplet of the form OP → Editor → OP .
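A minimal sketch of this triplet extraction, assuming each Talk-page thread is represented as an ordered list of (author, text) turns; the helper name and the toy thread below are hypothetical:

```python
def extract_triplets(turns):
    """Scan an ordered list of (author, text) turns and collect
    OP -> Editor -> OP dialog triplets: the first and third turns share
    an author, and the middle turn comes from a different editor."""
    triplets = []
    for i in range(len(turns) - 2):
        (a1, t1), (a2, t2), (a3, t3) = turns[i], turns[i + 1], turns[i + 2]
        if a1 == a3 and a1 != a2:
            triplets.append({"op": t1, "editor": t2, "op_reply": t3})
    return triplets

thread = [
    ("Alice", "I reorganized the History section."),
    ("Bob", "Please cite a source for the new dates."),
    ("Alice", "Added a citation to Smith (2004)."),
]
print(extract_triplets(thread))
```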
Our dependent variables are the OP's reaction to an Editor's comment in terms of the 'emotional change' in their language and their 'editorial change' in terms of subsequent edits to the Wikipedia article.
First, we downloaded the entire Wikipedia Talk Corpus available as of May 2018 and extracted 128,231 dialogue triplets. Next, we used the Wikimedia API to download the edits corresponding to each of the OP's comments in our dataset of triplets. In the following paragraphs, we describe how we operationalized the labels for the dataset.

Emotional change: The emotional change label is the signed Euclidean distance between the positive and negative emotions of OP′ and OP (see Figure 1). The positive and negative emotion measurements are calculated using the emotion dictionaries from the Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2007). The assigned labels were manually examined by the authors for face validity. A change over one standard deviation above the mean is coded as '1' (a positive emotional change). A change under one standard deviation below the mean is coded as '0' (a negative emotional change). All other values are marked "null", as there is no evident change in emotion.

Editorial change: The edits, if any, performed by the OP to the article in the week following the Editor's feedback are operationalized as a binary value ('1' = edit, '0' = no edit).
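The two labeling steps above can be sketched as follows, assuming each comment has already been scored for positive and negative emotion (e.g., as LIWC word proportions); the exact distance formulation is one plausible reading of the signed Euclidean distance described in the text:

```python
import math
from statistics import mean, stdev

def signed_emotion_change(op, op_reply):
    """Euclidean distance between the (positive, negative) emotion vectors of
    the OP's two turns, signed by whether net emotion (pos - neg) rose or fell."""
    dist = math.hypot(op_reply[0] - op[0], op_reply[1] - op[1])
    sign = 1 if (op_reply[0] - op_reply[1]) >= (op[0] - op[1]) else -1
    return sign * dist

def label(changes):
    """Code changes above mean + 1 SD as 1, below mean - 1 SD as 0, else None."""
    m, s = mean(changes), stdev(changes)
    return [1 if c > m + s else 0 if c < m - s else None for c in changes]
```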

Dataset Analysis
In the following sections, we analyze which types of linguistic feedback from the Editor are effective at creating a positive emotional change or an editorial action by the OP.
In preliminary data explorations, we found no correlation between the Editor's politeness or status and emotional or editorial change. We observed that Editor comments associated with positive responses (mean length = 273 characters, mean Jaccard coefficient, JC = .16) are significantly shorter and have less overlap (content interplay) with the OP's comment than those associated with negative responses (mean length = 417 characters, mean JC = .18). There was no substantial difference for editorial changes.

Predicting the response to Editor's Feedback
We examine the usefulness of different linguistic features and discourse markers in predicting emotional and editorial change in the WikiTalkEdit dataset. Our independent variables comprise the linguistic features of the Editor's feedback, and the dependent variables are the OP's emotional and editorial change after receiving the feedback. We used logistic regression as well as several deep learning implementations from the pytorch-pretrained-bert package to predict both the emotional and the editorial change of the user.

Feature extraction
We represented the Editor's feedback as a normalized frequency distribution of the following feature sets:
• General lexical features (500 features and 50 LDA topics): The most frequent unigrams, and 50 topics modeled using Latent Dirichlet Allocation in Python's MALLET package, with α = 5.
• Syntactic features (4 features): The Stanford parser was used to generate dependency parses, which were used to identify and categorize all the adjectival modifiers that occurred at least ten times in the data. We distinguished between first-, second-, and third-person pronouns. Finally, we created part-of-speech n-grams based on the dependency trees.
• Social features (2 features): We measured content interplay, the Jaccard coefficient of similarity between the unigrams of the Editor's feedback and the OP's first comment. The Editor's status may pressure the OP to conform by performing edits; therefore, we quantified the Editor's experience in terms of their number of contributions to Wikipedia.
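Content interplay, as defined above, can be sketched with a simple unigram-set Jaccard coefficient; the whitespace tokenization here is a simplifying assumption:

```python
def content_interplay(feedback, op_comment):
    """Jaccard coefficient between the unigram sets of two comments:
    |intersection| / |union|, a proxy for how much the Editor echoes
    the OP's wording."""
    a = set(feedback.lower().split())
    b = set(op_comment.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(content_interplay("please cite a reliable source",
                        "i added a source for the claim"))  # 0.2
```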

Deep learning models
We also experimented with a variety of deep learning baselines:
• CNN: The CNN framework (Kim, 2014) applies convolutional filters followed by max-over-time pooling to the word vectors of a post.
• RCNN: The RCNN framework (Lai et al., 2015) applies recurrent convolutional layers followed by max-pooling; a fully connected layer with a softmax then produces the output.
• biLSTM: The word embeddings for all words in a post are fed to a bidirectional LSTM, followed by a softmax layer for output (Yang et al., 2016c).
• biLSTM-Attention: For each sentence, convolutional and max-over-time pooling layers are applied to the embeddings of its words. The resulting sentence representations are put through a bi-LSTM with an attention mechanism (Yang et al., 2016c).
• NeuralMT: Embeddings are fed into a bidirectional GRU followed by a decoder with an attention mechanism (Bahdanau et al., 2015).
• FastText: Word representations are averaged into a sentence representation, which is, in turn, fed to a linear classifier (Joulin et al., 2017). A softmax function computes the probability distribution over the predefined classes, and a cross-entropy loss is used for tuning. Hierarchical softmax is used to speed up training.
• Transformer: The architecture implemented was based on recent previous work (Vaswani et al., 2017).
• OpenAI GPT: The Generative Pretrained Transformer implementation (Radford, 2018) with the original hyperparameter settings.
• BERT and RoBERTa: The pre-trained BERT model (Devlin et al., 2018) and the Robustly optimized BERT model (RoBERTa) (Liu et al., 2019), in which BERT is retrained with more data and an improved methodology. Models were fine-tuned using the simpletransformers library.
• XLNet: Finally, we evaluate the performance of XLNet (Yang et al., 2019), which combines bidirectional learning with autoregressive models such as Transformer-XL.
For the CNN- and biLSTM-based models, we used the hyper-parameters recommended in the original implementations.⁴ For the NeuralMT, FastText, and Transformer-based models, the original authors' implementations were used. All the models were evaluated using 5-fold cross-validation with an 80:20 split between the train and test sets. In fine-tuning the RoBERTa model on editorial change, the model parameters included a learning rate of 9e-6, 3 epochs, and a train batch size of 8. For emotional change, the model parameters included a learning rate of 1e-5, 3 epochs, and a train batch size of 8. The maximum input sequence length was 128, which covered 91% of all the inputs. Training took 6-8 minutes per epoch on a Tesla K80, running on Google Colab. Five hyperparameter search trials were conducted with cross-validation, and a manual tuning strategy was followed to identify the setting with the best performance.
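The reported fine-tuning settings can be collected into a configuration sketch. The commented-out ClassificationModel call assumes the simpletransformers API and is illustrative only, since actual training requires the dataset and a GPU:

```python
# Hyper-parameters reported for fine-tuning RoBERTa on editorial change.
editorial_args = {
    "learning_rate": 9e-6,
    "num_train_epochs": 3,
    "train_batch_size": 8,
    "max_seq_length": 128,  # covers 91% of the inputs
}

# The emotional-change run differs only in the learning rate.
emotional_args = dict(editorial_args, learning_rate=1e-5)

# Illustrative use, assuming the simpletransformers API:
# from simpletransformers.classification import ClassificationModel
# model = ClassificationModel("roberta", "roberta-base", args=editorial_args)
# model.train_model(train_df)  # train_df: columns ["text", "labels"]
```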

Results
We now examine the test-set performance of these models trained on a subset of the WikiTalkEdit dataset. The dataset for emotion analysis comprises the 15% of the overall dataset where editorial feedback yielded a substantial positive or negative change in the emotion vector (i.e., the emotional change was more than one standard deviation above or below the mean). Similarly, the dataset for editorial actions (edits performed) comprises the 10% of the conversations that started within 24 hours of an OP editing the page. A pairwise correlation found no relationship between emotional and editorial change (ρ = .01, p > .1). The dataset statistics are provided in Table 2.

Predictive performance
Baseline logistic regression models: Table 3 shows that emotional change is more straightforward to predict than editorial change, and style provides marginally better predictive performance than content. The best performance was obtained using POS n-grams, with an F1 score of .57 for predicting emotional change and .51 for predicting behavioral change. Unexpectedly, social features were not good predictors of emotional change.

Deep learning models: In comparison to the logistic regression baselines, the deep learning models in Table 4 offer a remarkable predictive advantage, especially for emotional change. The best performing deep learning classifier is trained on pre-trained RoBERTa features and reports an F1 score of .66 for emotional change and .54 for editorial change.

Table 4 (excerpt): F1 scores for predicting emotional and editorial change.

Model                                  Emotional  Editorial
NeuralMT (Bahdanau et al., 2015)          .59        .47
RCNN (Lai et al., 2015)                   .61        .36
FastText (Joulin et al., 2017)            .65        .51
Transformer (Vaswani et al., 2017)        .48        .50
OpenAI GPT (Radford, 2018)                .64        .50
BERT (Devlin et al., 2018)                .65        .52
RoBERTa (Liu et al., 2019)                .66        .54
XLNet (Yang et al., 2019)                 .65        .53

Error analysis
We examined instances of misclassification from the best logistic regression classifier and from the XLNet model (Yang et al., 2019), and diagnose the likely sources of error in this section.

False positives in emotional change prediction
We randomly selected an assortment of false positives predicted by a logistic regression classifier and by XLNet and provide them in Table 5.⁵ First, since the logistic regression methods rely heavily on stylistic features, the errors we identified seemed to occur when the style does not match the intended meaning:
• Feedback about notability and relevance: In the first example in Table 5, we see that despite the polite feedback, the conversation was not resolved positively and resulted in negative responses.
• Reverted edits: Similarly, in conversations where the OP contests their reverted edits, the dialogue appears to regularly derail into further negative replies despite the civility of the Editor's feedback.
The XLNet model did not repeat these particular errors. Its errors, instead, appear to be driven by fact-checks and questions:
• Fact-checks: When the Editor contradicts the OP with facts and personal opinions, a disagreement is sometimes implied but not obvious. The model predicts a positive emotional change, but the OP responds to the implication with a negative reaction.
• Counter-questions: When Editors asked questions of the OP, the OP would often turn defensive, even if the response included facts.

False positives in editorial change prediction

Table 6 shows the false positives in predicting editorial change. Starting with the errors from models trained on stylistic features, we observed that, in general, the errors centered on:
• Controversial topics: The errors arising from logistic classifiers reflect ideological disagreements, often involving hot-button topics such as race and ethnicity. The OP is not likely to change their mind despite what might be a well-reasoned argument from the Editor.
• Reverted edits: Dialog around why edits were reverted or content was removed usually consists of requests for greater clarity for documentation purposes, and is rarely followed up with edits to the page.
False positives in predicting editorial change by XLNet also appear to arise when the feedback is nuanced. Aside from feedback that implicitly discourages further editing, similar to what was observed in Table 5, we observed other types of feedback that lead to errors by the XLNet model:
• Opinions: Editorial feedback that uses opinions rather than facts to persuade the OP rarely appears to lead to an edit, and this was a common error among the predicted labels.
• Mixed feedback: The models also appear to get confused when the feedback quoted content from the page, and included suggestions but made no direct requests.

Linguistic insights
Based on the results in Table 3, in this section, we examine the stylistic, lexical, and topical features which best predict emotional and behavioral change. These findings offer us a way to examine whether emotional and editorial change are indeed different, and to compare the results against previous studies which have examined these problems in some capacity.

Stylistic insights
Comparing the most predictive stylistic and content features suggests that emotional and editorial change have different predictors. Table 7 summarizes the most significant predictors of emotional change based on an ordinary least squares regression analysis. Positive feedback through words related to rewards and positive emotions typically predicts a positive emotional change, as does the use of stance words (the first-person pronoun, I) and references to past experiences (past tense). This finding is in line with the literature (Zhang et al., 2018; Althoff et al., 2014). Conversely, excessive use of adjectival modifiers (e.g., comparative words or words used to emphasize quantity or impact) is associated with a negative emotional change.
The insights look very different for editorial change (Table 8). Second-person pronouns and the present tense, both of which occur in directed speech, are associated with editorial change, in sharp contrast with the features that emerged in the analysis of emotional change. In line with this, the use of words related to criticism (discrepancy) and work is also among the significant predictors of editorial change. Among the parts of speech, comments about the content (NN, NNP) appear to reduce the likelihood of an editorial change. Except for superlative modifiers, style does not seem to be relevant in this case.
These results support previous studies in showing that emotion and politeness do not always signal editorial change (Hullett, 2005; Althoff et al., 2014), as is true for stylistic markers (Durik et al., 2008), while direct requests (Burke et al., 2007), assertiveness, evidentiality (Chambliss and Garner, 1996), and other content-based features usually perform better. No feature appeared to correlate with both emotional and editorial behavior. Further lexical insights are provided in the Supplementary Materials.⁶

Insights from topics
We conducted a Benjamini-Hochberg (BH)-corrected Pearson correlation analysis of the topic features of the Editor's comments. We visualize it as a language confusion matrix, introduced in recent work (Jaidka et al., 2020), to compare the topics predictive of emotional vs. editorial change in the OP. The word clouds in Figure 2 show the correlation of LDA topics with emotional change on the X-axis and the correlation with editorial change on the Y-axis. The grey bands depict zones where the topics do not have a statistically significant correlation with either emotional or editorial change. We have distinguished the themes related to content (e.g., medicine, religion, and ethnicity) by coloring them in red. The topics in black are related to Wikipedia's content guidelines (i.e., mentions of NPOV, sources, cite, information).⁷ These themes involve the neutrality (neutral point of view, NPOV), general importance (notability), and verifiability (sources, evidence) of information. Finally, the blue topics are meta-commentary centered on the elements of a Wikipedia article (mentions of edit, page, title, section).
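The BH correction referenced above can be sketched in plain Python: given the p-values from the per-topic Pearson correlations, it marks as significant the largest prefix of sorted p-values satisfying p(k) ≤ (k/m)·α. This is a generic sketch of the procedure, not the paper's code:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean list marking which p-values are significant under
    the Benjamini-Hochberg false-discovery-rate procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # find the largest rank k (1-indexed) with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    # all hypotheses up to rank k_max are declared significant
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            significant[i] = True
    return significant

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27]))
```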
Our analysis of the WikiTalkEdit dataset suggests that mentions of Wikipedia's guidelines are associated with a negative emotional change (r ≤ -.10, p < .05) but generally lead to editorial changes (.03 < r < .08, p < .05).

Discussion and Limitations
An exploration of the WikiTalkEdit dataset suggests that strategies that elicit a positive emotional change may not affect editorial behavior. Negative responses, therefore, should not be the only yardstick for measuring the successful outcome of a conversation. Editorial changes occur when Editors use interpersonal language in talking about evidentiality and notability. However, these strategies are also associated with a negative emotional change. Despite the apparent negative feedback, referencing norms and sources is a successful strategy to prompt behavioral compliance. Mentioning community norms was more effective than the Editor's status at achieving compliance on Wikipedia; in related work, however, status was an important predictor in a similar modeling task on Reddit (Althoff et al., 2014).

Our findings are correlational, although cause and effect could be established through a rigorous research design (Zhang et al., 2018). In some cases, the measurements may be thrown off if the replies to feedback are appreciative but include some negative emotion words. Secondly, inordinately long or short feedback confounds the classifiers; we expect that improvements in accuracy can be achieved by using differential attention models that focus on the emotions expressed in the first few words of the dialogic exchanges. Finally, the latent space could be encoded with information about the type of editorial feedback (Yang et al., 2017), which would be helpful in predicting how the OP responds.

Conclusion and Future Applications
The WikiTalkEdit dataset offers insights that have important implications for understanding online disagreements and better supporting the Wikipedia community (Klein et al., 2019). We recommend the use of the WikiTalkEdit dataset to model the dynamics of consensus among multiple contributors. Scholars can also use the WikiTalkEdit dataset to address issues of quality, retention, and loyalty in online communities. For instance, the insights could shed light on how new OPs can be retained as sustaining Wikipedia contributors (Yang et al., 2017). Our exploratory analyses suggest that disagreements on Wikipedia arise over "errors": doubts that a given entry leaves no room for improvement. Yet errors serve a good-faith purpose on Wikipedia by perpetuating participation and shared collective action (Nunes, 2011). The dataset would also be useful for understanding how references are debated and interpreted as objective pieces of evidence (Luyt, 2015).