WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community

We present a corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations - including not only comments and replies, but also their modifications, deletions and restorations - this data offers an unprecedented view of online conversation. Our framework is designed to be language agnostic, and we show that it extracts high quality data in both Chinese and English. This level of detail supports new research questions pertaining to the process (and challenges) of large-scale online collaboration. We illustrate the corpus’ potential with two case studies on English Wikipedia that highlight new perspectives on earlier work. First, we explore how a person’s conversational behavior depends on how they relate to the discussion’s venue. Second, we show that community moderation of toxic behavior happens at a higher rate than previously estimated.


Introduction
Compared to large-scale collections of conversations from social media (Felbo et al., 2017; Luo et al., 2012; Zhang et al., 2017; Tan et al., 2016) or news comments (Napoles et al., 2017), Wikipedia talk pages offer a unique perspective into goal-oriented discussions between thousands of volunteer contributors coordinating to write the largest online encyclopedia. Talk page data already underpins research on social phenomena such as conversational behavior (Danescu-Niculescu-Mizil et al., 2012, 2013), disputes (Wang and Cardie, 2014b), antisocial behavior (Wulczyn et al., 2017; Zhang et al., 2018) and collaboration (Kittur et al., 2007; Halfaker et al., 2009). However, the scope of such studies has so far been limited by a view of the conversation that is incomplete in two crucial ways: first, it only captures a subset of all discussions; and second, it only accounts for the final form of each conversation, which frequently differs from the interlocutors' experience as the conversation develops.
In this paper, we undertake the challenge of reconstructing a complete and structured history of the conversational process in Wikipedia talk pages, containing detailed information about all the interlocutors' actions, such as adding, replying to, modifying or deleting comments. To this end, we devise a methodology for identifying and structuring these actions, while also addressing the challenges stemming from the inconsistent formatting and the raw scale of existing records. This results in the largest public dataset of goal-oriented conversations, WikiConv, spanning five languages. The largest component of this dataset is based on the English Wikipedia, and contains roughly 91 million conversations consisting of 212 million conversational actions taking place in 24 million talk pages.
By including details about how each conversation evolved, this corpus provides an unprecedented view into the conversational process, as experienced by the interlocutors. In fact, we find that about 40% of discussion activity would be missed by approaches that do not consider comment modifications and deletions, and even more is missed when only considering the (final) static snapshots of conversations. Furthermore, a manual review of the English Wikipedia portion of the dataset reveals that 98% of the reply structure is recovered correctly and 98% of the interlocutors' actions are categorized correctly.
Since the reconstruction pipeline does not rely on any language-specific heuristics, we also apply it to Chinese, German, Greek and Russian Wikipedia Talk page archives, in addition to those from English Wikipedia. A manual review of the conversations obtained from the Chinese Wikipedia Talk pages shows a reconstruction accuracy comparable to that obtained from the English Wikipedia, suggesting that it is reasonable to apply the reconstruction pipeline to different languages. To encourage further validation, refinements and updates, we have open sourced the code and published the datasets. 1
Finally, we present two case studies illustrating how the corpus can bring new insights into previously observed phenomena. We first analyze the conversational behavior of a subset of English Wikipedia contributors across the entire range of talk pages, and show that their levels of linguistic coordination vary according to where the conversation takes place. Second, we investigate the toxicity of deleted comments, and show that community moderation of undesired behavior takes place at a much higher rate than previously estimated.

Further Related Work
Past efforts aimed at characterizing conversations on Wikipedia talk pages have either focused on snapshots of discussion threads (Danescu-Niculescu-Mizil et al., 2012; Prabhakaran and Rambow, 2016; Wang and Cardie, 2014b,a), or have considered text segments in talk page history as incremental comments, ignoring conversational turns and reply structures within these conversations (Wulczyn et al., 2017). The limitations of these approaches can be seen in Figure 2, where, if we limit our analysis to only a snapshot of the final state of the conversation, we miss the abusive comment introduced in revision 3 and removed in revision 4, and thus miss an important part of the experience of the participants. In fact, this "hidden" activity accounts for one third of all actions taken on talk pages in English Wikipedia.
The closest dataset to our work is that of Bender et al. (2011), which introduces the Authority and Alignment in Wikipedia Discussions corpus (AAWD), containing 365 talk page discussions. While acknowledging the complexity of conversational behaviors on Wikipedia talk pages, the AAWD work falls short of providing data on the deletions and follow-up changes to existing comments. Beyond addressing this shortcoming, the dataset we introduce in this paper is many orders of magnitude larger, containing 91 million conversations in English Wikipedia alone.

Conversation Reconstruction
Technically, comments are added to Wikipedia talk pages the same way content is added to article pages: contributors simply edit the markup of any part of the talk page without relying on any functionality specialized for structuring the conversations. Figure 1 gives an example of the discussion interface and the resulting rendered conversation. Each edit results in a revision of the whole page that is permanently stored in a public historical record. 3 Because conversations on Wikipedia have no 'official' underlying structure, and instead are organized using indentation markup and other ad hoc visual cues, computational heuristics are necessary to interpret conversational structure.
Table 1: Summary statistics and reconstruction accuracy for the English and Chinese Wikipedia talk page corpora. These statistics exclude actions that result in empty content after markup cleaning (e.g., purely formatting edits).
Actions. We model the conversational structure of interactions as a graph of actions, as illustrated in Figure 2. Actions are categorized into five types:
• Creation: A conversation thread is started by adding a markup section heading (e.g., Action 1 in Figure 2).
• Addition: A new comment is added to a thread (e.g., Actions 2 and 3).
• Modification: An existing comment is modified (e.g., Action 5); the Parent-id indicates the original comment.
• Deletion: A comment or thread-heading is being removed (e.g., Action 4); Parent-id specifies the comment or thread-heading's most recent action.
• Restoration: A deletion is being reverted, returning to the state indicated by the Parent-id.
All action types except thread creations, thread deletions and thread restorations also include a ReplyTo-id indicating the target of the reply.
From Page Revisions to Actions. Our reconstruction pipeline is a Python program written for Google Cloud Dataflow (also known as Apache Beam) 4 that operates on pages in parallel and on the revisions of each page sequentially in temporal order.
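To make the action model concrete, the following is a minimal sketch of how the five action types and their Parent-id and ReplyTo-id links could be represented and replayed. The field names, the `snapshot` helper, and the example comments are hypothetical illustrations, not the schema of the released dataset.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    CREATION = "creation"
    ADDITION = "addition"
    MODIFICATION = "modification"
    DELETION = "deletion"
    RESTORATION = "restoration"


@dataclass
class Action:
    # Hypothetical field names; the released corpus may use a different schema.
    action_id: str
    action_type: ActionType
    content: str = ""
    parent_id: Optional[str] = None    # history: the earlier action this one changes
    reply_to_id: Optional[str] = None  # structure: the action this one replies to


def snapshot(actions):
    """Replay actions in order and return the content still visible at the end."""
    live = {}  # action_id -> content, in insertion order
    for a in actions:
        # Modifications and deletions supersede the action named by parent_id.
        if a.action_type in (ActionType.MODIFICATION, ActionType.DELETION):
            live.pop(a.parent_id, None)
        # Every action type except deletion leaves visible content behind.
        if a.action_type is not ActionType.DELETION:
            live[a.action_id] = a.content
    return list(live.values())


# A made-up thread: a heading, a comment, an abusive reply that is later
# deleted, and a follow-up edit of the first comment.
example = [
    Action("a1", ActionType.CREATION, "== Sources =="),
    Action("a2", ActionType.ADDITION, "Is this source reliable?", reply_to_id="a1"),
    Action("a3", ActionType.ADDITION, "You are an idiot.", reply_to_id="a2"),
    Action("a4", ActionType.DELETION, parent_id="a3"),
    Action("a5", ActionType.MODIFICATION, "Is this source reliable enough?",
           parent_id="a2", reply_to_id="a1"),
]
final_state = snapshot(example)
```

In this sketch the final snapshot contains only the heading and the edited comment; the abusive reply and its deletion are visible only in the action history, which is the kind of activity a snapshot-based corpus would miss.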
Due to the large scale of Wikipedia data, we use external sorting for pages that contain too many revisions to fit in a Dataflow worker's memory. When the number of revisions is too large for a Dataflow worker's local disk, the computation is performed in stages, a few years at a time.
Given the sorted set of a page's revisions, token-level diffs between sequential revisions are computed using a longest common subsequence (LCS) algorithm. 5 Each sequential diff is then decomposed into the set of atomic conversation actions attributed to the user who submitted the page revision. During the sequential processing of a page's revisions, two data structures are maintained: each comment's current character offset, and a list of deleted comments. The comment offsets are used to distinguish modification actions (edits within the bounds of an existing action) from additions; the deleted comments are used to identify restorations of comments.
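As an illustration of the diff step, the sketch below computes a token-level diff with a textbook dynamic-programming LCS and then uses comment spans to separate modifications from additions. It is a simplified stand-in for the pipeline's diff component, and the function names and the strict-interior boundary rule in `classify_insertion` are assumptions made for this example.

```python
def lcs_diff(old, new):
    """Return (op, token) pairs, op in {'=', '-', '+'}, from an LCS table."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start, emitting kept, deleted and inserted tokens.
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            ops.append(("=", old[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", old[i]))
            i += 1
        else:
            ops.append(("+", new[j]))
            j += 1
    ops.extend(("-", tok) for tok in old[i:])
    ops.extend(("+", tok) for tok in new[j:])
    return ops


def classify_insertion(position, comment_spans):
    """Label an insertion by whether it falls inside an existing comment.

    comment_spans is a list of (start, end) offsets of known comments; treating
    only strictly interior positions as modifications is an assumption.
    """
    for start, end in comment_spans:
        if start < position < end:
            return "modification"
    return "addition"


diff = lcs_diff("Hello world".split(), "Hello brave world".split())
```

The quadratic table is fine for short examples; revisions of long talk pages would call for a more memory-efficient diff.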
For each page, we store the 100 most recently deleted comments that are between 10 and 1000 characters long. A restoration is detected by looking up newly added content among these deleted comments in a trie. The lower bound on length keeps short, commonly added comments (like "Thanks!") from being interpreted as restorations. The upper bound ensures that occasional very long deleted comments are skipped, to bound Dataflow workers' memory usage.
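The bookkeeping for restorations can be sketched as a small bounded store. This example swaps the trie for an ordered hash map with exact-match lookup, which is simpler but not what the pipeline uses; the class name, method names, and the choice to measure length in characters are assumptions.

```python
from collections import OrderedDict


class DeletedCommentStore:
    """Keeps the most recent deleted comments of one page, within length bounds."""

    def __init__(self, max_items=100, min_len=10, max_len=1000):
        self.max_items = max_items
        self.min_len = min_len
        self.max_len = max_len
        self._items = OrderedDict()  # comment text -> id of the deletion action

    def add(self, content, action_id):
        """Record a deleted comment; return False if it is outside the bounds."""
        if not (self.min_len <= len(content) <= self.max_len):
            return False
        self._items.pop(content, None)  # re-adding refreshes recency
        self._items[content] = action_id
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the oldest deletion
        return True

    def match(self, content):
        """Return the deletion id if content restores a stored comment, else None."""
        return self._items.pop(content, None)


store = DeletedCommentStore(max_items=2)
too_short = store.add("Thanks!", "d0")  # 7 characters, below the lower bound
store.add("This claim needs a citation.", "d1")
restored = store.match("This claim needs a citation.")
```

A trie additionally allows matching while scanning inserted text token by token, which an exact-match map cannot do.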
Finally, reconstructed actions are processed using mwparserfromhell 6 to clean the MediaWiki formatting. Note that, since arbitrary page changes are allowed, some actions cannot be processed by the parser (about 1 in 200,000); in such cases, an action's raw MediaWiki markup is stored.
Table 1 shows summary statistics of the final dataset on English and Chinese Wikipedia. The raw data dumps processed were retrieved on July 1st, 2018.

Evaluation of Reconstruction Quality
We evaluate the quality of the automatic reconstruction by manually verifying a randomly drawn subset of (at least) 100 examples from each action category. For each action we verify the accuracy of (1) the assigned action type, (2) the token-level boundary of the comment, (3) the ReplyTo relation and (4) the action's Parent relation.
We conduct the evaluation for both English and Chinese data (Table 1). With over 98% of actions classified correctly in both languages, the dataset exhibits a high annotation quality given its scale and detail. Of the error cases in the English data, 10% result from limitations in the current technologies for HTML parsing and LCS matching. User behavior that we could interpret but that is not yet captured by our algorithm, such as moving ongoing conversations to another talk page, accounts for another 24%. The remaining errors were from edits that we were unable to interpret. By open sourcing the reconstruction code, we encourage further refinements.

Case Studies
We now briefly present two studies on English Wikipedia that highlight the importance of (1) collecting the full history of Wikipedia across all pages and (2) capturing the various types of interactions.
Linguistic Coordination. Danescu-Niculescu-Mizil et al. (2012) studied linguistic coordination (i.e., in a conversation between a and b, to what degree b systematically adopts a's language patterns when replying to a) on a conversational corpus derived from 5,657 User Talk pages: those associated with, and managed by, a specific user. The study showed that social status mediates the amount of linguistic coordination, with contributors imitating more the linguistic style of those with higher status in the community.
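For intuition about what a coordination score measures, the sketch below computes, for one class of marker words, how much more likely b is to use the marker in a reply when a's preceding utterance used it, relative to b's baseline rate. This conditional-minus-baseline form is our reading of the measure in the cited study and should be checked against that paper; the function name, whitespace tokenization, and toy exchanges are assumptions.

```python
def coordination(exchanges, marker_words):
    """Coordination of repliers toward speakers on one marker class.

    exchanges is a list of (utterance, reply) string pairs. Returns
    P(reply has marker | utterance has marker) - P(reply has marker),
    or None when no utterance contains the marker.
    """
    def has_marker(text):
        return any(tok in marker_words for tok in text.lower().split())

    reply_hits = [has_marker(reply) for _, reply in exchanges]
    triggered_hits = [has_marker(reply) for utt, reply in exchanges
                      if has_marker(utt)]
    if not triggered_hits:
        return None
    conditional = sum(triggered_hits) / len(triggered_hits)
    baseline = sum(reply_hits) / len(reply_hits)
    return conditional - baseline


# Toy exchanges; the marker class here is English articles.
articles = {"a", "an", "the"}
toy_exchanges = [
    ("I fixed the link", "Thanks for the fix"),
    ("Please check the source", "Will do"),
    ("Looks good", "Agreed"),
    ("Done", "Great"),
]
score = coordination(toy_exchanges, articles)
```

A positive score means replies echo the marker more often than the replier's baseline usage would predict; aggregating such scores over marker classes and users is left out of this sketch.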
We now show that the coordination pattern of the page owners in the previous dataset differs significantly based on where the conversation takes place. We compare each contributor's coordination patterns on their own user talk page to patterns exhibited on talk pages of other contributors, as well as to those on article talk pages (talk pages associated with a Wikipedia article). To avoid confounding different populations (and falling into the trap of Simpson's paradox), we only include in the comparison users that had a sufficient amount of contributions across all three venues. Figure 3 shows the three aggregated coordination values computed by applying the methodology of the original paper on 4 million addition actions that occurred before 2012.
Our results show a significant difference (p < 0.001, calculated by one-way ANOVA): contributors coordinate the least when replying on other users' talk pages, and the most on their own talk page. This leads us to a new hypothesis: contributors have a different perception of status or respect on their own page than on others. Such questions, which require more thorough investigation that depends on observing how contributors interact across different discussion venues, can be studied using the WikiConv corpus.
Moderation of toxic behavior. Wulczyn et al. (2017) measured the prevalence of personal attacks in a Wikipedia talk page corpus, and evaluated the fraction of attacks that moderators follow up on with a block or warning (17.9%). However, because there was no structured history of comment deletion, the authors were unable to measure the rate at which toxic comments are moderated through deletion. Using the more complete datasets provided by WikiConv, we show that the fraction of problematic comments moderated by Wikipedians is significantly higher than their initial estimate suggests. We used the Perspective API 7 to score the toxicity of all addition and creation actions (which we refer to as "comments" here). 8 Each comment is further classified as toxic or non-toxic according to the equal error rate threshold, following the methodology of Wulczyn et al. (2017), where false positives are offset by false negatives. The threshold is calculated on the human labels in the Kaggle Toxicity dataset of Wikipedia comments. 9 Classification at this threshold yields 86% precision and 84% recall.
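One simple reading of an equal error rate threshold is the score cutoff at which the number of false positives is closest to the number of false negatives on labeled data. The sketch below implements that reading with a brute-force sweep; the function name and the toy scores and labels are made up for illustration and are not taken from the Perspective API or the Kaggle dataset.

```python
def equal_error_threshold(scores, labels):
    """Return the cutoff t (predict toxic when score >= t) that minimizes
    the gap between false positives and false negatives."""
    best_t, best_gap = None, None
    for t in sorted(set(scores)):
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < t and y)
        gap = abs(false_pos - false_neg)
        # Keep the lowest threshold among ties.
        if best_gap is None or gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


# Toy data: model scores and human labels (1 = toxic).
toy_scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]
threshold = equal_error_threshold(toy_scores, toy_labels)
```

On the toy data the cutoff 0.6 produces one false positive (the 0.6 comment) and one false negative (the 0.4 comment), so the two error types balance. The sweep is quadratic in the number of comments; a sorted cumulative count would be preferable at corpus scale.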
We used the same method to label comments with the severe toxicity model. Figure 3 shows the fraction of comments deleted by Wikipedians who are not the author of the comment for different lengths of time, distinguishing between comments labeled as toxic, severely toxic, and the background distribution. The key observations here are that nearly 33% of toxic comments are removed within a day, and over 82% of severely toxic comments are deleted within a day. This complements results previously reported by Wulczyn et al. (2017), accounting for an additional type of community moderation that is revealed using the detailed information about the history of the conversation provided by our corpus.
7 www.perspectiveapi.com
8 We release the scores with the dataset.
9 The Jigsaw Toxicity Kaggle Competition: goo.gl/N6UGPK

Conclusion and Future Work
We introduced a pipeline that extracts the complete conversational history of Wikipedia talk pages at a level of detail that was not previously available. We applied this pipeline to Wikipedia in multiple languages and evaluated its quality on the English and Chinese Talk page corpora, obtaining a high reconstruction accuracy for both the Chinese and English datasets (98%). This level of detail and completeness opens avenues for new research, as well as for revisiting and extending existing work on online conversational and collaboration behavior. For example, while in our use cases we have focused on contributors deleting toxic comments, one could seek to understand why and when an editor is deleting or rewording their own comments. Beyond refining the heuristics and parsing methods used in our reconstruction pipeline, and reducing the time to update the corpus, a remaining challenge is to capture conversations that happen across page boundaries.

Figure 1: An example of wiki markup and its rendered form, from Wikipedia Talk Page Help. 2

Figure 2: Example conversation reconstruction. The action id in the ReplyTo column defines the conversation's structure; the Parent column indicates history, showing how actions change earlier actions. Note that each revision (color-coded) can introduce multiple actions.

Figure 3: (Left) Linguistic coordination depends on the discussion's venue. Error bars are estimated by bootstrap resampling. (Right) Deletion rate of content over varying time periods.