Modeling the Relationship between User Comments and Edits in Document Revision

Management of collaborative documents can be difficult, given the profusion of edits and comments that multiple authors make during a document’s evolution. Reliably modeling the relationship between edits and comments is a crucial step towards helping the user keep track of a document in flux. A number of authoring tasks, such as categorizing and summarizing edits, detecting completed to-dos, and visually rearranging comments could benefit from such a contribution. Thus, in this paper we explore the relationship between comments and edits by defining two novel, related tasks: Comment Ranking and Edit Anchoring. We begin by collecting a dataset with more than half a million comment-edit pairs based on Wikipedia revision histories. We then propose a hierarchical multi-layer deep neural-network to model the relationship between edits and comments. Our architecture tackles both Comment Ranking and Edit Anchoring tasks by encoding specific edit actions such as additions and deletions, while also accounting for document context. In a number of evaluation settings, our experimental results show that our approach outperforms several strong baselines significantly. We are able to achieve a precision@1 of 71.0% and a precision@3 of 94.4% for Comment Ranking, while we achieve 74.4% accuracy on Edit Anchoring.


Introduction
Comments are widely used in collaborative document writing as a natural way to suggest and track changes, annotate content and explain the intent of edits. Table 1 shows an example of a comment left by an editor in a Wikipedia revision. The comment explains the intent of adding a missing researcher's name (Sutskever) to a citation. This example demonstrates that user comments closely relate to the intent of an edit, the actual edit operation, and the content and location in the document that underwent change. However, during collaborative document authoring, the tracking and maintenance of comments becomes increasingly challenging due to the large number of edits and comments that authors make. For example, structurally refactoring a document can significantly change the order of paragraphs and sentences, stranding comments in confusing and contextually inappropriate locations. Alternatively, comments may have already been addressed but continue to linger in the document without having been marked as completed. These issues, among others, are exacerbated when multiple authors collaborate on a document, often asynchronously. It becomes difficult to know which tasks have been completed and which have not, especially if authors are not proactive about marking comments as addressed. This situation stands to benefit from an intelligent system capable of marking changes as completed, by understanding the relationship between edits and comments.

* The work was done while these authors interned at Microsoft Research.
Many tasks in document management, such as summarizing and categorizing edits, detecting to-do item completion, prioritizing writing tasks and visually re-arranging document structure to reflect state in multi-author scenarios, require users to understand edit-comment interactions. Consider the scenario where multiple authors make edits to a document using the track-change feature available in popular document authoring apps. The result can be extremely confusing, and it is difficult to disentangle who edited what over multiple versions. A feature that could summarize these edits, so that the visual burden of tracking changes is not on the UI, would certainly alleviate this problem. Such a system would necessarily first need to learn mappings between edits and natural language comments, before it could learn to generate them. Therefore, automatic solutions to these challenges stand to benefit from an ability to fundamentally model the relationship between edits and comments.
Yet, most existing studies on document revisions focus on modeling the document edits (Bronner and Monz, 2012) or comments (Shaver, 2016) separately, or use comments as a supplementary source of information to study the edits (Yatskar et al., 2010). To the best of our knowledge, no prior work focuses on jointly modeling the relationship between comments and edits.
Thus, in this paper we tackle two novel tasks. Comment Ranking considers a document edit operation and seeks to rank a set of candidate comments in order of their relevance to the edit. Edit Anchoring seeks to identify the locations in a document that are most likely to have undergone change as the result of a specific comment. Crucially, both tasks require jointly modeling the comment-edit relationship.
We start by collecting a dataset of 780K Wikipedia revisions, each with its associated comment and edits. We then build a hierarchical multi-layer deep neural network, a model we call CmntEdit, which is capable of learning the relationship between comments and edits from this data. Our approach addresses both Comment Ranking and Edit Anchoring by sharing many of the model's components and parameters across tasks. Since edits can apply to discontiguous sequences of text, which pose a challenge for sequence modeling approaches, we explore novel ways to represent a document both before and after an edit, while also accounting for contextual information surrounding an edit. To differentiate the context from edit words, we also explore a novel mechanism to encode edit operations such as additions and deletions explicitly in the model (footnote 1). Finally, we evaluate our model on both tasks and in a number of experimental settings, demonstrating that our solution is significantly better than several strong baselines on jointly capturing the relationship between edits and comments. Our model outperforms the best baseline by 34.6% on NDCG for Comment Ranking and achieves a best score of 0.687 F1 for Edit Anchoring. Additionally, in an ablation study we demonstrate that our various modeling choices, which tackle the inherent challenges of comment-edit understanding, each contribute positively to empirical results.

1 We are making the code for the CmntEdit model, and generation of our dataset, publicly available to the research community at https://github.com/microsoft/WikiCommentEdit.

Pre-edit Version
In October 2012, a similar system by Krizhevsky and Hinton won the large-scale ImageNet competition by a significant margin over shallow...

Post-edit Version
In October 2012, a similar system by Krizhevsky and Sutskever and Hinton won the large-scale ImageNet competition by a significant margin over shallow...

Table 1: Example of an edit and its associated comment. The added words "Sutskever and" in the post-edit version are marked in red.

Related Work
Document revisions have been the subject of several studies in recent years (Nunes et al., 2011; Fischer, 2013). Most prior work focuses on modeling document edits only. For instance, Bronner and Monz (2012) build a classifier to distinguish fluency edits from factual edits. Zhu et al. (2017) study the semantic distance between the content in different versions of documents to detect document revisions. Grossman et al. (2013) propose a hierarchical navigation method to display document revision histories.
Some work utilizes comments associated with document edits as supplementary information to study the document revisions. For example, Yatskar et al. (2010) consider both comment and document revision for lexical simplification. However, they use comments as meta-data to identify trusted revisions, rather than directly modeling the relationship between comments and edits. Yang et al. (2017) featurize both comments and revisions to classify edit intent, but without explicitly modeling the edit-comment relationship.
Wikipedia revision history data (Nunes et al., 2008) has been used in many NLP tasks (Zesch, 2012; Max and Wisniewski, 2010; Ganter and Strube, 2009). For instance, Yamangil and Nelken (2008) model Wikipedia revision histories for improving sentence compression, Aji et al. (2010) propose a new term weighting model leveraging Wikipedia revision histories, and Zanzotto and Pennacchiotti (2010) expand textual entailment corpora from Wikipedia revision histories using co-training. Again, however, none of these methods directly consider or model the relationship between comments and edits.
At a basic level, modeling the connection between comments and edits can be seen as a text matching problem, with superficial similarity to other common NLP tasks, such as Question Answering (Seo et al., 2016; Yu et al., 2018), document search (Burges et al., 2005; Nalisnick et al., 2016), and textual entailment (Androutsopoulos and Malakasiotis, 2010), among others. Note, however, that edits are a (possibly discontinuous and non-trivial) delta between two versions of a text, making their representation and understanding more challenging than that of a simple string. We demonstrate this in our evaluation in Section 5.2, where we compare against several competing models that were designed for other text matching challenges.

Dataset
Our dataset, which we call WikiCmnt, is generated from Wikipedia revision histories. In the absence of publicly available document data, Wikipedia is a particularly rich resource: (i) It maintains full revision histories of every Wikipedia page, along with associated editor comments as meta-data. (ii) It is a large-scale instance of multi-author collaboration, with many editors contributing to and maintaining pages.
The specific historical dump we use is from May 1, 2018. It contains approximately 52.7 million pages, and 755.5 million unique revisions made by 300.8 million users. WikiCmnt is a subsample of 786,866 Wikipedia page revisions along with their associated metadata. Revisions are filtered out before sampling if they violate any one of the following criteria: (i) The length of the comment is longer than 8 words. (ii) The edits made to the Wikipedia page span more than one section (footnote 2). (iii) The Wikipedia page has an edit history containing fewer than 10 unique revisions.
We extract and store a number of data fields from the Wikipedia revision history, as summarized in Table 2. For each specific revision of a page, we not only retrieve the text of the comment and edit but also sample 10 non-related comments (Neg-Cmnts) and 10 non-related edits (Neg-Edits) from the same page's history. Finally, we also encode the individual edit operations in both pre-edit and post-edit versions of a text. For example, consider Figure 1, which shows the edit action encoding associated with a change from "chicken wing" to "salmon".

2 https://en.wikipedia.org/wiki/Help:Section

Figure 1: Edit action encoding for the edit changing the phrase "chicken wing" to "salmon". Specifically, the values -1, 1 and 0 are used to represent deletions, additions and unchanged tokens, respectively.
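The action encoding described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses Python's `difflib.SequenceMatcher` to align the two versions; the function name `encode_actions` and the example sentence are hypothetical and are not taken from the released code.

```python
from difflib import SequenceMatcher


def encode_actions(pre_tokens, post_tokens):
    """Return (pre_actions, post_actions): -1 = deleted, 1 = added, 0 = unchanged."""
    pre_actions = [0] * len(pre_tokens)
    post_actions = [0] * len(post_tokens)
    matcher = SequenceMatcher(a=pre_tokens, b=post_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Tokens that disappear from the pre-edit version are deletions.
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                pre_actions[i] = -1
        # Tokens that appear only in the post-edit version are additions.
        if tag in ("insert", "replace"):
            for j in range(j1, j2):
                post_actions[j] = 1
    return pre_actions, post_actions


# Hypothetical sentence built around the "chicken wing" -> "salmon" edit.
pre = "she cooked chicken wing for dinner".split()
post = "she cooked salmon for dinner".split()
pre_actions, post_actions = encode_actions(pre, post)
print(pre_actions)   # [0, 0, -1, -1, 0, 0]
print(post_actions)  # [0, 0, 1, 0, 0]
```

A replacement is treated here as a deletion in the pre-edit version plus an addition in the post-edit version, which matches the three-valued encoding of the text.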

Proposed Model
To model the relationship between edits and comments, we first formulate the problem with respect to the two tasks of Comment Ranking and Edit Anchoring; we then provide details of the neural architecture used to solve these problems; finally we provide some implementation details. We begin with preliminary notation. Let us define $c$ as a comment consisting of word tokens $\{w_1, \dots, w_q\}$. Minimally, let us also define a pre-edit $e_s$ as the contiguous sequence of words spanning the first to the last edited token (with possibly intervening tokens that are not edited) in a document's pre-edit version. The edit $e_s$ may optionally also contain some surrounding context. Formally this can be defined as:

$$e_s = \{w_{i-k}, \dots, w_i, \dots, w_{i+p}, \dots, w_{i+p+k}\}$$

where $i$ and $i + p$ are the indices of the first and the last edited tokens in a revision, and $k$ is the context window size. If there is more than one contiguous block of edited tokens, these blocks are concatenated to form the set of edit words with their context words.
We define a document edit as the pair $e = \{e_s, e_t\}$, where $e_t$ is similarly defined over edited tokens and their context words in a document's post-edit version.
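As a rough illustration of the minimal definition above, the following sketch extracts the span from the first to the last edited token and pads it with a context window of size k. It assumes per-token action encodings in which any non-zero value marks an edited token; the helper name `edit_span_with_context` is hypothetical, and the variant that concatenates several separate blocks is not covered.

```python
def edit_span_with_context(tokens, actions, k):
    """Tokens from the first to the last edited position, padded by k context tokens."""
    edited = [idx for idx, action in enumerate(actions) if action != 0]
    if not edited:
        return []  # nothing was edited in this version
    start = max(0, edited[0] - k)                # clip at the document start
    end = min(len(tokens), edited[-1] + k + 1)   # clip at the document end
    return tokens[start:end]


tokens = "she cooked chicken wing for dinner".split()
actions = [0, 0, -1, -1, 0, 0]
print(edit_span_with_context(tokens, actions, k=1))
# ['cooked', 'chicken', 'wing', 'for']
```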

Problem Formulation
Comment Ranking is the task of finding the most relevant comment among a list of potential candidates, given a document edit. The inputs of the Comment Ranking task are some set of user comments $C = \{c_1, c_2, \dots, c_m\}$ pertaining to a document, and an edit $e = \{e_s, e_t\}$ in the same document. The goal of the task is to produce a ranking on $C$, such that the true comment $c_i$ with respect to the edit $e = \{e_s, e_t\}$ is ranked above all the other comments (i.e. distractors). We use standard ranking metrics to evaluate model performance: Precision@K (P@K), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

Edit Anchoring is the task of finding the sentences in a document that most likely underwent change, given a specific user comment. The inputs to the task are a user comment $c$ and a list of candidate sentences $S = \{s_1, s_2, \dots, s_n\}$ in the post-edit document $e_t$. Unlike with Comment Ranking, we operate under the assumption that an edit has already been completed, and therefore discard the information from the pre-edit version $e_s$. In the ground truth, at least one (but possibly more) of the sentences is an edit location for the comment $c$. The expected output is a list of binary classifications $R = \{r_i\}_{i=1}^{n}$, where $r_i = 1$ indicates that the sentence $s_i$ is a likely edit location, given comment $c$. We use Accuracy and F1 score to evaluate performance on this task.

Model Overview
For both tasks, our models are hierarchical deep neural networks with four layers: an input embedding layer, a contextual embedding layer, a comment-edit attention layer, and an output layer. The overall architecture is shown in Figure 2. We describe each of the four layers in what follows. Since both models share many components, we describe the more general case that covers all inputs for Comment Ranking; for Edit Anchoring, the pre-edit inputs are discarded, as noted in the problem formulation.

Input Embedding Layer. The input embedding layer maps each word in the user comment $c$ and the edit $e = \{e_s, e_t\}$ to a high-dimensional vector space. The outputs of the input embedding layer are three matrices: $U \in \mathbb{R}^{d \times M}$ representing the pre-edit document, $V \in \mathbb{R}^{d \times M}$ representing the post-edit document, and $Q \in \mathbb{R}^{d \times J}$ representing the comment. Here $M$ is the length of edits and $J$ is the length of the comment, while $d$ is the fixed-length dimension of word vectors.

[Figure 2: overall model architecture; panel title: "Input of Comment Ranking"]
Contextual Embedding Layer. We use a bidirectional Gated Recurrent Unit (GRU) (Chung et al., 2014) to model the sequential interaction between words. Operating over the output of the previous layer, we obtain the contextual embedding matrices $U^c \in \mathbb{R}^{2d \times M}$ and $V^c \in \mathbb{R}^{2d \times M}$ for the pre- and post-edit versions, respectively. Also, we obtain $Q^c \in \mathbb{R}^{2d \times J}$ from the comment word vectors.
Note that the row dimension of the contextual matrices $U^c$, $V^c$ and $Q^c$ is $2d$ because of the concatenation of the GRU's outputs in the forward and backward directions.
Comment-Edit Attention Layer. Inspired by the attention mechanisms utilized in machine comprehension (Seo et al., 2016; Yu et al., 2018), the comment-edit attention layer is designed to capture relationships between document edit and comment words. The attention layer maintains and processes the pre- and post-edit documents separately. This is to reduce the information loss that would have occurred if their representations were fused before the attention layer. Additionally, this layer incorporates an action encoding vector, which is designed to reflect the three kinds of edit operations: adding a word, deleting a word, or leaving it unchanged.
The inputs to the layer are the contextual matrices $U^c$ and $V^c$ of the pre- and post-edit documents respectively, the matrix $Q^c$ representing the comment, and the supplemental action encoding vectors $a^\dagger, a^\ddagger \in \mathbb{Z}^M$, which encode the edit operation each token undergoes in the pre- and post-edit documents, respectively. The output is the comment-aware concatenated vector representation of the edit words in both pre- and post-edit documents.

Internally, we first calculate the shared similarity matrix $S^\dagger \in \mathbb{R}^{M \times J}$ between the comment $Q^c$ and the contextual matrix $U^c$ of the pre-edit document, while also accounting for the action encoding vector $a^\dagger$. The elements of this shared similarity matrix are defined as follows:

$$S^\dagger_{ij} = G(U^c_{:i}, Q^c_{:j}, a^\dagger_i)$$

where $G$ is a trainable function that generates the similarity between the word-level representations of comments and edits with respect to an edit operation.
Here $U^c_{:i} \in \mathbb{R}^{2d \times 1}$ is the vector representation of the $i$-th word in the pre-edit document and $Q^c_{:j} \in \mathbb{R}^{2d \times 1}$ is the vector representation of the $j$-th word in the comment. $a^\dagger_i \in \{-1, 0, 1\}$ is the action encoding for the edit operation performed on the $i$-th word in the pre-edit document. We choose the trainable function $G(u, q, a) = w^T [u \otimes q; a]$, where $w \in \mathbb{R}^{(2d+1) \times 1}$ is a trainable weight vector, $\otimes$ is the element-wise multiplication operator and $[\,;\,]$ represents vector concatenation across the row dimension. Similarly, we can calculate the shared similarity matrix $S^\ddagger \in \mathbb{R}^{M \times J}$ between the comment and the contextual matrix of the post-edit document as $S^\ddagger_{ij} = G(V^c_{:i}, Q^c_{:j}, a^\ddagger_i)$. Note that the weight vectors in function $G$ are different for the pre- and post-edit versions, and both are trained simultaneously in the model.
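To make the shapes concrete, here is a small sketch of the similarity function G and the resulting similarity matrix, written with plain Python lists rather than tensors. The function names, the toy vectors and the fixed weight vector are assumptions for illustration only; in the model the weights are learned and the word vectors come from the contextual embedding layer.

```python
def similarity(u, q, a, w):
    """G(u, q, a) = w^T [u * q ; a], where u * q is the element-wise product."""
    assert len(u) == len(q) and len(w) == len(u) + 1
    features = [ui * qi for ui, qi in zip(u, q)] + [float(a)]
    return sum(wi * fi for wi, fi in zip(w, features))


def similarity_matrix(U, Q, actions, w):
    """S[i][j] = G(U[i], Q[j], actions[i]); U and Q hold one word vector per entry."""
    return [[similarity(u, q, a, w) for q in Q] for u, a in zip(U, actions)]


# Toy example: M = 2 edit words, J = 2 comment words, vectors of size 2.
U = [[1.0, 2.0], [0.5, -1.0]]   # edit word vectors
Q = [[1.0, 0.0], [2.0, 1.0]]    # comment word vectors
actions = [0, 1]                # second edit word was added
w = [1.0, 1.0, 0.5]             # last weight applies to the action encoding
S = similarity_matrix(U, Q, actions, w)
print(S)  # [[1.0, 4.0], [1.0, 0.5]]
```

Because the action value is concatenated as one extra feature, a single additional weight controls how much an addition or deletion shifts the similarity of that edit word to every comment word.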
After the similarity matrices are computed, we use them to generate the Comment-to-Edit Attention (C2E) weights. C2E is used to represent the relevance of words in the edit to those that appear in the comment. This is critical for modeling the relationship between comments and edits.
We obtain the C2E attention weights $c^\dagger \in \mathbb{R}^M$ for edit words in the pre-edit document by taking $c^\dagger = \mathrm{softmax}(\max_{col}(S^\dagger))$, where the $\max_{col}(\cdot)$ operator finds the column-wise maximum value from a matrix. Similarly, for the post-edit document, we have $c^\ddagger = \mathrm{softmax}(\max_{col}(S^\ddagger))$.
Finally, we multiply the attention vectors with the previously computed similarity matrices for both pre- and post-edit documents, and concatenate the results to obtain the relevance vector $h \in \mathbb{R}^{2J}$:

$$h = \left[ (c^\dagger)^T S^\dagger ; (c^\ddagger)^T S^\ddagger \right]^T \qquad (2)$$

The vector $h$ intuitively captures the weighted sum of the relevance of the comment with respect to the edits in both pre- and post-edit documents.
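The attention step can be sketched in a few lines of plain Python. The sketch assumes each similarity matrix is stored as M rows (edit words) by J columns (comment words), so that the maximum is taken over the comment words of each row; this is one reading of the column-wise maximum that is consistent with the stated size of the attention vector (one weight per edit word). All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def c2e_attention(S):
    """One weight per edit word: softmax over each edit word's best comment match."""
    return softmax([max(row) for row in S])


def relevance_vector(S_pre, S_post):
    """Concatenate the attention-weighted sums of S_pre and S_post (length 2J)."""
    h = []
    for S in (S_pre, S_post):
        c = c2e_attention(S)
        num_comment_words = len(S[0])
        h.extend(
            sum(c[i] * S[i][j] for i in range(len(S)))
            for j in range(num_comment_words)
        )
    return h


S_pre = [[1.0, 4.0], [1.0, 0.5]]
S_post = [[0.0, 0.0], [0.0, 0.0]]
h = relevance_vector(S_pre, S_post)
print(len(h))  # 4
```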
Output Layer and Loss Function. The output layer and loss function of the network are task-specific. Comment Ranking requires ranking the relevance of candidate comments given a document edit. Broadly, we obtain a ranked list by computing the relevance score between comments and edits with the output $r = \beta^T h$, where $\beta$ is a trainable weight vector.
A data sample $i$ in Comment Ranking consists of one true comment-edit pair and $n_i$ negative comment-edit distractors. We denote $r_i^+$ as the relevance score of the true pair, and $r_{ij}^-$ as the relevance score of the $j$-th distractor pair (with $1 \le j \le n_i$). The goal of our task is to make $r_i^+ > r_{ij}^-$ for all $j$. We therefore set the loss function to be a pairwise hinge loss between true and distractor relevance scores:

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{n_i} \max\left(0, 1 - r_i^+ + r_{ij}^-\right)$$

where $\Theta$ is the set of all trainable weights in the model and $N$ is the number of training samples in the dataset.

For Edit Anchoring, the goal is to predict whether a sentence in the document is likely to be the location of an edit, given a comment. This is a binary classification problem, and we can suitably set the output layer as $p = \mathrm{softmax}(\gamma^T h)$, the probability of predicting the positive class. Here $\gamma$ is a trainable weight vector.
Given the binary nature of the classification problem, we can use the cross entropy loss:

$$L(\Phi) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m_i} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \right]$$

where $\Phi$ is the set of all trainable weights in the model, $N$ is the number of data samples, $m_i$ is the number of sentences in the $i$-th data sample, $p_{ij}$ is the predicted probability of the positive class for the $j$-th sentence in the $i$-th data sample, and $y_{ij}$ is the corresponding ground truth label.
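The two training objectives can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the hinge margin of 1.0, the averaging over samples, and the clipping constant `eps` are choices made here for the example, and the function names are hypothetical.

```python
import math


def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Average over samples of sum_j max(0, margin - r_pos + r_neg_j).

    pos_scores[i] is the score of the true pair in sample i;
    neg_scores[i] is the list of distractor scores for that sample.
    The margin value is an assumption of this sketch.
    """
    total = 0.0
    for r_pos, negs in zip(pos_scores, neg_scores):
        total += sum(max(0.0, margin - r_pos + r_neg) for r_neg in negs)
    return total / len(pos_scores)


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average over samples of the summed per-sentence binary cross entropy."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# One ranking sample: true pair scored 2.0, distractors scored 0.0 and 1.5.
print(pairwise_hinge_loss([2.0], [[0.0, 1.5]]))  # 0.5
```

Only the second distractor contributes to the hinge loss in this example, because the first one is already separated from the true pair by more than the margin.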

Implementation Details
The CmntEdit model described in this section is implemented using the PyTorch framework and trained on a single Tesla P100 GPU with 16GB memory. The word vectors are initialized with pre-trained GloVe embeddings (Pennington et al., 2014) using the default dimensionality of 100. We set the number of training epochs to 5, the maximum length of a comment to 30 tokens and the maximum length of an edit to 300 tokens. For the Comment Ranking task, we set the batch size to 10 and consider 5 candidate comments in each data sample: one true comment and 4 distractors. We thus have 5 comment-edit pairs for each data sample and 50 pairwise samples for each training batch. For the Edit Anchoring task, we also set the batch size to 10 and consider 5 candidate sentences, which yields an identical 50 training instances per batch. It should be noted that while we train our model with only 5 candidate comments or sentences (for Comment Ranking or Edit Anchoring respectively), the models, once trained, can be applied to any number of candidates at test time for either task.

Experiment
In this section, we show evaluation results of our model on the previously collected Wikipedia dataset described in Section 3. We begin by introducing the experimental settings in Section 5.1. We then compare the performance achieved by the proposed method against several baseline models in Section 5.2.

Experimental Setup
We begin by introducing the evaluation setting, metrics and baselines we use in our experiments.

Datasets and Labels
We use the WikiCmnt dataset described in Section 3 for training and evaluation. The dataset contains 786,886 data samples in total. We reserve 70% of the data for training and split the remainder between 10% for validation and 20% for testing. For the Comment Ranking task, we always have a single true comment, but in our test we experiment with both 4 and 9 distractors, yielding a total of 5 and 10 candidate comments. Similarly for the Edit Anchoring task, we also test with both 5 and 10 candidate sentences.

Evaluation Metrics
We use common ranking evaluation metrics from the literature to evaluate models on the task of Comment Ranking. These include: (i) Precision@K. The proportion of predicted instances where the true comment appears in the ranked top-K result, with K = 1, 3, 5. (ii) Mean Reciprocal Rank (MRR). The average multiplicative inverse of the rank of the correct answer, represented mathematically as $\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i}$, where $N$ is the number of samples and $\mathrm{rank}_i$ is the rank assigned to the true comment by a model. (iii) Normalized Discounted Cumulative Gain (NDCG). The normalized gain of each comment based on its ranking position in the results. We set the relevance score of the true comment to one and those of the distractors to zero. For the Edit Anchoring task, we use Accuracy and F1 Score as evaluation metrics. These are computed based on a model's classification of candidate sentences in the post-edit version of the document.
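The three ranking metrics reduce to simple formulas when each sample has exactly one relevant comment, as is the case here. The sketch below takes, for every test sample, the 1-based rank that a model assigned to the true comment; the function names are hypothetical. With a single relevant item of relevance one, the ideal DCG is 1, so NDCG for a sample is 1 / log2(rank + 1).

```python
import math


def precision_at_k(ranks, k):
    """Fraction of samples whose true comment is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the true comment."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def ndcg_single_relevant(ranks):
    """Mean NDCG when exactly one candidate per sample has relevance 1."""
    return sum(1.0 / math.log2(r + 1) for r in ranks) / len(ranks)


ranks = [1, 2, 4]  # rank of the true comment in three hypothetical samples
print(precision_at_k(ranks, 1))     # one of three samples ranked first
print(mean_reciprocal_rank(ranks))  # (1 + 1/2 + 1/4) / 3
print(ndcg_single_relevant(ranks))
```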

Baseline Methods
For the Comment Ranking task, we compare our approach to the following baselines: (i) Cosine-TfIdf uses the cosine similarity between the TfIdf-weighted vectors of the comment and edit as a measure of relevance. (ii) Cosine-InferSent computes the cosine similarity between comment and edit vectors generated by the state-of-the-art sentence embedding method InferSent (Conneau et al., 2017). (iii) RankNet (Burges et al., 2005) minimizes the number of inversions in ranking by optimizing a pairwise loss function. We use a previous neural network implementation with default settings. (iv) LambdaMART (Burges, 2010) leverages gradient boosted decision trees with a cost function derived from LambdaRank (Burges et al., 2007) for solving a ranking task. We use an existing Python implementation, with a learning rate of 0.02 and 10 max leaf nodes. (v) Gated Recurrent Neural Network (Chung et al., 2014) models the sequence of words in comments and edits using pre-trained GloVe vectors as embedding units. Three fully-connected layers compute a final relevance score between comments and edits.

Table 3: Performance on Comment Ranking

For the Edit Anchoring task, we chose the following popular classifiers as baselines: (i) Random Forest (Liaw et al., 2002), (ii) Adaboost (Rätsch et al., 2001), (iii) Passive Aggressive classifiers (Crammer et al., 2006), and (iv) Gated Recurrent Neural Network. The features used in these models are based on both TfIdf scores and sentence embedding features generated by InferSent. The Gated RNN model is trained with a task-specific fully connected layer for Edit Anchoring to compute the classification probability.
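To give a feel for the simplest Comment Ranking baseline, the sketch below ranks candidate comments by the cosine similarity of TF-IDF vectors. The exact weighting used in the experiments is not specified in the text, so the smoothed IDF formula, the choice to fit the statistics on the edit plus its candidates, and all function names are assumptions of this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a list of token lists, with smoothed IDF."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [{tok: cnt * idf[tok] for tok, cnt in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(tok, 0.0) for tok, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_comments(edit_tokens, comments):
    """Return comment indices sorted from most to least similar to the edit."""
    vecs = tfidf_vectors([edit_tokens] + comments)
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    return sorted(range(len(comments)), key=lambda i: scores[i], reverse=True)


edit = "krizhevsky and sutskever and hinton".split()
comments = ["added sutskever".split(), "fixed typo".split()]
print(rank_comments(edit, comments))  # [0, 1]
```

A baseline of this kind can only reward shared vocabulary, so it has no signal when a comment describes an edit without repeating any of its words.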

Model Variants
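We train and evaluate several different variants of our neural architecture: CmntEdit-CR, which is trained specifically for Comment Ranking; CmntEdit-EA, which is trained specifically for Edit Anchoring; and CmntEdit-MT, which is trained on both tasks in a multi-task fashion.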

Experimental Results
We now present and discuss the empirical results of our evaluation on both Comment Ranking and Edit Anchoring. Table 3 summarizes results of the Comment Ranking task. Our models significantly outperform all the baselines in every setting and on all metrics. The results are statistically significant at p < 0.01 using the Wilcoxon signed-rank test (Smucker et al., 2007). Since the Comment Ranking task only has one true comment, the other comments being distractors, the P@1 result becomes especially important for practical applications. Our approach achieves 71% precision, which is significantly better than the 31% precision of the best baseline method (LambdaMART) with 5 candidate comments. Our model similarly outperforms the best baseline with 10 candidate comments, obtaining a P@1 score of 59.3%. Additionally, the higher scores on MRR and NDCG indicate that our approach consistently ranks the true comment higher than the baseline methods. The performance of CmntEdit-MT is 2% and 5.3% worse than CmntEdit-CR on NDCG and P@1, respectively. Surprisingly, this suggests that training our model in a multi-task fashion leads to slightly lower scores, and a model specifically trained for the individual task of Comment Ranking is to be preferred.

Table 4 shows the results for Edit Anchoring. Our method, CmntEdit-EA, outperforms the best baseline method, Gated-RNN, by 5.5% on F1 and 6.9% on accuracy. The improvements over all the baselines are statistically significant at a p-value of 0.01. The baseline classifiers including Passive-Aggressive, Random Forest and Adaboost have high accuracies, but low F1 scores. This is because of the imbalance between positive and negative samples in our data. Specifically, the number of negative samples is 4 times greater than the number of positive samples when the size of the candidate set is 5, and even greater when it is 10. Therefore, the baseline classifiers tend to naively predict a negative label, which artificially boosts precision to the detriment of recall. In fact, Adaboost actually outperforms our models on accuracy when the candidate set size is 10, but yields a much lower F1 score.
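The gap between accuracy and F1 under class imbalance is easy to reproduce with a toy calculation. The sketch below scores a hypothetical classifier that labels every candidate sentence as negative, for a sample with one edited sentence among five candidates; the function name and the convention of returning an F1 of 0 when no positives are predicted or present are choices made for this example.

```python
def accuracy_and_f1(preds, labels):
    """Accuracy and F1 for binary predictions (1 = edit location, 0 = not)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


labels = [1, 0, 0, 0, 0]       # one edited sentence among five candidates
always_negative = [0] * 5      # classifier that never predicts an edit
print(accuracy_and_f1(always_negative, labels))  # (0.8, 0.0)
```

The always-negative classifier is right on four of five sentences yet never finds the edit, which is why F1 is the more informative metric in this setting.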

Edit Anchoring
As with Comment Ranking, the performance of CmntEdit-MT is slightly worse than CmntEdit-EA. Within the scope of our problem space, this reinforces the finding that targeted training seems to work better than joint training.

Ablation Study
To verify the effectiveness of our modeling choices, we evaluate performance in the absence of each of the following model components:
1. w/o Action: we remove the action encoding from the trainable function $G$ and instead simply use $G(u, q) = w^T [u^T q]$.
2. w/o Attention: we remove the edit-based attention from Equation (2). Instead, we generate the relevance vector $h$ as follows: $h = [\mathrm{mean}_{col}(S^\dagger); \mathrm{mean}_{col}(S^\ddagger)]^T$.
3. w/o Hadamard: we use the inner product instead of the Hadamard product in the trainable function $G$ as follows: $G(u, q, a) = w^T [u^T q; a]$.

Results in Table 5 show that each component improves the overall performance on both Comment Ranking and Edit Anchoring tasks, across our evaluation metrics. This indicates that our modeling choices are particularly suited to tackle the inherent challenges involved in modeling the comment-edit relationship.

Qualitative Evaluation
Table 6 provides a few output examples from our model on the Comment Ranking task, demonstrating its ability to learn abstract connections between comments and edits. Due to space constraints, only one illustrative distractor is shown.
The first example shows an edit summarized by the high-level comment "date and capitalization corrections". This comment is correctly assigned the highest relevance score by our model, despite the fact that no words are shared between the comment and edit. Meanwhile, one of the distractors has a lower score even though it shares the lexical item "Walgreens" with the context of the edit.
In the second example an entire sentence is removed by the editor. Again, although no words are shared between the comment and the edit, our model is correctly able to identify the delete operation, possibly by learning the common Wikipedia shorthand for deletions "rm". Meanwhile, one of the distractors contains the phrase "St Helens", which also appears in the edit, but is still assigned a lower score.

Conclusion and Future Work
In this paper, we have explored the relationship between comments and edits by defining two novel tasks: Comment Ranking and Edit Anchoring. In order to model the problem we collected a dataset with over 780K comment-edit pairs. Further, we proposed a hierarchical multi-layer neural network capable of tackling both our proposed tasks by encoding specific edit actions, such as additions and deletions, as well as document context. In our experiments we show that our approach outperforms several baselines by significant margins on both tasks, yielding a best score of 71% precision@1 for Comment Ranking and 74.4% accuracy for Edit Anchoring.
In future work we plan to explore sequences of revisions through the lifecycle of a document from creation to completion, with the ultimate goal of modeling document evolution. We also hope to apply our modeling approach to practical downstream applications, including: i) detecting completed to-dos based on related edits; ii) localizing the paragraphs that could be edited to address a given comment; iii) summarizing document revisions.