Towards Modeling Revision Requirements in wikiHow Instructions

wikiHow is a resource of how-to guides that describe the steps necessary to accomplish a goal. Guides in this resource are regularly edited by a community of users, who try to improve instructions in terms of style, clarity and correctness. In this work, we test whether the need for such edits can be predicted automatically. For this task, we extend an existing resource of textual edits with a complementary set of approx. 4 million sentences that remain unedited over time and report on the outcome of two revision modeling experiments.


Introduction
Instructional texts have become an integral part of our daily lives, be it in the form of assembly instructions, product leaflets, troubleshooting guides, or board game manuals. A key property across all types of such texts is that they must be clear enough so that readers can actually achieve the goal described by the instructions.
Previous studies in computational linguistics have dealt with the clarity of specific types of instructions, such as route directions (Byron et al., 2009; Striegnitz et al., 2011) and software requirements (Willis et al., 2008; Yang et al., 2010). As an indicator for clarity, they relied either on successful execution in a virtual environment or on manual annotations of predefined ambiguity types. A large and more general dataset of instructional texts, wikiHowToImprove, has recently been introduced by Anthonio et al. (2020). wikiHowToImprove consists of edits for about 2.5 million sentences derived automatically from revision histories of wikiHow, a collaboratively edited platform of how-to guides. In a set of human and computational experiments, Anthonio et al. (2020) show that such edits are often made to clarify or correct a sentence and that the difference between an "older" and "newer" version of a sentence can be predicted computationally.
We address two notable questions, using the work of Anthonio et al. (2020) as a starting point: (1) Are results for the task of distinguishing two versions of a sentence (henceforth version distinction) specific to instructional texts, such as the guides from wikiHow, or can the underlying linguistic characteristics be modelled to a similar extent in a different text genre? We reproduce version distinction results for different computational models on a variant of wikiHowToImprove and provide comparison results on an earlier dataset of revision edits (WikiAtomicEdits) derived from Wikipedia (Faruqui et al., 2018). In our experiments, we find that models for version distinction work best on instructional texts and that they are capable of detecting a variety of potential reasons for revision, including grammatical errors, semantic implausibilities and vague expressions.
(2) Given the results on instructional texts, is it possible to model whether a sentence requires revision in the first place? wikiHowToImprove only contains edited sentences. We extend the dataset with sentences from the revision history that remain identical over time. Based on this extension, we introduce the task of predicting revision requirements and assess its feasibility by testing whether models can distinguish sentences that get edited from ones that remain unedited. Our results show that it is possible to identify sentences that are subject to revision with an F1-score close to 70%, indicating potential utility for downstream applications such as grammar correction (Yuan and Briscoe, 2016), ambiguity detection (Gleich et al., 2010), and machine translation refinement (Novak et al., 2016).
In summary, we make the following contributions: First, we extend work on version distinction by providing experimental comparisons on wikiHow and Wikipedia. Second, we motivate a new task, predicting revision requirements, and provide a dataset extension as well as benchmark models for it.

Extension of wikiHowToImprove

We supplement wikiHowToImprove with a set of unedited sentences from the same articles that are part of wikiHowToImprove. For each article, we collect this set by identifying sentences that have remained unchanged from the article version in which they were first introduced until the last version of the article. Since sentences that are introduced in the last few versions are still likely to receive revisions, we use an additional filtering criterion that measures the ratio of the number of unchanged versions of a sentence to the total number of article versions.
In preliminary experiments (see Appendix B for more details), we tested ratios from 0.0 to 0.9 on the development set to find the most suitable value. The main observation in these experiments was that data imbalance and noise make it difficult for models to distinguish between sentences requiring revision and sentences not requiring revision. For our final experiments, we use a ratio of 0.75 because we found that it reduces noise to an acceptable level and leads to an almost balanced set (see Table 2). Statistics of the train/dev/test split are given in Appendix C.
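For concreteness, the following is a minimal sketch of the filtering criterion; the function name and counting convention are our own illustration, not the released code.

    def keep_as_unedited(n_unchanged: int, n_total: int,
                         threshold: float = 0.75) -> bool:
        """Filtering criterion for the 'not requiring revision' class.

        n_unchanged: number of consecutive article versions (up to the
                     last one) in which the sentence appears verbatim
        n_total:     total number of versions of the article
        """
        return n_total > 0 and n_unchanged / n_total >= threshold

    # Example: a sentence introduced in version 2 of an 8-version article
    # remains identical for versions 2..8, i.e. 6 of 8 versions -> kept.
    assert keep_as_unedited(6, 8)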

Computational Models
For both tasks, we evaluate the following methods:

Baselines. As baselines, we apply the open-source implementations of the methods from Anthonio et al. (2020): a multinomial Naive Bayes classifier with simple n-gram (n = 1, 2) features and a bidirectional long short-term memory (LSTM) network with an additional attention layer (Zhou et al., 2016). We use the same hyperparameters as previous work.
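For reference, the Naive Bayes baseline can be approximated in a few lines; this is a sketch using scikit-learn, not the exact open-source implementation of Anthonio et al. (2020).

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Unigram and bigram counts feed a multinomial Naive Bayes classifier.
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        MultinomialNB(),
    )

    # train_sentences: list[str]; train_labels: 0 = no revision, 1 = needs revision
    # model.fit(train_sentences, train_labels)
    # predictions = model.predict(test_sentences)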
BERT. For additional comparisons, we train new models based on BERT (Devlin et al., 2019), a multi-layer transformer encoder that uses self-attention to learn deep bidirectional representations. These representations can be fine-tuned using labelled data, which has led to state-of-the-art results for a range of NLP tasks. Experiment 2 only requires binary classification, so the output layer simply applies a linear transformation followed by a softmax.
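A minimal sketch of a comparable binary classification setup, using the Hugging Face transformers library; this illustrates the architecture described above rather than reproducing our exact training code.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # BertForSequenceClassification adds a linear layer on top of the [CLS]
    # representation; num_labels=2 matches the binary setup of Experiment 2.
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    batch = tokenizer(["Plan out the details your story."],
                      return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits
    probs = torch.softmax(logits, dim=-1)  # P(no revision), P(needs revision)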

Experiment 1: Version Distinction in wikiHow and Wikipedia
The aim of the first experiment is to compare models on the task of distinguishing older and newer versions of a sentence between wikiHow and Wikipedia. We make use of the same general setup as previous work (see Anthonio et al., 2020). Our hypothesis for this experiment is that distinguishing sentence versions will be easier in wikiHow than in Wikipedia, because edits in the latter provide new information more often than refinements (Faruqui et al., 2018), whereas the content of wikiHow is largely independent of world knowledge that may change over time.
Results. Table 3 shows the accuracy of our models on wikiHowToImprove and WikiAtomicEdits. In comparison to the results reported in Anthonio et al. (2020), we observe that the baseline models are approx. 2.5% less accurate. A possible explanation for this drop is that we removed (presumably easy to predict) typo-based edits from wikiHowToImprove for direct comparison with WikiAtomicEdits. The BiLSTM pairwise ranking model outperforms the BiLSTM binary classification model by 7.18% absolute accuracy on wikiHow but is 2.33% less accurate on Wikipedia. This is the case because the ranking mechanism can implicitly model information related to transitivity when more than two versions of a sentence exist in wikiHow. In contrast, WikiAtomicEdits only contains two versions of each sentence, and pairwise ranking only introduces additional overhead due to padding (see Appendix D for more details). The best results are achieved by BERT, which outperforms the BiLSTM ranking model by an additional 1.72 percentage points on wikiHow and the BiLSTM classification model by 3.36% on Wikipedia. Finally, all models perform better on wikiHow than on Wikipedia, and the best-performing model on wikiHow is 7.91% more accurate than the best-performing model on Wikipedia. This confirms our hypothesis that different versions of a sentence are harder to distinguish in the WikiAtomicEdits data.

Table 4: Examples of revisions correctly distinguished by the models (original version → revised version).

wikiHowToImprove
Pick one band that has a large fan basis and . . . → Pick one band that has a large fan base and . . .
It have much tricks and ways to interact with it. → It has many tricks and ways to interact with it.
Plan out the details your story. → Plan out the details of your story.

WikiAtomicEdits
She ends developing feelings for Naoto. → She ends up developing feelings for Naoto.
She is also worked for the Consortium. → She also worked for the Consortium.
He was Palmerston Park until 1931. → He was at Palmerston Park until 1931.
Discussion. One reason for the higher improvement of BERT on WikiAtomicEdits could be that wikiHow and Wikipedia contain texts of different genres and only Wikipedia is used for pretraining the BERT model (see Section 2.2). In a qualitative analysis of the results, we found that the BiLSTM and BERT models are able to detect typos as well as grammatically incorrect and ill-formed sentences in both data sets (see Table 4). The BERT-based model is further able to cover more subtle syntactic corrections and semantic clarifications. As exemplified in Table 5, these cases include improvements in terms of fluency and specificity, either through changes in word order or word choice (wikiHowToImprove) or through insertions of more detailed information (WikiAtomicEdits).

Table 5: Examples of fluency and specificity improvements (original version → revised version).

wikiHowToImprove
Follow instructions given by rail operators always. → Always follow instructions given by rail operators.
If people insult you, don't act like you care. → If people insult you, act like you don't care.
If you roll doubles, you go again. → If you roll doubles, you roll again.
Clear the dog waste as it happens. → Clear the dog waste immediately.

WikiAtomicEdits
In 1996, he moved to play for Glenrothes. → In 1996, he moved back to play for Glenrothes.
It returns an image which is automatically updated. → . . . which is automatically updated each day.
Dice games and slot machines are forbidden. → . . . are forbidden by state law.

Experiment 2: Predicting Revision Requirements in wikiHow
The aim of this second experiment is to provide benchmark models for predicting whether or not a sentence requires revision. The previous experiment has shown that it is difficult to distinguish different versions of a sentence in WikiAtomicEdits (§3). Therefore, we perform this experiment only on wikiHow. We make the simplifying assumption that all changes in wikiHow's revision history are made for the better and therefore represent needed revisions to the original version of an article. Thus, we treat all sentences that went through revision in wikiHowToImprove as requiring revision and all unrevised sentences from our extension (see Section 2.1) as requiring no revision. We evaluate to what extent a model correctly identifies sentences that require revision using precision, recall and F1-score.
Results. Table 6 shows the results of our models. As shown in the table, the BiLSTM model outperforms the Naive Bayes model by 16.88 percentage points in precision, but only achieves a recall of 51.86%. This result indicates that contextual information within the sentence is needed for precisely predicting revision requirements, but it is not sufficient to achieve good coverage. The BERT model achieves the highest F 1 -score, outperforming the Naive Bayes and BiLSTM models by 4.39 and 7.67 percentage points, respectively.
Discussion. In a qualitative analysis of results, we find that all models are capable (to varying degrees) of identifying grammar errors and sentences that are semantically implausible. A selected list of correctly classified example sentences is given in Table 7. As shown in the table, the BERT-based model seems to capture more subtle issues on the semantic level than the BiLSTM model, including adjective degrees ("silly" vs. "sillier") and metaphorical comparisons ("X is a Y" vs. "X is like a Y"). A likely reason is that BERT handles such cases better than the BiLSTM because BERT is pre-trained on large amounts of fluent and well-written text.

We further checked the performance of the BiLSTM and BERT-based models quantitatively on two specific types of cases from the 'subject to revision' category: grammatical errors and lexically vague modifiers. We automatically identify typical grammar errors as well as cases where an adjective or adverb is replaced with a full phrase (e.g. "frequently" vs. "about once a week") by lexically and syntactically comparing the original sentence in the data to its revised version. Table 8 shows the counts of instances and correct predictions, indicating that both models have a reasonable recall regarding grammatical errors. BERT identifies more than three times as many cases of lexical vagueness as the BiLSTM model, but still only achieves a recall of 56.7% (93/164).
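The vague-modifier cases can be extracted roughly as sketched below, assuming spaCy for part-of-speech tags; the alignment and matching heuristic are illustrative assumptions, not the exact script used in our analysis.

    import difflib
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def is_vague_modifier_edit(original: str, revised: str) -> bool:
        """True if a single adjective/adverb was replaced by a longer
        phrase, e.g. "frequently" -> "about once a week"."""
        orig_doc, rev_doc = nlp(original), nlp(revised)
        matcher = difflib.SequenceMatcher(a=[t.text for t in orig_doc],
                                          b=[t.text for t in rev_doc])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # one modifier on the left, a multi-token phrase on the right
            if (tag == "replace" and i2 - i1 == 1 and j2 - j1 > 1
                    and orig_doc[i1].pos_ in ("ADJ", "ADV")):
                return True
        return False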

Related Work
Wikipedia Revisions. Revisions in Wikipedia have been leveraged for various NLP tasks, such as spelling error correction (Ehsan and Faili, 2013; Grundkiewicz and Junczys-Dowmunt, 2014; Zesch, 2012), preposition error correction (Cahill et al., 2013), paraphrasing (Max and Wisniewski, 2010), sentence simplification and compression, textual entailment recognition (Zanzotto and Pennacchiotti, 2010) and lexical simplification (Yatskar et al., 2010). Within this framework, a number of studies have analyzed the types of edits that authors make (Daxenberger and Gurevych, 2012, 2013; Faruqui et al., 2018; Pfeil et al., 2006; Bronner and Monz, 2012; Liu and Ram, 2011) and their intentions (Yang et al., 2017; Zhang and Litman, 2016). These studies built further upon Faigley and Witte (1981) and Jones (2008). Daxenberger and Gurevych (2013) and Yang et al. (2017) performed multi-class classification to automatically detect edit types and edit intentions, respectively. Other text classification studies focused on a smaller set of revision intentions in Wikipedia, such as Recasens et al. (2013), who worked on bias/non-bias detection. Attention has also been given to distinguishing between vandalism and non-vandalism (Adler et al., 2011; Harpalani et al., 2011; Potthast et al., 2008) and between factual and fluency edits (Fong and Biuk-Aghai, 2010).

wikiHow Revisions. Compared to Wikipedia revisions, wikiHow has received less attention in NLP. Apart from Anthonio et al. (2020), there is no other work that has leveraged the revision history of wikiHow articles. However, wikiHow has been used for summarization (Koupaee and Wang, 2018) and knowledge acquisition (Chu et al., 2017; Zhou et al., 2019). Others have employed it to model procedure-specific relationships in sentences (Park and Motahari Nezhad, 2018) and underlying reasons for these relationships (Mishra et al., 2019). In a related task, Afrin and Litman (2018) worked with revisions in argumentative essays from ArgRewrite (Zhang et al., 2017). The authors trained a RandomForest classifier to predict, given an original sentence and a revised one, whether the revised sentence is better than the original. Tan and Lee (2014) conducted a related study, analyzing potential strength differences in original-revised sentence pairs in academic writing using a qualitative approach.

Conclusions
We demonstrated in an experimental comparison that it is easier to distinguish sentence versions computationally in wikiHowToImprove than in WikiAtomicEdits. We further introduced a new task of predicting whether a sentence requires revision and showed promising first results on specific types of revisions. As next steps, we plan to address further types of revisions and extend our experiments to document-level settings.

A Filtering Typos
For each revision pair (base, revised), we find the sentence edits and edit types using the Levenshtein distance algorithm. If all of the edits are of the substitution type and every substitution fixes a typo, we remove the base sentence.
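A minimal sketch of this filter, using difflib for the token alignment; looks_like_typo_fix is a hypothetical placeholder for the actual typo check (e.g. a dictionary lookup), not our exact implementation.

    import difflib

    def looks_like_typo_fix(old_token: str, new_token: str) -> bool:
        # Hypothetical stand-in for the real check: two tokens count as a
        # typo pair if they are nearly identical at the character level.
        return difflib.SequenceMatcher(a=old_token, b=new_token).ratio() >= 0.8

    def is_pure_typo_edit(base: str, revised: str) -> bool:
        """True if the revision consists only of one-for-one token
        substitutions that each fix a typo."""
        base_tokens, rev_tokens = base.split(), revised.split()
        ops = difflib.SequenceMatcher(a=base_tokens, b=rev_tokens).get_opcodes()
        saw_substitution = False
        for tag, i1, i2, j1, j2 in ops:
            if tag == "equal":
                continue
            # Insertions, deletions and unbalanced replacements disqualify.
            if tag != "replace" or (i2 - i1) != (j2 - j1):
                return False
            for old, new in zip(base_tokens[i1:i2], rev_tokens[j1:j2]):
                if not looks_like_typo_fix(old, new):
                    return False
                saw_substitution = True
        return saw_substitution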

B Selection of Filtering Ratio
In order to select an appropriate filtering ratio, we ran Experiment 2 (Predicting Revision Requirements) with 10 different ratios from 0.0 to 0.9. Based on the results on our validation set (see Figure 1) and the data imbalance at each ratio, we selected a ratio of 0.75, which led to an almost balanced set. A ratio of 0.75 means that sentences have to remain "identical" for the last 3 out of 4 article-level revisions in order to be considered as "not requiring revision".

D Padding for Pairwise Ranking Models

For the binary classification models, we batch sentences of the same length together, so no padding is required. But for the pairwise ranking models, we have to batch version pairs (base, revised) together. We can only batch these pairs if the number of tokens in both versions of a sentence is equal. So we first append pads to the shorter sentence in the version pair to make its length equal to that of the longer sentence, and then batch these pairs based on length.
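The pairing logic can be sketched as follows; the PAD symbol and batching details are simplified relative to our actual implementation.

    from collections import defaultdict

    PAD = "<pad>"

    def pad_pair(base, revised):
        """Pad the shorter version so both members of a pair have equal length."""
        diff = len(base) - len(revised)
        if diff > 0:
            revised = revised + [PAD] * diff
        elif diff < 0:
            base = base + [PAD] * (-diff)
        return base, revised

    def batch_pairs(pairs, batch_size):
        """Group padded (base, revised) token lists by their shared length,
        then split each group into batches."""
        by_length = defaultdict(list)
        for base, revised in pairs:
            base, revised = pad_pair(base, revised)
            by_length[len(base)].append((base, revised))
        for same_length in by_length.values():
            for i in range(0, len(same_length), batch_size):
                yield same_length[i:i + batch_size]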