A Report on the Automatic Evaluation of Scientific Writing Shared Task



Introduction
The vast number of scientific papers being authored by non-native English speakers creates an immediate demand for effective computer-based writing tools to help writers compose scientific articles. Several shared tasks have been organized before that in part addressed this challenge, all with English language learners in mind: Helping Our Own (HOO), with two editions in 2011 and 2012 (Dale and Kilgarriff, 2011; Dale et al., 2012), and two Grammatical Error Correction shared tasks in 2013 and 2014 (Ng et al., 2013; Ng et al., 2014). The four shared tasks focused on grammar error detection and correction, and constituted a major step towards evaluating the feasibility of building novel grammar error correction technologies.
An extensive overview of automated grammatical error detection for language learners was conducted by Leacock et al. (2010). In subsequent years, two English language learner (ELL) corpora were made available for research purposes (Dahlmeier et al., 2013; Yannakoudakis et al., 2011). While these achievements are critical for language learners, we also need to develop tools that support genre-specific writing features. This shared task focused on the genre of scientific writing.
Most scientific publications are written in English by non-native speakers of English. Submitted articles are often returned to the authors with an encouragement to improve the language or to have a native speaker proofread the paper. Pierson (2004) lists the top 10 reasons why manuscripts are not accepted for publication, with poor writing in 7th place.
In Section 2, we describe the task and its objectives; Section 3 gives an overview of the data set; Section 4 introduces the participating teams; Section 5 describes the framework used for organizing competitions; Section 6 summarizes the results of the task; Section 7 provides a detailed analysis and discussion of the results; and, finally, Section 8 presents the main conclusions of the Shared Task and our proposed future actions.

Task Definition
The goal of the Automated Evaluation of Scientific Writing (AESW) Shared Task was to analyze the linguistic characteristics of scientific writing to promote the development of automated writing evaluation tools that can assist authors in writing scientific papers. More specifically, the task was to predict whether a given sentence requires editing to ensure its "fit" within the scientific writing genre. The main goals of the task were to:
- identify sentence-level features that are unique to scientific writing;
- provide a common ground for development and comparison of sentence-level automated writing evaluation systems for scientific writing;
- establish the state-of-the-art performance in the field.
A few words should be said about the specifics of the scientific writing data set. Some proportion of the "corrections" in the shared task data are "real error" corrections, i.e., ones that most of us would agree fix genuine errors, for example, wrong pronouns and various other grammatical mistakes. Others almost certainly represent style issues and similar "matters of opinion", and it seems unfair to expect a system to spot these. This is a consequence of differing language editing traditions and experience, and of the absence of uniform agreement on what "good" language should look like. The task was organized to build a consensus on which language improvements are acceptable (or necessary) and to promote the use of NLP tools to help non-native writers of English improve the quality of their scientific writing.
Some interesting uses of sentence-level quality evaluations are the following:
- automated writing evaluation of submitted scientific articles;
- authoring tools for writing English scientific texts;
- identifying sentences that need quality improvement.
The task is defined as a binary classification of sentences, with the two categories needs improvement and does not need improvement. Two types of predictions are evaluated: Binary prediction (False or True) and Probabilistic estimation (between 0 and 1).

The Data Set
The data set is a collection of text extracts from 9,919 published journal articles (mainly from Physics and Mathematics) with data before and after language editing. The data are from selected papers published in 2006-2013 by Springer Publishing Company and edited at VTeX by professional language editors who were native English speakers (Daudaravicius, 2015). Each extract is a paragraph that contains at least one edit made by the language editor. All paragraphs in the data set were randomly ordered from the source text for anonymization. Additionally, identifying parts of the text were replaced with placeholders, specifically authors, institutions, citations, URLs, and mathematical formulas. This replacement was done automatically and is based on annotation in the primary data sources, which were LaTeX files. The data set will be made freely available on the Internet for replications and other studies. Sentences were tokenized automatically, and then both text versions, before and after editing, were automatically aligned with a modified diff algorithm. Some sentences have no edits, and some sentences have edits that are marked with <ins> and <del> tags. The text tagged with <ins> is the text that was inserted by the language editor, and the text tagged with <del> is the text deleted by the language editor. Substitutions are tagged as insertions and deletions because it is not always obvious which words are substituted with which. Some edits introduce or eliminate sentence boundaries. In such cases, a few sentences are combined into one data set sentence and, therefore, the number of tagged sentences in the data set differs before and after editing (see Table 2).
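The alignment step can be approximated with a standard diff. The following is an illustrative sketch using Python's difflib (the organizers used a modified diff algorithm, so this is an approximation, not their implementation; the function name is ours):

```python
# Illustrative sketch of diff-based edit tagging (the organizers used a
# modified diff algorithm; difflib here is only an approximation).
from difflib import SequenceMatcher

def tag_edits(before_tokens, after_tokens):
    """Render the editor's changes with <del>/<ins> tags, as in the data set."""
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=before_tokens,
                                              b=after_tokens).get_opcodes():
        if op == "equal":
            out.extend(before_tokens[i1:i2])
        else:
            # substitutions are rendered as a deletion plus an insertion,
            # mirroring the data set's convention
            if i2 > i1:
                out.append("<del>" + " ".join(before_tokens[i1:i2]) + "</del>")
            if j2 > j1:
                out.append("<ins>" + " ".join(after_tokens[j1:j2]) + "</ins>")
    return " ".join(out)

print(tag_edits("stored in the body region".split(),
                "stored at the body region".split()))
# stored <del>in</del> <ins>at</ins> the body region
```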
The training, development and test data sets comprise data from independent sets of articles (see Table 2).
- The training data: A fragment of the training data is shown in Table 3, where multiple insertions and deletions can be seen. For example:

<sentence sid="9.1"> For example, separate biasing of the two gates can be used to implement a <del>capacitor-less</del><ins>capacitorless</ins> DRAM cell in which information is stored <del>in</del><ins>at</ins> the <del>form</del><ins>back-channel</ins> <del>of</del><ins>surface</ins> <del>charge</del><ins>near</ins> <del>in</del><ins>to</ins> the <del>body region,</del><ins>source</ins> <del>at</del><ins>in</ins> the <del>back channel</del><ins>form</ins> <del>surface</del><ins>of</ins> <del>near</del><ins>charge</ins> <del>to</del><ins>in</ins> the <del>source</del><ins>body region</ins> _CITE_. </sentence>

- The development data: The development data is distributionally similar to the training data and the test data with regard to the edited and non-edited sentences, as well as the domain.
- The test data: Test paragraphs retain texts tagged with <del> tags, but the tags are dropped. Texts between <ins> tags are removed. However, all edits of the test data were provided to the teams after the final results were submitted.
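The tagged format can be unpacked mechanically into the two text versions. A minimal sketch, assuming the <ins>/<del> conventions described above (function names are ours, not part of the released tooling):

```python
# Sketch of recovering the two text versions from a tagged sentence
# (tag conventions as described in the text; function names are ours).
import re

def before_text(tagged):
    """Text as the author wrote it: keep <del> content, drop <ins> content."""
    s = re.sub(r"<ins>.*?</ins>", "", tagged)
    s = re.sub(r"</?del>", "", s)
    return re.sub(r"\s+", " ", s).strip()

def after_text(tagged):
    """Text after language editing: keep <ins> content, drop <del> content."""
    s = re.sub(r"<del>.*?</del>", "", tagged)
    s = re.sub(r"</?ins>", "", s)
    return re.sub(r"\s+", " ", s).strip()

t = "Next, we give <del>the</del><ins>a</ins> stability analysis."
print(before_text(t))  # Next, we give the stability analysis.
print(after_text(t))   # Next, we give a stability analysis.
```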

Supplementary Data
To speed up data preparation for training, development and testing, the following supplementary data were accessible to all participants. The training, development and test data were split into text before editing and text after editing, with:
- tokenized sentences with the sentence ID at the beginning of the line;
- POS tags of sentences with the sentence ID at the beginning of the line;
- CFG trees of sentences with the sentence ID at the beginning of the line;
- dependency trees of sentences with the sentence ID as the first line of each tree.
Texts from Wikipedia articles (the dump of April 2015) were also provided, with:
- tokens;
- POS tags;
- CFG trees of sentences;
- dependency trees of sentences.

The data were processed with the Stanford parser with the following parameters:
- model: englishRNN
- type: typedDependencies
- Java code for the grammatical structure:

    GrammaticalStructure gs = parser.getTLPParams().getGrammaticalStructure(tree,
        Filters.acceptFilter(), parser.getTLPParams().typedDependencyHeadFinder());

Shared Task participating teams were allowed to use other publicly available data, with the exclusion of proprietary data. Any such additional data had to be specified in the final system reports. The participants were encouraged to share their supplementary data, where relevant.

Participants
By the time of data release, 18 groups had registered for the task. Access to the data required an agreement which allows its use under the Creative Commons CC-BY-NC-SA 4.0 license with a few extra restrictions. The six groups that submitted results and published system reports are listed in Table 1, with participants spanning several continents.
A high-level summary of the approaches used by each team is provided in Table 5. The most common methods were deep learning (HU and NTNU-YZU) and maximum entropy (Knowlet and UW-SU). The other teams used logistic regression and support vector machines. The deep learning models used only tokens and word embeddings as their features. NTNU-YZU represented sentences as a sequence of word embeddings to train a convolutional neural network (CNN). HU had a more complex approach, reporting the majority vote of a CNN using word embeddings and stacked character-based and word-based Long Short-Term Memory (LSTM) networks.
Besides tokens and token n-grams, the most common features were parse trees (ISWD and UW-SU). ISWD used tree representations of the sentences as features for an SVM, while UW-SU augmented a grammar with a series of "mal-rules", which license ungrammatical properties in sentences, and identified whether the mal-rules occurred in the most likely sentence parses. HITS implemented 82 task-specific features, including counts of word types, patterns found in words (such as contractions), and probabilities. Knowlet tested the efficacy of existing grammar tools for this task by training their model on features extracted from LanguageTool and After the Deadline.

CodaLab.org
In this section we share our experience of using CodaLab for the AESW Shared Task. CodaLab is an open-source platform that provides an ecosystem for conducting computational research in a more efficient, reproducible, and collaborative manner. On codalab.org, we used Competitions to bring together all participants of the AESW Shared Task and to automate the result submission process. Each participant had to register on the codalab.org system and apply to the task in order to submit results and receive evaluation scores. We created four evaluation phases to distinguish the four evaluation tasks. The training and development data were released on December 7, the test data and CodaLab evaluation opened on February 29, and the deadline for submitting results was March 10.
Participants were allowed to submit results many times (up to 100 submissions per day), with no more than two results for their final submission in each track. Our experience shows that evaluation can take anywhere from one minute to a few hours. Table 4 shows the number of successful submissions of each participant for each evaluation phase. The average number of submissions per evaluation phase was six, with one exception. In principle, an effectively unlimited number of submissions allows a team to tune their system based on performance against the test set as revealed by the automated scorer. The number of failed uploads was around ten percent. Therefore, our advice for future implementations of similar shared tasks is to limit the number of uploads to five in the testing phase.
The system allows organizers to upload scorer programs and reference data to the server such that participants cannot see the reference data, which guarantees that scoring is done fairly. The scorer program was initially built in the Haskell programming language, but we could not manage to run the executable on the server despite the documentation describing such a possibility. Therefore, the scorer program was reimplemented in Python. The Python scorer demonstrated unexpected behavior at the end of the testing phase: the codalab.org system did not report any errors when participants submitted a truncated list of predictions. One team uploaded a truncated list of predictions that was accepted and scored; the scores were close to a random prediction score. After double-checking all submitted results, we discovered that the system accepted results even if the list of predictions was shorter than its expected size. This happened due to a difference between how the Haskell and Python implementations paired the two lists. In the Haskell implementation, lists of unequal length resulted in an error, whereas Python's zip function pairs elements only until the shorter list is exhausted and throws no exception when the lists are of unequal lengths. The team in question was warned, and an additional day was given for correcting their system and re-submitting their results. The lesson learned is that even if a scoring program produces an output score, the final scores should be double-checked manually.
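The failure mode can be reproduced in a few lines. The following sketch is illustrative only (it is not the actual scorer, and the helper name is ours):

```python
# Miniature reproduction of the scorer bug described above.
gold = [1, 0, 1, 1, 0]   # reference labels
preds = [1, 0, 1]        # a truncated submission

# Python's zip silently stops at the end of the shorter list,
# so the two missing predictions go unnoticed:
pairs = list(zip(gold, preds))
assert len(pairs) == 3   # not 5!

def safe_pairs(gold, preds):
    """Pair labels with predictions, refusing truncated submissions.
    (Python 3.10+ offers zip(gold, preds, strict=True) for the same check.)"""
    if len(gold) != len(preds):
        raise ValueError(f"expected {len(gold)} predictions, got {len(preds)}")
    return list(zip(gold, preds))
```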

Results
In this section, we describe the results of both tracks of the shared task.
First, we define the primary evaluation metric for both tracks, the F1 score:

    F1 = 2 * Precision * Recall / (Precision + Recall).

For the Binary decision track, precision and recall are defined as

    Precision = TP / (TP + FP),    (1)
    Recall    = TP / (TP + FN),    (2)

where TP (true positive) is the number of sentences correctly predicted to need improvement; FP (false positive) is the number of sentences incorrectly predicted to need improvement; and FN (false negative) is the number of sentences incorrectly predicted to not need improvement. We additionally report Pearson's correlation coefficient and the agreement calculated with Cohen's kappa.

For the Probabilistic estimation track, rankings are likewise based on the F1 score; we additionally report the mean squared error (MSE):

    MSE = (1/N) * sum_i (G_i - π_i)^2,

where, for a sentence i, G_i = 1 if the sentence needs improvement in the gold standard and G_i = 0 otherwise, π_i is the probabilistic estimate that the sentence needs improvement, and N is the total number of sentences; n is the number of sentences predicted to need improvement (π_i > 0.5), and m is the number of sentences that actually need improvement. We also calculated the cross-entropy between the predictions and the gold standard, defined as

    H = -(1/N) * sum_i [ G_i * log(π_i) + (1 - G_i) * log(1 - π_i) ].

Finally, we represented each probability with its corresponding boolean value (y_i = True if π_i > 0.5, else y_i = False) and calculated the binary-task F-score (with precision and recall calculated as in Equations 1 and 2), the correlation, and the agreement statistic.
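These metrics can be written out directly. A plain-Python sketch (the function names are ours, not the official scorer's):

```python
import math

# Plain-Python sketch of the evaluation metrics described above.
def precision_recall_f1(gold, preds):
    """gold and preds are lists of 0/1 labels; returns (Precision, Recall, F1)."""
    tp = sum(1 for g, p in zip(gold, preds) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, preds) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, preds) if g == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def mse(gold, probs):
    """Mean squared error between gold labels G_i and estimates pi_i."""
    return sum((g - p) ** 2 for g, p in zip(gold, probs)) / len(gold)

def cross_entropy(gold, probs, eps=1e-12):
    """Average cross-entropy; eps guards against log(0)."""
    return -sum(g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps)
                for g, p in zip(gold, probs)) / len(gold)
```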
The results of the Binary decision task are shown in Table 6. The results for the Probabilistic estimation task are provided in Table 7, and the analysis over the corresponding boolean values is shown in Table 8. When a team submitted more than one set of results, we identify the two submissions as TEAM and TEAM†. (Note: UW-SU reported all probabilities π_i > 0.5, and therefore σ_π = 0 and r is undefined.)
The second-ranked system is NTNU-YZU, which trained a CNN model. Both of these models used word2vec word embeddings, with NTNU-YZU testing both word2vec and GloVe. The bottom two teams according to the F-score, NTNU-YZU† and Knowlet, have the third and fourth strongest agreement with the gold standard, respectively. Compared to the other submissions, these systems have the highest precision, 0.6717 and 0.6241, respectively, with the precision of the other systems ranging from 0.38 to 0.54. They also had the lowest recall (0.3805 and 0.3685), with the recall of the other teams between 0.70 and 0.95. This suggests the importance of precision in this task.
For the Probabilistic estimation track, HITS had the highest precision (0.9333) and F-score (0.8311) (Table 7). The other teams all had precision ≥ 0.66 and recall ≥ 0.62. However, the rankings by F-score and by correlation diverge significantly for three systems: HITS, NTNU-YZU†, and Knowlet. While HITS has the highest F-score, it also has the weakest correlation with the gold standard. NTNU-YZU† and Knowlet have the lowest F-scores but the first and third strongest correlation, respectively. The ranking by cross-entropy is similar to the F-score ranking with the exception of HITS, which has the fifth highest cross-entropy. To address this disparity, we calculated additional rankings of the systems by converting the output probabilities into the corresponding boolean values (True if π_i > 0.5, and False otherwise) and reporting the values of the same metrics we used to evaluate the Binary prediction task (Table 8). These statistics are indicated with a subscript b. In this analysis, the ranking of HITS declines significantly from the original Probabilistic evaluation, with the lowest F-score_b of all systems. The precision_b of HITS is nearly perfect (0.9282) but its recall_b is almost 0 (0.0129), which explains why its F-score_b, Correlation_b, and Kappa_b statistics are all so low. Knowlet improves from last to the fourth-ranked system by F-score_b. By the correlation and agreement statistics, NTNU-YZU and NTNU-YZU† are the best two systems in the converted-probabilities analysis.
As demonstrated, different statistics produce dissimilar system rankings. The official scores for both tasks are the F-scores, as defined in the workshop description, but there is evidence that the evaluation could be improved in future tasks. UW-SU and HITS pointed out that favoring recall over precision improves their F-score, which increases the system's ranking but decreases its accuracy. Precision has been shown to be more important when providing feedback on grammatical errors, with fewer but accurate suggestions being better than more numerous but inaccurate ones (Nagata and Nakatani, 2010). For future shared tasks, additional evaluation methods should be investigated, including F0.5, which weights precision more heavily than recall, and a comparison to human evaluation, such as is done by the Workshop on Machine Translation (Bojar et al., 2015).
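For illustration, F0.5 is the beta = 0.5 case of the general F_beta score, which weights precision more heavily when beta < 1 (the function name and example numbers here are ours):

```python
# General F_beta score: beta < 1 weights precision more heavily than recall.
def f_beta(precision, recall, beta):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A high-precision, low-recall system scores relatively better under F0.5:
p, r = 0.90, 0.30
print(f_beta(p, r, 1.0))   # ~0.45   (plain F1)
print(f_beta(p, r, 0.5))   # ~0.6429 (F0.5 rewards the high precision)
```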

The trends of system predictions
The initial impetus to organize this competition was to gain insight into the specifics of scientific writing as a genre and, with the help of participants, to estimate whether it is possible to offer any robust automatic solutions to support researchers with a non-native English background in writing scientific reports. There are several facts and their implications to be considered:
- The first fact deals with the formal requirements of the genre. Scientific writing has very clear, and to a certain extent limited, aims: to inform other researchers in the field of the latest findings or important issues, usually presented in the form of articles, reports, grant proposals, theses, etc. Each of these follows roughly the same structure, comprising more or less obligatory parts (e.g. abstract, data, method). The intended audience, i.e. other researchers, should be familiar with the standard to be able to skim for major findings and conclusions in the document without wasting time on irrelevant parts. Scientific language is therefore rather rigid to fit this need.
- Another fact we need to consider is that most scientific writing is done by mature users of English, who in most cases do not make second-language-learner types of mistakes, at least not frequently. This fact is reflected in the type of edits in our data: they are corrections mostly reflecting linguistic conventions of the genre. Correct use of punctuation, hyphenation, digits, capitalization, abbreviations, and domain-appropriate lexical choices are the types of corrections that dominate professionally proofread scientific papers, and are unique to scientific writing. Among more classical second-language error types, we can see verb (dis)agreement; (in)appropriate use of articles, prepositions and plurals; (mis)spellings; (in)correct choice of word; etc. However, these traditional error types are much less represented in scientific writing.
To see how successfully our task participants have coped with the challenges of scientific writing, we have analyzed the main trends concerning which error types were detected by all algorithms (successfully identified as 'need improvement' by all systems) versus which none were able to capture (i.e. sentences that were annotated as 'need improvement' but were not detected by any system).
There are four cases presented in Table 9. We can observe 7,899 cases of successful agreement between the proofreaders and all the teams on sentences that need correction. Most cases of article misuse, punctuation infelicities, diverting capitalization, and unconventional usage of digits, abbreviations and hyphenation were detected by all teams, including sentences where lexical choices were not optimal, e.g.:
- For computations we chose _MATH_ and a spectral interval in the vicinity of the resonance frequency of the mode with radial number _MATH_, _MATH_.
- Provided _MATH_ has no zero in its initial data, the log-logarithmic singularity at _MATH_ causes the left-hand side to blow up at _MATH_, thereby forcing _MATH_ as _MATH_.
-Given this reasoning we have evaluated the one one loop and eighteen18 two loop vacuum bubble graphs contributing to (_REF_).
-SimilarSimilarly to the previous case, we have a line of fixed points with positive slope _MATH_ in (_MATH_, _MATH_) plane as shown in Figure 2.
In 32 cases, all the teams unanimously disagreed with the gold standard on the need for correction. These cases cover:
- context deficit, where on the sentence level it is impossible to identify the correct need of an article or an adverb, e.g.:
- Next, we give thea stability analysis.
- The algorithm then terminates.
- alternative lexical choices, in particular more formal variants or special terminological usages, e.g. notice versus note, fitted parameter versus fit parameter;
- a number of notorious matters of opinion, such as replacing this paper with the paper and vice versa, e.g.:
- TheThis paper is organized as follows.
- First, we derive the following: _MATHDISP_.
- style/tense requirements of the genre, e.g. using the present tense when referring to results in tables:
- The results wereare presented in Figure _REF_.
- use of punctuation in the following cases:
- Namely, we observe the following.:
- Example:.
- stylistic preferences:
- Since _MATH_ and _MATH_, we can easily get _MATH_.
- This error is only limited by the instrument resolution of the instrument.
It can be argued that in most of those 32 cases corrections are optional.
One conclusion that can be drawn from this task performance analysis is that scientific writing as a genre needs standardization. We have encountered several types of inconsistencies in the data, for example in hyphenation (nonlinear for non-linear, and vice versa), or in expressions like this paper for the paper and vice versa. Even though it seems that the area could benefit from standardization, we are well aware that language can never be fully standardized. At most, there are, and can only be, guidelines or a consensus on what good language should look like.
Another conclusion is that automatic detection of scientific prose errors as an area of research would benefit from error-type annotation. More rigorous analysis of the data in terms of the type of corrected deviations could give us a better insight into what the genre of scientific writing is and facilitate more error-aware approaches to automatic proofreading of scientific papers.
Yet another conclusion is that stepping outside the sentence boundary may facilitate recognition of a number of other error types that currently go unnoticed due to context deficit, among them inconsistent use of abbreviations, certain cases of article usage, and missing adverbs, to name a few.

Conclusions
In this work we have described the results of the AESW Shared Task (Automatic Evaluation of Scientific Writing), which focuses on the problem of identifying sentences in scientific works that require editing. The main motivation of this task is to promote the use of NLP tools to help non-native writers of English improve the quality of their scientific writing. From the research perspective, on the other hand, this effort aims at promoting a common framework and standard data set for developing and testing automatic evaluation systems for the scientific writing domain.
From a total of 18 groups registered for the shared task, six of them submitted results and published reports describing their implemented systems. As a consequence, different machine learning paradigms (including neural networks, support vector machines, maximum entropy, and logistic regression) have been tested over the two proposed evaluation modalities (binary and probabilistic estimation). The shared task has helped establish a reference for the state-of-the-art in the automatic evaluation of scientific writing, in which the obtained results demonstrate that there is still room for improvement. The availability of both the data set and the evaluation tools will facilitate the path for future research work in this area.
In the future, we plan to improve on current system performances by implementing and evaluating different system combination strategies. Additionally, as suggested by the observed ranking inconsistencies across the different evaluation metrics in the probabilistic estimation task, we also need to conduct further analysis and take a more detailed look at these results to determine the best evaluation scheme to be used for this modality.