Treat the system like a human student: Automatic naturalness evaluation of generated text without reference texts

The current most popular method for automatic Natural Language Generation (NLG) evaluation is comparing generated text with human-written reference sentences using a metrics system, which has drawbacks around reliability and scalability. We draw inspiration from second language (L2) assessment and extract a set of linguistic features to predict human judgments of sentence naturalness. Our experiment using a small dataset showed that the feature-based approach yields promising results, with the added potential of providing interpretability into the source of the problems.


Introduction
More and more text is generated in Machine Translation, Text Summarization, Image Captioning, and Dialogue Systems. With this increased usage of Natural Language Generation (NLG) comes an increase in the importance of evaluating the language generated, and an increase in the difficulty of doing so as the quantity and variety of output increases. Automatic NLG evaluation focuses on two areas: accuracy and fluency. The former assesses how well the generated text conveys the desired meaning, while the latter assesses how well the language flows: the 'linguistic quality of the text' (Gatt and Krahmer, 2018) and whether it sounds like something a native speaker of the language would naturally produce. This paper focuses on the latter. We first review current approaches in metrics-based evaluation, in referenceless evaluation and in second language (L2) language assessment; we then present our experiment in section 3.

Metrics system using human reference set: the lion's share
NLG evaluation has traditionally relied on human judgments (Mellish and Dale, 1998). Beyond that, the predominant automated method is to compare generated text with one or more human-created reference texts using a metric-based system (Gatt and Krahmer, 2018). The more similar the system output is to the human-authored text, the better the system is judged to be. Popular metrics include BLEU (Papineni et al., 2002), ROUGE (Lin and Hovy, 2003), NIST (Doddington, 2002), METEOR (Lavie and Agarwal, 2007) and CIDEr (Vedantam et al., 2015), among others. Up to 60% of NLG research published between 2012 and 2015 relied on such metrics (Gkatzia and Mahamood, 2015). However, it has repeatedly been found that automated metrics do not correlate well with human evaluations of generated text (Stent et al., 2005; Belz and Reiter, 2006; Reiter and Belz, 2009) and that the correlation is weaker at sentence level than when evaluating a system overall (Novikova et al., 2017a; Shimorina, 2018). Novikova et al. (2017a) compared popular comparison metrics used to evaluate NLG systems, concluding that the current state-of-the-art metrics are insufficient and cannot replace human judgments. They demonstrated that all the aforementioned automated metrics based on word overlap with reference texts were strongly correlated with each other and only weakly correlated with human judgments of naturalness and quality. Furthermore, the strongest (though still weak) correlation between any metric and human naturalness judgments was found on the least varied dataset, which expressed only a limited set of attributes and had lower lexical diversity because it was only partially lexicalised (all proper names were replaced by placeholder variables). Given that lexicalisation is a source of ungrammaticality in NLG (Sharma et al., 2016), this dataset does not fully represent the challenge of evaluating the final output of an NLG system.
In addition to accuracy concerns, using a metrics system with a human reference set has several practical limitations. Firstly, reference sets tend to require experts (e.g. translators) and are thus costly to create. Secondly, an output that is different from a human-written reference is not necessarily a bad sentence for the task: there are often multiple valid ways to express a desired meaning. The evaluation therefore requires multiple reference sentences, which makes producing a reference set even harder and complicates the similarity calculation. Thirdly, creating a human gold standard is not suitable for fast or large-scale assessment. For NLG systems that cover a large variety of topics, the quantity of reference sentences required can make this approach prohibitive during system development.

Moving away from human reference set
We should look beyond evaluation using human references and learn from research outside our immediate domain, since there has been more research into automatic evaluation of text without human references in tasks similar to NLG than there has been for NLG itself.
One such domain is second language (L2) learner language assessment. Here the target is not machine-generated text but human-produced text. Over the last decade, a large body of work has identified linguistic features that indicate language fluency and complexity (Hancke et al., 2012; Feng, 2010; Chen and Zechner, 2011; Lu, 2010). The linguistic feature based models in L2 assessment seem to correlate more strongly with human judgments of naturalness than current NLG evaluation metrics (with the caveat that these are different tasks). Many of the features require syntactic and discourse parsing, and they capture linguistic knowledge of what makes sentences readable and natural, as reflected in psycholinguistic studies on reading and parsing effort. These features are often more interpretable than purely statistical metrics, so they potentially allow us not only to evaluate the naturalness of a sentence or document, but also to identify why it is good or bad.
Another relevant domain is automatic grammaticality judgment. Wagner et al. (2009) investigated grammaticality classification using features such as part-of-speech (POS) n-gram frequencies and the output of probabilistic parsers trained on corpora of grammatical and ungrammatical sentences. They found that parse probability is reduced by spelling, agreement and verb form errors. Heilman et al. (2014) also found linguistic feature based models to be effective when using spelling, language model and grammar features from different parsers. They found that n-gram frequencies and the ability to be parsed were the most influential features for indicating grammaticality. This feature-based method also proved effective in grammaticality evaluation when applied to grammatical error correction applications (Napoles et al., 2016). In Machine Translation, quality estimation without reference texts has been the subject of multiple shared tasks (Bojar et al., 2017). The QuEst 2015 sentence-level model (Specia et al., 2015), which provided the baseline for the latest completed task, uses features of the source and/or target sentences, including language model scores, length, part-of-speech and dependency parsing features. The leading system (Kim et al., 2017) in the 2017 task used an end-to-end stacked neural model consisting of a bilingual neural word prediction model and a neural quality estimator model. The next best performing team's submission (Martins et al., 2017) used a stacked combination of a linear feature-based model (with dependency, POS and syntactic features) with a neural network.
Within NLG evaluation, Novikova et al. (2017a) examined the correlation between human evaluations and grammar-based measures that indicate readability and grammaticality. To measure grammaticality, they used the number of misspellings and the Stanford parser parsing score. Using the Flesch Reading Ease score (Flesch, 1979) and various other measures of complexity such as character, word, syllable and sentence counts, they found that, at a system level, systems producing utterances of higher readability and shorter word length received higher naturalness and overall quality ratings from humans. However, at sentence level there was no strong correlation between such metrics and human ratings that could reliably identify generated sentences with low readability or low grammaticality. Linguistic features of texts thus do correlate with human judgments in NLG, but no single feature does so strongly. This supports our proposal that combining multiple grammatical features could automatically identify the quality of generated sentences.
We apply the feature-based approach used elsewhere by trying to identify whether machine-generated sentences are fluent and natural, and compare the predictions with human-produced labels. Unlike previous work on grammaticality prediction, we focus on the notion of "naturalness" or "fluency" rather than just grammaticality. This is because 1) psycholinguistic studies have shown that human perception of grammaticality is gradient (Keller, 2001), and 2) for most systems involving NLG, it matters how easy it is for humans to understand the sentences, not just whether the sentences are grammatical. With this in mind, we use features to capture the ease of parsing (influenced by grammaticality and syntactic complexity) and semantic soundness (influenced by word collocations and frequency). One recent investigation into NLG evaluation without reference texts that we are aware of used a recurrent neural network to estimate overall quality from the meaning representation input and the output sentence (Dušek et al., 2017). Our work differs in its use of linguistic features, which have proved successful in other domains and offer the prospect of interpretability. We also maintain the separation between evaluating the adequacy of the semantic content and evaluating the fluency of the text, as has been found to be advisable for NLG evaluation (Stent et al., 2005).

Deriving the linguistic feature set
Expanding on the literature on L2 language assessment, especially Hancke et al. (2012), and on grammaticality evaluation, we derived five groups of features (see full list in Table 1).

Lexical features
Lexical features include counts and ratios of words, lemmas and Part-of-Speech (POS) tokens. Type-Token Ratio (TTR), the ratio of the number of word types (in terms of lemmas) to the total number of word tokens in a text, and its variants are used to measure lexical variation in language acquisition studies. We adopted TTR variants (such as root TTR and corrected TTR) and word counts by POS category, extracted using spaCy.
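As a minimal sketch of the TTR family, the snippet below computes plain TTR together with root TTR and corrected TTR. The formulas for the two variants (types divided by the square root of the token count, and by the square root of twice the token count) are the commonly used definitions and are an assumption here, since the text does not spell them out. The function name `ttr_features` and the example lemma list are hypothetical; in practice the lemmas would come from a tagger such as spaCy.

```python
import math


def ttr_features(lemmas):
    """Return type-token ratio variants for a list of lemma strings.

    The root and corrected variants follow the commonly used definitions
    (an assumption; the paper does not state the exact formulas).
    """
    n_tokens = len(lemmas)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    n_types = len(set(lemmas))
    return {
        "ttr": n_types / n_tokens,
        "root_ttr": n_types / math.sqrt(n_tokens),
        "corrected_ttr": n_types / math.sqrt(2 * n_tokens),
    }


# Hypothetical lemmatised sentence: 8 tokens, 6 distinct lemmas.
example = ttr_features(["the", "cat", "see", "the", "dog", "and", "the", "bird"])
print(example)
```

Unlike plain TTR, the root and corrected variants dampen the effect of text length, which matters when sentences of very different lengths are compared.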

Constituency parse features
We used the BLLIP reranking parser (Charniak and Johnson, 2005), which includes a generative constituent parser and a discriminative maximum entropy reranker, and the WSJ-Gigaword-v2 model, which consists of the Wall Street Journal corpus from the Penn Treebank and two million sentences from Gigaword. From the parser output we used as features the parser log probability and reranker log probability of the most likely parse after reranking the 50-best parses. The idea is that parse probability reflects parser confidence and correlates with sentence quality (Mutton et al., 2007). We also added features for kurtosis and skew of the log probabilities of the 50 most likely parses, based on the idea that the distribution reflects sentence grammaticality and readability (Wagner et al., 2006). Our intuition was that a well-formed grammatical sentence would have positive skew and high kurtosis, dropping steeply from the highly probable best parse to other much less likely parses. Conversely, an ungrammatical sentence would have a flatter distribution with lower kurtosis, as none of the parses is very probable. Other features include tree height (length of the longest path from the root), number of subtrees, proportion of nonterminal subtrees, and the number and mean token length of Noun Phrase (NP), Verb Phrase (VP) and Adjective Phrase (AdjP) subtrees.
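The intuition about the shape of the n-best distribution can be illustrated with a small sketch. It computes moment-based skew and excess kurtosis over a list of parse log probabilities. The choice of population (biased) estimators and of excess kurtosis is an assumption, as the text does not specify which estimator was used, and the two lists of log probabilities are invented for illustration rather than taken from parser output.

```python
def skew_and_kurtosis(log_probs):
    """Return (skew, excess kurtosis) using population moment estimators."""
    n = len(log_probs)
    mean = sum(log_probs) / n
    m2 = sum((x - mean) ** 2 for x in log_probs) / n
    m3 = sum((x - mean) ** 3 for x in log_probs) / n
    m4 = sum((x - mean) ** 4 for x in log_probs) / n
    if m2 == 0.0:
        # All parses equally likely: no shape information.
        return 0.0, 0.0
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Invented 50-best lists: one clearly preferred parse vs. evenly spread parses.
peaked = [-50.0] + [-80.0 - 0.1 * i for i in range(49)]
flat = [-50.0 - 1.0 * i for i in range(50)]

peaked_skew, peaked_kurt = skew_and_kurtosis(peaked)
flat_skew, flat_kurt = skew_and_kurtosis(flat)
print(peaked_skew, peaked_kurt)
print(flat_skew, flat_kurt)
```

In this toy setting the list with a single dominant parse has positive skew and much higher kurtosis than the evenly spread list, which is the pattern the features are meant to capture.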

Dependency parse features
Using the spaCy dependency parser, we extracted the root word of the dependency tree and its part of speech, the tree height and the subtree height to either side of the root. The part of speech of the root is an indicator of whether the sentence has a main verb. The size of the tree on either side of the root reflects whether a sentence is "top" or "tail" heavy, or more balanced. This feature is based on the principle that sentences are easier to process, and thus are judged to be natural and well worded, if the dependencies of the head are roughly evenly distributed on either side (Temperley, 2008), and that heavy noun phrases are hard to process at the beginning of the sentence (Stallings et al., 1998).

Table 2: Results of baselines, top two feature-based classifiers and models using subset of features.
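The dependency tree height features can be sketched without a parser by representing a parse as a list of head indices. This representation, the function name `tree_heights` and the example parse are hypothetical; the convention that the root token is its own head mirrors how spaCy exposes the root, but the exact feature definitions used in our experiments may differ.

```python
def tree_heights(heads):
    """Compute the root index, tree height and left/right subtree heights.

    heads[i] is the index of token i's head; the root is its own head.
    Height is counted in edges along the longest downward path.
    """
    root = next(i for i, h in enumerate(heads) if h == i)
    children = {i: [] for i in range(len(heads))}
    for i, h in enumerate(heads):
        if i != root:
            children[h].append(i)

    def height(node):
        if not children[node]:
            return 0
        return 1 + max(height(child) for child in children[node])

    left = [c for c in children[root] if c < root]
    right = [c for c in children[root] if c > root]
    return {
        "root": root,
        "height": height(root),
        "left_height": max((1 + height(c) for c in left), default=0),
        "right_height": max((1 + height(c) for c in right), default=0),
    }


# Hypothetical parse of "The journal published several contributions".
result = tree_heights([1, 2, 2, 4, 2])
print(result)
```

A large gap between `left_height` and `right_height` would flag a sentence as front-heavy or tail-heavy.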

Language Model based features
A Language Model (LM) represents the probability distribution of n-grams in a corpus and can measure how "surprised" the model is to see a sentence. We used both POS-based LMs and word-based LMs. For POS-LMs, the POS sequences of each sentence were evaluated against unigram, bigram and trigram POS-based LMs trained on the Wall Street Journal corpus made available in CoNLL2000 (Tjong Kim Sang and Buchholz, 2000). Word-based LMs were trained using the KenLM package (Heafield et al., 2013). We trained two models, one using an English news corpus (made available by Heafield et al., 2013), and the other using WikiText (Merity et al., 2016). The score was calculated as $\log_{10} p(\text{sentence}\ \langle/s\rangle \mid \langle s\rangle)$, where $\langle s\rangle$ and $\langle/s\rangle$ are the symbols for beginning and end of sentence, respectively. This reflects, after seeing a start-of-sentence symbol, the probability of a sentence appearing and being followed by an end-of-sentence token. Perplexity of a sentence was calculated as $10.0^{-\mathrm{score}(\text{sentence}) / (\mathrm{length}(\text{words}) + 1)}$.
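The relation between the sentence score and perplexity can be sketched as follows, assuming the score is the sum of the base-10 log probabilities of each word plus the end-of-sentence token, and that perplexity is 10 raised to the negated score divided by the number of words plus one. The function name and the per-token log probabilities are hypothetical; a trained LM would supply the real values.

```python
def sentence_perplexity(token_log10_probs):
    """Perplexity from per-token log10 probabilities.

    token_log10_probs holds one value per word plus one for the
    end-of-sentence token, so its length is (number of words + 1).
    """
    score = sum(token_log10_probs)  # log10 p(sentence </s> | <s>)
    return 10.0 ** (-score / len(token_log10_probs))


# Hypothetical five-word sentence: five words plus </s>, each with log10 p = -2.
perplexity = sentence_perplexity([-2.0] * 6)
print(perplexity)
```

Dividing by the token count makes the value comparable across sentences of different lengths, which the raw score is not.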

Grammar checker
We used the open source rule-based grammar checker LanguageTool (Naber, 2003) to output a binary label of whether a sentence violates any of the English grammatical rules encoded in this tool.

Data description
We collected our ground-truth evaluations through Amazon Mechanical Turk, asking participants to read machine-generated sentences and judge whether or not they are "perfectly good" English sentences. We opted for a binary judgment task rather than a graded one to make the judgment task simple for participants. The sentences evaluated were 4000 machine-generated sentences from the data released in the 2007/2008 Workshops on Statistical Machine Translation. We did not use the provided human evaluation results because these were evaluations of adequacy, i.e. a mixture of overall quality, content accuracy, and fluency, and the labels were system rankings. We randomly allocated the 4000 generated sentences into 40 lists. Each participant read 100 sentences and judged whether each was a "perfectly good" sentence that would sound grammatical and natural to someone with a high proficiency in English. Each sentence was judged by at least 5 participants. Overall, most sentences received the "Not Perfect" rating (Figure 1). The Fleiss kappa on the whole data set is 0.3. We then categorized sentences into "Perfect" (more than 70% "Perfect" judgments), "Not Perfect" (less than 30% "Perfect" judgments), and "Not Sure" (the remainder). There were 603 "Perfect" sentences and 2637 "Not Perfect" ones, which were used for model training and evaluation. The 929 "Not Sure" sentences were excluded.
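The aggregation of individual judgments into the three categories can be sketched as below. The function name `categorize` and the vote lists are hypothetical; the 70% and 30% thresholds are the ones stated above.

```python
def categorize(votes):
    """Map a list of boolean judgments (True = "Perfect") to a category."""
    share_perfect = sum(votes) / len(votes)
    if share_perfect > 0.7:
        return "Perfect"
    if share_perfect < 0.3:
        return "Not Perfect"
    return "Not Sure"


# Hypothetical sentences, each judged by five participants.
labels = [
    categorize([True, True, True, True, False]),    # 80% "Perfect"
    categorize([False, False, True, False, False]),  # 20% "Perfect"
    categorize([True, True, True, False, False]),    # 60% "Perfect"
]
print(labels)
```

With five judges per sentence, a sentence needs at least four "Perfect" votes to be kept as "Perfect" and at most one to be kept as "Not Perfect".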

Training a classifier: Results
We trained "naturalness" classifiers in two ways: using a deep learning model on sentences represented by FastText word embeddings (Bojanowski et al., 2017), and using linguistic features. The deep learning model uses a pooled bidirectional Gated Recurrent Unit (GRU) architecture (Chung et al., 2014). After excluding data with missing feature values, there were 2934 observations for the models, 512 of which were "perfect". We split the data into three sets of equal size, two for training and one for testing.
Given the small dataset, the deep learning model serves as a baseline. It attained a marginally better weighted F1 than an "assume-all-not-perfect" baseline and a similar accuracy.
For the feature-based models, we scaled numerical features to be centered around 0 with a standard deviation of 1. Categorical features were one-hot encoded so that each level becomes a feature of its own. Using Scikit-learn (Pedregosa et al., 2011), we trained the following classifiers: LinearSVC with L1, L2 or combined penalty, Logistic Regression, KNeighborsClassifier, RandomForest, Perceptron and SGDClassifier, as well as XGBoost (Chen and Guestrin, 2016). We used the optimal hyper-parameters for each classifier acquired after running a 5-fold cross validation. We trained all classifiers 10 times and calculated the mean accuracy and F1 of the 10 sessions. The top six classifiers had very similar performances (Logistic Regression, LinearSVC with L1, L2 or combined penalty, RandomForest, SGDClassifier). We report the mean results of the top two models in Table 2.
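The two preprocessing steps can be sketched in plain Python to make the transformations explicit. The helper names, the feature name `root_pos` and the example columns are hypothetical; in our pipeline the equivalent steps were done with Scikit-learn, and using the population standard deviation here is an assumption made to match standard scaling.

```python
import statistics


def standardize(column):
    """Center a numeric column on 0 and scale it to unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std else 0.0 for x in column]


def one_hot(name, column):
    """Turn each level of a categorical column into its own 0/1 feature."""
    levels = sorted(set(column))
    return {
        f"{name}={level}": [1 if value == level else 0 for value in column]
        for level in levels
    }


# Hypothetical feature columns for three sentences.
scaled_height = standardize([2.0, 4.0, 6.0])
root_pos = one_hot("root_pos", ["VERB", "NOUN", "VERB"])
print(scaled_height)
print(root_pos)
```

Scaling matters mostly for the linear and distance-based classifiers; tree-based models such as RandomForest are largely insensitive to it.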

Error Analysis
When predicting the naturalness of 969 sentences, of which 158 were "Perfect", the top performing RandomForest model labeled 861 out of 969 (88.85%) correctly. It produced 87 incorrect "Not Perfect" labels, and 21 incorrect "Perfect" labels. The incorrect "Not Perfect" labels consisted of three main categories: long sentences (especially those with subordinate clauses), split sentences with inserts (e.g. "I shall, of course, inform the President of your comment.") and non-sentential segments that human judges deemed natural (e.g. "The Value of European Values."). Among the incorrect "Perfect" labels, some were assigned to sentences with isolated grammatical errors, such as incorrect verb agreement (e.g. "The Nobel laureate Gary Becker disagree with this view."), incorrect prepositions (e.g. "The journal Science on the issue last autumn published several contributions."), or word order errors (e.g. "What now we can do?"). The overall impression is that the sentences judged to be "Perfect" by the model are easier to read and less complex than the ones judged to be "Not Perfect".
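The error counts above determine the full confusion matrix for this run, so accuracy and per-class F1 can be recomputed as a check. The sketch below does this arithmetic; the variable names are hypothetical, and the derived F1 values are illustrative arithmetic for this single run rather than the mean figures reported in Table 2.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    return 2 * tp / (2 * tp + fp + fn)


total, n_perfect = 969, 158
missed_perfect = 87  # "Perfect" sentences incorrectly labeled "Not Perfect"
false_perfect = 21   # "Not Perfect" sentences incorrectly labeled "Perfect"

n_not_perfect = total - n_perfect
correct_perfect = n_perfect - missed_perfect
correct_not_perfect = n_not_perfect - false_perfect

accuracy = (correct_perfect + correct_not_perfect) / total
f1_perfect = f1(correct_perfect, false_perfect, missed_perfect)
f1_not_perfect = f1(correct_not_perfect, missed_perfect, false_perfect)
# Weighted F1: per-class F1 averaged with class sizes as weights.
weighted_f1 = (n_perfect * f1_perfect + n_not_perfect * f1_not_perfect) / total

print(f"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f}")
```

The gap between the two per-class F1 values shows that most of the remaining errors are "Perfect" sentences that the model rejects.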

Feature Analysis
Different classifiers agreed on the top weighted features, but gave different rankings to features with lighter weight. The highest ranking feature for the top six classifiers is the parser-reranker probability, echoing previous findings that parse probability can be used to evaluate grammaticality (Mutton et al., 2007). Other top features include number of tokens, number of verbs, constituency tree height and dependency tree height. The effectiveness of Language Model perplexity and score is sensitive to the corpora that the model is trained on. In this experiment, LM features trained on the Wikipedia data gave the whole model a .02% boost in F1 compared to LM scores trained on news corpora. We also tested a classifier that used the language model perplexity as the only feature in training and testing, and found this to be less accurate. This indicates that although a language model captures some notion of the likelihood of a sentence, it does not fully encapsulate all that is involved in making a sentence sound natural. Perhaps surprisingly, LanguageTool contributed very little. We realized that the rules it uses to detect grammatical errors are mostly linear and struggle with constituents involving longer dependencies. For example, LanguageTool judged the sentence "I represent a number of sugar beet growers and I am therefore very concerned." to violate the rule "MANY_NN_U", meaning that the quantifier "a number of" is followed by the uncountable noun "sugar", while the actual head noun is "growers".
For a feature ablation study, we used the Scikit-learn implementation of Recursive Feature Elimination to identify which features contributed most to the best performing model, the RandomForest model. Retraining and testing on subsets of features showed that using just the 11 best-performing features achieves the same F1 and accuracy as the model that used all the features. Adding lower-ranked features beyond that brought no significant additional benefit (Figure 2). These 11 features were: parser probability, reranker probability, reranker score kurtosis, reranker score skew, average length of verb phrases, the POS language model bigram score, root TTR, corrected TTR, lexical repetition, language model score and language model perplexity.

Model and Feature Set transferability
How well would our naturalness model trained on a small dataset in one domain (MT-generated sentences about European politics) perform on an entirely different domain? To test the transferability, we used data provided by Novikova et al. (2017a) (https://github.com/jeknov/EMNLP_17_submission) of sentences produced by NLG systems participating in an end-to-end (E2E) NLG challenge (Novikova et al., 2017b). We used the data from the lexicalised SFRES and SFHOT datasets and the system outputs from the LOLS (Lampouras and Vlachos, 2016) and RNNLG (Wen et al., 2015) NLG systems. These sentences describe restaurant types, locations and categories to convey information given in a slot+value meaning representation. This provided 1954 unique sentences. We used the annotations for naturalness that human evaluators had provided on a 6-point Likert scale in response to the question 'Could the utterance have been produced by a native speaker?'. For each unique system-generated response we took the mean naturalness score across the different annotators. As our model was trained for the task of identifying data as "perfect" versus "imperfect", we set a high threshold for naturalness: responses with a mean naturalness rating of greater than or equal to 5 and no single naturalness score below 5 were given a ground truth of 'perfect'. This resulted in 426 "perfect" targets out of 1954 sentences. Using the model described above to predict the naturalness of this dataset resulted in an accuracy of .70 and a weighted F1 of .69. As a baseline for this dataset, always predicting 'imperfect' would have an accuracy of .78 and a weighted F1 of .68. Additionally, we used our classifier training and testing pipeline on this dataset, training on two thirds of the data (1309 sentences) and testing on the other third (645 sentences, of which 126 were 'perfect'). This surpassed the baseline for this dataset: across ten repetitions the mean weighted F1 was .73 and accuracy was .83.
Repeating the exercise with just the top 11 features identified during the Feature Analysis above also surpassed the baseline, though the result was lower than with the full feature set: a mean weighted F1 of .73 and an accuracy of .80 (always predicting 'imperfect' would achieve an F1 of .72 and an accuracy of .80). The model's predictions for this test set correlated weakly with the mean naturalness score, with a Spearman's ρ of 0.23 (p < 0.001) (Table 3). Though this correlation is not very strong, it is notable that it is stronger than the correlations of all the word-overlap metrics investigated by Novikova et al. (2017a), and that it does not require a reference text.
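For reference, Spearman's ρ is the Pearson correlation of the rank-transformed values, with tied values sharing their average rank. The sketch below implements this directly; the function names are hypothetical, and the predictions and mean naturalness scores are invented toy values, not data from the experiment.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


# Invented binary predictions (1 = "perfect") and mean naturalness ratings.
predicted = [1, 0, 0, 1, 0, 1]
mean_naturalness = [5.5, 3.0, 4.5, 5.0, 2.0, 4.0]
rho = spearman(predicted, mean_naturalness)
print(rho)
```

Because binary predictions produce many ties, the tie handling in `average_ranks` has a large effect on the value obtained.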
We also tested transferability with data from the WebNLG challenge (Gardent et al., 2017; http://webnlg.loria.fr/pages/challenge.html) in order to test on more diverse content about different topics. The WebNLG data consists of sets of triples extracted from DBPedia across 15 different categories carefully designed to be varied. Utterances generated by WebNLG Challenge entrants underwent human annotation by participants from English-speaking countries. We used the annotations for fluency and grammaticality (https://gitlab.com/shimorina/webnlg-human-evaluation/), which were graded separately, each on a three-point Likert scale. We set the ground truth of 'perfect' for those sentences which had a mean fluency and grammaticality annotation greater than or equal to 2.6 with no single annotation lower than 2. This gave us 1959 unique sentences of which 624 were 'perfect'. Our original model's predictions resulted in an accuracy of 0.68 and a weighted F1 of 0.61. A baseline for this dataset that always predicted 'imperfect' would have an accuracy of 0.78 and an F1 of 0.55. As with the E2E set, performance [...] We use the BLEU scores that had been calculated using the dataset's reference sentences to compare BLEU's correlation with fluency and grammaticality judgments and the correlation with our model's predictions. The original model correlates very weakly with mean fluency score (Spearman's ρ 0.08, p < 0.001) and does not correlate significantly with mean grammaticality score (p > 0.05). However, when trained on this task, the model's predictions were moderately and significantly positively correlated with the mean fluency and grammaticality ratings (Table 4). The correlation with BLEU is weaker on this test set: trained on data from this task, we achieve better correlation with fluency and in particular grammaticality judgments than BLEU.
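The conversion of graded annotations into a binary ground truth, used for both transfer datasets, can be sketched as a single thresholding function. The function name and the example annotation lists are hypothetical, and pooling all of a sentence's annotations into one list is an assumption about how the separate ratings were combined.

```python
def is_perfect(scores, min_mean, min_single):
    """Label a sentence 'perfect' if its mean rating reaches min_mean
    and no individual rating falls below min_single."""
    mean_score = sum(scores) / len(scores)
    return mean_score >= min_mean and min(scores) >= min_single


# E2E-style ratings on a 6-point scale: mean >= 5 and no rating below 5.
e2e_labels = [
    is_perfect([5, 6, 5], min_mean=5, min_single=5),
    is_perfect([6, 6, 4], min_mean=5, min_single=5),
]

# WebNLG-style ratings on a 3-point scale: mean >= 2.6 and no rating below 2.
webnlg_labels = [
    is_perfect([3, 3, 2], min_mean=2.6, min_single=2),
    is_perfect([3, 2, 2], min_mean=2.6, min_single=2),
]
print(e2e_labels, webnlg_labels)
```

The second E2E example shows why the per-rating floor matters: its mean is above 5, but a single low rating is enough to exclude it from the 'perfect' class.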
This exercise shows that while our model may have limited direct transferability when there are significant differences between the type of sentences seen in the training data domain versus the test domain, our feature-based method and feature set are more transferable than the model itself. When trained on data for a different task, different features from the set can contribute to identifying what constitutes a high quality sentence in this genre. This approach could be used to evaluate the naturalness of generated text for a particular task by using a small set of human-annotated data to train a model that can cheaply and easily be used over a larger quantity of data to give an indication of the naturalness.

Conclusions and Future Work
We presented a linguistic feature based approach to automatic naturalness evaluation of machine generated text, building on findings from L2 assessment research. Our experiment using a small dataset showed promising results suggesting that this is a viable path towards scalable naturalness evaluation of machine-generated text, with potential for interpretability which can help identify and prioritize improvements to an NLG system during development. In future work, we aim to extend this approach to outputs in multiple languages and multiple domains to further assess the transferability of the approach and of specific models. We will go beyond a binary classification of "perfect" versus "imperfect" to better account for cases where there is inter-speaker variation in naturalness judgments. We also plan to investigate improving deep neural models by adopting recent advancements in contextualized deep word and sentence embeddings (Peters et al., 2018;Perone et al., 2018) and transfer learning in sentence representation (Howard and Ruder, 2018;Radford et al., 2018).