Native-like Expression Identification by Contrasting Native and Proficient Second Language Speakers

We propose a novel task of native-like expression identification by contrasting texts written by native speakers and those by proficient second language speakers. This task is highly challenging mainly because 1) the combinatorial nature of expressions prevents us from choosing candidate expressions a priori and 2) the distributions of the two types of texts overlap considerably. Our solution to the first problem is to combine a powerful neural network-based classifier of sentence-level nativeness with an explainability method that measures an approximate contribution of a given expression to the classifier’s prediction. To address the second problem, we introduce a special label neutral and reformulate the classification task as complementary-label learning. Our crowdsourcing-based evaluation and in-depth analysis suggest that our method successfully uncovers linguistically interesting usages distinctive of native speech.


Introduction
We propose a novel task of native-like expression identification (NLEI) by contrasting texts written by native and proficient second language (L2) speakers. Our primary motivation lies in the observation that native English speakers often fail to be understood even by proficient L2 speakers (Hazel, 2016). Take the following sentence for example: Could you give me a ballpark figure?
Ballpark figure is a fairly common American English idiom meaning "an approximate figure or quantity". Despite being a simple combination of two basic words, this expression is enigmatic to many L2 speakers, leading to communication breakdowns. We will refer to such expressions as native-like expressions. We believe that native speakers, no less than L2 speakers, would do well to adapt their speech in an international setting so as to maximize mutual comprehension, and that one effective strategy is to avoid native-like expressions of this sort. However, the hegemony of English in international communities has led to many monolingual English speakers lacking the notion of what it is like to speak a second language, making it particularly difficult for them to identify native-like expressions on their own. For this reason, a system that automatically detects native-like expressions would help native speakers improve their international outlook.
NLEI itself has other potential applications. It could prove useful as a tool for advanced language learners to find new, fluent expressions to acquire in any given text. It can also be used as a method of linguistic inquiry for examining the differences between first- and second-language acquisition. If we manage to draw cross-lingually valid generalizations from English data, they would have a significant impact. In many other languages, there has been an increasing recognition of the importance of effective communication with linguistic minorities, especially in emergency situations (Uekusa, 2019), but the available collections of texts written by L2 speakers are too small to enable data-driven approaches.
Figure 1: Overview of our task. (a) A schematic illustration of texts written by native and L2 English speakers. We assume that the two types of distributions are different but nevertheless overlap considerably. We work on identifying a subregion characteristic of native speakers. (b) The proposed method. We train a neural network-based sentence classifier that gives a native score to a given input. After that, we feed a candidate expression, in addition to the input sentence, to the classifier to approximately divide the native score into the contribution made by the candidate expression and that made by the rest of the sentence.
NLEI poses two key technical challenges. First, native-like expressions are often context-sensitive and, more importantly, can consist of several words. In traditional word-based approaches, simple frequency-based statistical tests such as a chi-squared test (Baayen, 2008) can be used to contrast text written by native and proficient L2 speakers. Such statistical tests are not applicable to our task because the combinatorial nature of native-like expressions prevents us from choosing candidate expressions a priori. Second, the distributions of the two types of texts cannot be totally separated because, as illustrated in Figure 1(a), they overlap considerably. This hampers discriminative approaches to the task, which are generally more powerful than simple statistical tests.
To address the first challenge, we combine a neural network with an explainability method. We build a sentence-level classifier using a BiLSTM with subword embeddings as input, with the hope that it is expressive enough to implicitly exploit native-like expressions for classification. After narrowing down the targets to native-like sentences, we use a method named contextual decomposition (Murdoch et al., 2018) to measure an approximate contribution of a given expression to the classifier's prediction (Figure 1(b)). We repeat this for multiple candidate expressions in a given sentence to choose the most appropriate one.
We address the inseparability problem by introducing a special label neutral to indicate the overlapping region. The resulting three-way classifier can be trained under the framework of complementary-label learning (Ishida et al., 2017). By letting non-distinctive sentences be absorbed into neutral, the classifier is able to choose distinctively native-like sentences from sentences written by native speakers.
Due to the exploratory nature of the task, it is hard to evaluate the proposed method. Nevertheless, we performed crowdsourcing for quantitative evaluation, which was followed by in-depth manual analysis of the detected expressions. We found that the scores given by the proposed method weakly correlated with aggregated ratings provided by L2 crowdworkers. Remarkably, the proposed method often identified expressions that consisted of words so basic that the traditional word-based models would have deemed them easy for L2 speakers.

Word-based Models for Second Language Learners
A growing body of research adopts NLP techniques to assist second language learners. While grammatical error correction (Dale and Kilgarriff, 2011; Ng et al., 2013; Ng et al., 2014) has arguably been the most actively studied, a number of researchers have also worked on identifying words that are difficult for learners.
The existing approaches to modeling words can be grouped into type-based and token-based ones.
The goal of type-based approaches is to estimate learners' vocabulary proficiency. To predict whether a learner knows a given word, logistic models based on item response theory are often used (Ehara et al., 2012; Ehara et al., 2013). Token-based approaches are formalized as complex word identification (CWI) (Paetzold and Specia, 2016; Yimam et al., 2018). CWI is a task that aims to identify words in texts that might present a challenge for target readers, who are often but not always second language learners. Our task is closer to CWI in that both tasks are designed to handle context sensitivity: depending on surrounding context, a given expression can convey different meanings and hence can be easy or difficult. As the name suggests, however, the focus of CWI is on words, although phrases are not entirely absent from the data. Our departure from word-based models is motivated by skepticism about the idea that simpler words result in better comprehension. Another major difference is that while CWI is usually framed as a supervised learning task where words in texts are annotated regarding complexity, we assume that in our task, only writer attributes are available as indirect supervision signals. Finally, our overarching goal is different from those of learner-oriented studies in that we aim to change the behavior of native speakers, rather than L2 speakers, in the settings of international communication.

Controlled Languages and Text Simplification
There have been a number of attempts to create lexically and grammatically restricted subsets of English, grouped under the umbrella term "controlled languages" (CLs). Although a large portion of CLs are domain-specific, there also exist general-purpose CLs, most notably Basic English (Ogden, 1930), one of the oldest and most influential. The vocabulary of Basic English is stripped down from regular English to 850 word forms, with verbs especially restricted to just 18.
The largest collection of texts that is claimed to use Basic English is the Simple English Wikipedia (SEW). SEW contributors purport to adhere to the principle of using "simpler" vocabulary and avoiding idioms,1 but this is not strict, and the case-by-case judgments are largely left to writers' and editors' discretion. The surprising lack of simplicity in SEW has been pointed out by Xu et al. (2015). Another study concludes that in practice the vocabulary richness of SEW is the same as in regular English Wikipedia (EW), and that the decrease in complexity is mostly due to the use of shorter sentences, while syntax itself is not drastically simplified (Yasseri et al., 2012). Finally, most of the editors seem to be native speakers of English, and SEW is apparently failing to reach its target audience of L2 speakers, students, and developmentally disabled people.
SEW, aligned with EW, is often used for the task of text simplification (Alva-Manchego et al., 2020). A popular subtask of text simplification is lexical simplification, where difficult words are substituted with simple ones. In fact, CWI is often treated as a prerequisite of lexical simplification. Although we limit ourselves to detection, paraphrasing is an interesting direction to explore. It should be noted again that our focus on native speakers and longer expressions makes our task unique and distinctive.

Native Language Identification
From a technological point of view, the proposed method has a close connection to native language identification (NLI) (Koppel et al., 2005; Malmasi et al., 2017; Goldin et al., 2018). In its simplest form, NLI is formalized as a binary classification task where the goal is to determine whether the writer is a native or L2 speaker. Goldin et al. (2018) worked on an English corpus in which the L2 speakers were highly advanced, almost at the level of native speakers. As described in Section 5.1, we use this dataset in our task.
Although we also train a classifier, an important difference from NLI is that classification is not our goal but an intermediate task. While Goldin et al. (2018) used as input a text chunk large enough to reveal the writer's identity, we classify single sentences in order to narrow down potential occurrences of native-like expressions. To this end, we seek to eliminate the impact of extralinguistic patterns as much as possible, whereas Goldin et al. (2018) exploited them to improve classification performance. For example, the place name New Orleans is suggestive of American identity. In our task, however, it is to be masked because it is not a native-like expression.

Native-like Expression Identification
Given texts written by native and proficient L2 speakers, the task of NLEI is to find any possible native-like expressions in them. We assume that each sentence is tied to a writer-attribute label, native writer or L2 writer. It is important to note that we assume no annotation of native-like expressions themselves.
Due to the exploratory nature of the task, it is difficult to give a precise definition of native-like expressions. Any word or sequence of words is considered native-like if it is more commonly used by native speakers than by L2 speakers. 2 An additional condition is that native-like expressions must not include domain-specific words or named entities. Since the topics that native speakers write about often differ from those of L2 speakers, domain-specific expressions such as baseball and electoral wipeout, as well as country names, could help reliably distinguish the two groups. However, since they do not constitute the distinctive usages that we aim to identify, we do not treat them as native-like expressions.
There are no clear-cut criteria for determining the boundary of a native-like expression. From the sentence,

What on Earth are you yammering on about?
some might want to extract yammering on while others may prefer yammering on about. We decide to select at most one native-like expression per sentence to sidestep the overlap problem. We also adopt a relaxed matching criterion for evaluation.

Proposed Method
We build a neural network-based classifier that receives a sentence as input and returns a three-dimensional vector representing the labels native, neutral, and L2 (Section 4.1). It is trained under the framework of complementary-label learning (Section 4.2). The classifier is then combined with an explainability method named contextual decomposition to locate a native-like expression in a given native-like sentence (Section 4.3).

BiLSTM Classifier of Sentence-level Nativeness
For a given sentence with tokens x_{1:N} = (x_1, ..., x_N), the classifier outputs a three-dimensional score vector y = (y_native, y_neutral, y_L2) ∈ R^3. We build the classifier with a BiLSTM (Graves and Schmidhuber, 2005). After transforming the input into an embedding sequence e_{1:N} = (e_1, ..., e_N), we feed the vectors into the BiLSTM to obtain context-aware representations h_{1:N} = BiLSTM(e_{1:N}). We then apply average pooling and two linear transformations with a hyperbolic tangent activation function to obtain the output vector: y = Linear(tanh(Linear(AvgPooling(h_{1:N})))).
For classification, we select argmax_i y_i, but we are more interested in the vector y itself.
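To make the architecture concrete, here is a minimal PyTorch sketch of the classifier described above. The module and variable names are ours, the embedding and hidden dimensions follow Section 5.2 (768-dimensional frozen embeddings, 200-dimensional BiLSTM), and the intermediate layer size is an assumption; this is a sketch, not a released implementation.

```python
import torch
import torch.nn as nn

class NativenessClassifier(nn.Module):
    """BiLSTM sentence classifier producing (native, neutral, L2) scores."""

    def __init__(self, embed_dim=768, hidden_dim=200, num_labels=3):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)  # intermediate size is our guess
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim); pre-trained and frozen (Section 5.2)
        h, _ = self.bilstm(embeddings)        # h: (batch, seq_len, 2 * hidden_dim)
        pooled = h.mean(dim=1)                # average pooling over tokens
        return self.out(torch.tanh(self.hidden(pooled)))  # y = Linear(tanh(Linear(...)))
```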

Complementary-Label Learning
Even though we aim to build a three-way classifier, we only have access to two writer-attribute labels, native writer and L2 writer. The trick we employ here is called complementary-label learning (Ishida et al., 2017). A complementary label specifies a class to which the input does not belong. In our task, the writer-attribute label native writer is recast as a complementary sentence label not-L2, meaning that the input may belong to either native or neutral but certainly not to L2. Similarly, the writer-attribute label L2 writer is mapped to the complementary sentence label not-native (either neutral or L2). Our setting is unusual for a complementary-label learning task in that we never observe not-neutral, but it does work in practice. (Complementary-label learning was originally formalized by Ishida et al. (2017); we adopt a later variant that removes the assumption that complementary labels are uniformly chosen during data construction.) To define the loss, we normalize y with a softmax and project the result into the space of complementary labels, not-native, not-neutral, and not-L2 in this order: g = Q^T softmax(y), where Q is a transition matrix from sentence labels to complementary labels and d is a small discount factor that smooths Q's entries, which we found stabilized training. For simplicity, we assume that the sets of sentences labeled not-native and not-L2 are equal in size, although it is not difficult to handle data imbalance. We compute the cross-entropy loss of g with respect to the complementary label and perform backpropagation to update the parameters.
To gain intuition, suppose that the classifier vacillates between native and neutral during training. Q moves the (normalized) native score directly to not-L2 (i.e., native writer), while the neutral score is distributed evenly between not-native (L2 writer) and not-L2 (native writer). Since the reference label is either not-native or not-L2, the neutral score leads to a stable, moderate loss. That makes native (and L2) a relatively high-risk/high-return bet, yielding either a large loss or a small one. Thus, the classifier ends up giving a relatively large score to neutral when the input is not distinctive.
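The following sketch spells out one plausible implementation of this loss. The transition matrix Q is our reconstruction from the qualitative description above (native maps entirely to not-L2, L2 to not-native, and neutral splits evenly), and the uniform smoothing by d is an assumption about how the discount enters; the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

D = 0.001  # discount factor d (value from Section 5.2)

# Rows: sentence labels (native, neutral, L2).
# Columns: complementary labels (not-native, not-neutral, not-L2).
Q = torch.tensor([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.5],
                  [1.0, 0.0, 0.0]])

def complementary_loss(y, comp_labels):
    """y: (batch, 3) raw scores; comp_labels: 0 for not-native, 2 for not-L2."""
    p = F.softmax(y, dim=-1)            # normalize the sentence-label scores
    g = p @ ((1 - D) * Q + D / 3)       # project into complementary-label space
    return F.nll_loss(torch.log(g + 1e-12), comp_labels)  # cross-entropy on g
```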

Contextual Decomposition for Finding Native-like Expressions
Once the classifier chooses sentences with high native scores, we want to locate native-like expressions in them. Let S ⊆ {1, ..., N} be the subset of token positions that we consider as a candidate expression. We use contextual decomposition (CD) (Murdoch et al., 2018) to calculate the approximate contribution of S to the native score. We repeat this for multiple candidate expressions in a given sentence to choose an appropriate one. The key idea of CD is that if a decomposition operation is defined for every neural network layer, we can propagate the decomposed input to a decomposed output by simply tracing the forward computation. For a vector v going inside the network, let β(v) and γ(v) be the contributions of S and its complement, respectively (v = β(v) + γ(v)). The decomposition is straightforward for the embedding layer: β(e_i) = e_i and γ(e_i) = 0 if i ∈ S; otherwise β(e_i) = 0 and γ(e_i) = e_i.
For a linear layer with a weight matrix W and a bias b, the input v_{L-1} = β(v_{L-1}) + γ(v_{L-1}) is decomposed as
β(v_L) = W β(v_{L-1}) + β(b),  γ(v_L) = W γ(v_{L-1}) + γ(b).
The first terms are, again, straightforward: the weight matrix is multiplied individually by β(v_{L-1}) and γ(v_{L-1}). The remaining problem is how to partition the bias term b into β(b) and γ(b). Partitioning the bias in proportion to the absolute values of the first terms has been found to work well empirically.
The hyperbolic tangent is non-linear, but the following formulae provide a good linearization of the relationship between the input v_{L-1} = β(v_{L-1}) + γ(v_{L-1}) and the output v_L = β(v_L) + γ(v_L):
β(v_L) = 1/2 [tanh(β(v_{L-1})) + (tanh(v_{L-1}) − tanh(γ(v_{L-1})))],
γ(v_L) = 1/2 [tanh(γ(v_{L-1})) + (tanh(v_{L-1}) − tanh(β(v_{L-1})))].
Murdoch et al. (2018) elaborate a decomposition operation for the LSTM. It is easy to extend the operation to the BiLSTM because the concatenation of the forward and backward hidden vectors is linear. The remaining layer, average pooling, is also linear. In this way, the output y is decomposed into β(y) and γ(y). We use the native component, s_CD = β(y)_native, as the CD score of the candidate expression.
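As a concrete illustration, here is a hedged PyTorch sketch of the two decomposition rules just described, assuming W and b come from an nn.Linear layer; the full method also requires the LSTM-gate decomposition of Murdoch et al. (2018), which is omitted here.

```python
import torch

def decompose_linear(beta, gamma, W, b):
    """CD rule for a linear layer (W: (out, in), b: (out,)): the bias is
    split in proportion to the absolute values of the weight-multiplied terms."""
    wb, wg = beta @ W.T, gamma @ W.T
    denom = wb.abs() + wg.abs() + 1e-12
    return wb + b * wb.abs() / denom, wg + b * wg.abs() / denom

def decompose_tanh(beta, gamma):
    """Linearization of tanh following Murdoch et al. (2018); the two
    outputs sum exactly to tanh(beta + gamma)."""
    total = torch.tanh(beta + gamma)
    beta_out = 0.5 * (torch.tanh(beta) + (total - torch.tanh(gamma)))
    gamma_out = 0.5 * (torch.tanh(gamma) + (total - torch.tanh(beta)))
    return beta_out, gamma_out
```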

Data and Preprocessing
We used the L2-Reddit corpus, a collection of native and L2 English sentences extracted from Reddit, a community-driven discussion website. On some portions of Reddit, a user can indicate his/her country of origin with a metadata tag, which can be used to infer the user's native language. 4 All submissions by such users were extracted and split into sentences, resulting in a corpus of over 250 million sentences produced by 45,000 users from 50 countries. The label native writer was assigned to sentences produced by users from Australia, Canada, Ireland, New Zealand, the United Kingdom, and the U.S., while the remaining sentences were given the label L2 writer.
A notable feature of the L2-Reddit corpus is that the proficiency level of the L2 English utterances tends to be very high: spelling and grammar mistakes are very uncommon, and the syntactic complexity and use of colloquialisms make L2 utterances nearly indistinguishable from native-produced ones. This gives us reason to hope that the expressions that receive large native scores from our method will provide insight into the distinctive features of native English, since in most other respects the utterances are very similar.
To reduce the impact of noisy data on the classifier, we removed certain kinds of sentences from the dataset: a) very long sentences (more than 80 tokens), b) very short sentences (less than 6 tokens), c) repeating sentences (found more than 5 times in the corpus), d) sentences containing too much punctuation (more than 15% of total tokens in the sentence), and e) sentences containing too many named entities (see below; more than 25% of total tokens in the sentence).
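A minimal sketch of these filters, assuming pre-tokenized sentences; the punctuation test (a token with no alphanumeric characters) is our own heuristic reading of criterion (d).

```python
def keep_sentence(tokens, corpus_count, num_entity_tokens):
    """Return True if the tokenized sentence survives filters (a)-(e)."""
    n = len(tokens)
    if n > 80 or n < 6:                          # (a) too long, (b) too short
        return False
    if corpus_count > 5:                         # (c) repeated sentence
        return False
    num_punct = sum(all(not c.isalnum() for c in t) for t in tokens)
    if num_punct / n > 0.15:                     # (d) too much punctuation
        return False
    if num_entity_tokens / n > 0.25:             # (e) too many named entities
        return False
    return True
```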
To further minimize differences between the native and L2 sentences in the corpus, we masked all named entities, most often proper nouns such as Bay Area, Canadians, and Angela Merkel. This has the effect of reducing the classifier's ability to rely on country-specific topical content and other "easy" clues as to whether a sentence is native-produced or not. The named entities were detected using several algorithms from the Stanford CoreNLP package (Manning et al., 2014).
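For illustration, here is a sketch of such masking using the Python stanza package (a successor to CoreNLP from the same group) rather than the Java CoreNLP pipeline the authors used; the [ENT] placeholder token is our assumption.

```python
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

def mask_entities(sentence, mask_token="[ENT]"):
    """Replace every named-entity span in the sentence with a placeholder."""
    doc = nlp(sentence)
    pieces, prev_end = [], 0
    for ent in doc.ents:                  # entity spans with character offsets
        pieces.append(sentence[prev_end:ent.start_char])
        pieces.append(mask_token)
        prev_end = ent.end_char
    pieces.append(sentence[prev_end:])
    return "".join(pieces)
```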
As a result of preprocessing, the original corpus of some 240M sentences was reduced to 146M sentences (or about 73M each of native and L2 sentences). The training, validation, and test subsets amounted, respectively, to 95%, 1%, and 4% of the final dataset.

Details of the Proposed Method
We tokenized sentences into WordPiece (Wu et al., 2016) subword tokens with a 30,000-token vocabulary, and used the pre-trained 768-dimensional embeddings provided by Devlin et al. (2019), available through the transformers 5 package. We did not update the embeddings during training because, somewhat surprisingly, this setting yielded more intuitively plausible CD scores than a fine-tuned model and a cold-start model. We used a BiLSTM with 200-dimensional hidden vectors, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, a batch size of 1,024 sentences, and a discount parameter d of 0.001. We selected the model state after 3 epochs of training because it achieved the best validation score.
The classifier labeled the sentences in the test data as follows: 8.2% native, 81.7% neutral, and 10.2% L2. This result was consistent with our observation that the distributions of sentences written by native and L2 speakers have a huge overlap. Removing sentences labeled neutral, we focused on classification performance for the remaining sentences. The output label native was judged correct if the given sentence was tied to native writer. For the native label, we obtained a precision of 79.5%, a recall of 81.1%, and an F1 score of 80.3%. In preliminary experiments, we confirmed that a baseline binary classifier of the writer-attribute labels topped out at around 60% accuracy. The large jump in performance indicates that the proposed three-way classifier successfully identified distinctively native-like sentences by letting non-distinctive sentences be absorbed into neutral.
Next, we identified native-like expressions from sentences in the training data. We began by selecting native writer sentences classified as native. There were multiple ways to enumerate candidate expressions, including systems for constituency parsing and multi-word expression identification. For simplicity, however, we opted for extracting all up-to-5-grams (unigrams, bigrams, ..., 5-grams) within the sentence. We selected sentences for which the largest CD score was no less than 0.7. In many cases, more than one candidate expression in a sentence surpassed this threshold. To avoid the overlap problem, and for reasons of evaluation convenience, we extracted only the first candidate. With the threshold of 0.7 for the CD scores, about 71% of the native-like sentences remained.
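The selection step might look as follows in outline; cd_score is a placeholder for the CD computation of Section 4.3, and the enumeration order that determines the "first" candidate is our guess.

```python
def first_native_like_span(tokens, cd_score, max_n=5, threshold=0.7):
    """Enumerate all 1- to 5-gram spans; if the best CD score reaches the
    threshold, return the first span that does."""
    spans = [(i, i + n) for n in range(1, max_n + 1)
             for i in range(len(tokens) - n + 1)]
    scores = [cd_score(tokens, span) for span in spans]
    if not scores or max(scores) < threshold:
        return None                 # sentence is discarded
    return next(s for s, sc in zip(spans, scores) if sc >= threshold)
```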

Crowdsourcing-based Evaluation
To quantitatively evaluate the results, we asked L2 crowdworkers to rate the detected expressions. Before the actual evaluation, we conducted additional rule-based filtering and some screening by native speakers of English, as explained in detail in Appendix A.
We hired crowdworkers on the Amazon Mechanical Turk platform. Since this platform did not provide an option to filter workers by native language, we resorted to using country of residence as a proxy. We only accepted answers from workers not residing in Australia, Canada, Ireland, New Zealand, the United Kingdom, or the U.S., the same six countries chosen for the L2-Reddit corpus.
The task screen is shown in Figure 2. For each sentence, five crowdworkers were asked to read the sentence with the relevant expression highlighted and to answer a multiple-choice question whose options range roughly from least to most familiar; we abbreviate them as hard-sent, hard, known, and used.
It turned out that the L2 crowdworkers were, or at least pretended to be, highly proficient in English. The distribution of the answers was skewed toward used: 3.8% for hard-sent, 8.9% for hard, 10.0% for known, and 78.3% for used. Ignoring hard-sent answers, we aggregated the multiple answers for each sentence into a single scalar value: s_L2 = (0 × n_hard + 1 × n_known + 2 × n_used) / (n_hard + n_known + n_used), where n_* denotes the corresponding answer count. The lower s_L2 is, the less familiar the expression is to L2 speakers. For comparison, we employed the L2 vocabulary knowledge data collected by Ehara et al. (2013). Built on top of the Standard Vocabulary List (SVL12000), a list of fundamental words for English learners, this dataset was constructed by asking 16 learners of English to rate their degree of knowledge of each word (lower is less familiar). For each word, we took the average of the 16 scores. To apply the word-based rating to expressions, we chose the word with the lowest score (i.e., the least familiar) and took that to be the score for the entire expression. We refer to this score as s_word.
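Both scores are simple to compute; the sketch below restates them, with svl_scores standing in as a hypothetical word-to-rating lookup built from the SVL12000 annotations.

```python
def aggregate_l2_score(n_hard, n_known, n_used):
    """s_L2: hard-sent answers are assumed to be excluded beforehand."""
    total = n_hard + n_known + n_used
    return (0 * n_hard + 1 * n_known + 2 * n_used) / total

def word_based_score(expression_words, svl_scores):
    """s_word: the least familiar word's rating stands in for the expression."""
    return min(svl_scores[w] for w in expression_words)
```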
Note that we performed lemmatization for word lookup. We found this process error-prone, as there were many false negatives. For a fair comparison, we dropped expressions for which we failed to find any words in the list, leading to a 7% reduction of the annotated dataset.
s_CD exhibited a weak negative correlation with s_L2, with Pearson's r = −0.26. For comparison, s_word had a stronger correlation of 0.37, also in the range of weak correlation. s_word's relatively strong performance is understandable given that, albeit word-based, it reflects direct human supervision while the proposed method only exploits sentence-level signals. It is interesting that s_CD correlated very weakly with s_word (r = −0.09), indicating that the two methods explored very different phenomena. We investigate this point in the next section.

In-depth Analysis
We manually evaluated a random sample of 100 sentences drawn from those for which an expression with a CD score of at least 0.7 was found. According to our subjective analysis, 47 expressions were good (i.e., contained linguistically interesting usages), and the remaining 53 expressions were classified into three groups: 36 domain-specific expressions, 5 named entity detection failures, and 12 others. 6 Looking first at the not-good expressions, the domain-specific expressions constituted the largest group. These expressions tended to consist of (or contain) words and collocations related to sports, local politics, legal matters, academic affairs, and other subjects that the native speakers in the data were more likely to talk about:
No, its pretty damn conclusive, that gun control and gun ownership have no effect on a nations violence rate.
The debate over gun control and gun ownership is a uniquely American topic, but the phrase itself is not linguistically interesting because its meaning is transparent to L2 speakers. Unlike named entities, common nouns can hardly be masked at the preprocessing step. A possible solution to this problem is to perform domain-adversarial training (Ganin et al., 2016) in a topic-aware manner. Additional metadata provided by Reddit may help.
There were also occasional named entities that were not detected by the NER pipeline, often due to lack of proper capitalization in the original sentence. Most of these are of little interest to us:
Oh no, the packers are infighting? (team name)
The last group included the expressions that did not fit neatly into the other two categories. Some of these were hard to understand or interpret qualitatively (and may thus be treated as noise), while others simply did not seem interesting enough to be considered good.
As for good expressions, we found only 10 out of 47 expressions containing difficult-looking words: They started with a wildly lopsided pitching match-up and kept it respectable.
Seems to be privy to things that normal bloggers are not.
This is a bit surprising but consistent with the very weak correlation between s CD and s word . Indeed, many expressions detected by the proposed method were combinations of basic words: (1) Good luck with it, but I think the location is going to be a hard sell.
(2) Do you find that the objectionable ones get better as the glasses and bottles sit out a bit?
(3) The ATM at my old work would occasionally toss out bills stuck together.
It is no wonder that s_word wrongly indicated high familiarity for the expressions in these sentences. Even if not incomprehensible, these expressions appear to require some effort from L2 speakers to be understood. In the case of sentence (1) above, the expression contains the less-common idiom to be a hard sell, which in this case means "something that it is difficult to persuade people to buy or accept". 7 In this case, context needs to be taken into account for correct comprehension. Sentence (2) is harder still: the phrasal verb sit out seems to be used creatively to mean that the glasses and bottles, ostensibly containing some sort of beverage, need to stay undisturbed for a while. This idiosyncratic usage, compounded with a bit denoting a relatively short span of time, can prove exceptionally hard for an L2 speaker to "parse out", especially as part of a spoken utterance.
Moreover, it is highly unlikely that L2 speakers could easily imitate the way these basic words are combined. To express a given idea, they would have chosen words with more prototypical meanings in mind and combined them in a more semantically compositional (Crossley and McNamara, 2009) and syntactically straightforward (Wray, 1999) manner. The expression in sentence (3) contains both a creative choice of phrasing (toss out instead of, say, return) and a reduced relative clause attached to a noun: bills stuck together. This latter construction often causes garden-path effects in reading comprehension and is likely to present particular difficulties for L2 speakers (Juffs, 1998).
In some cases, the detected expressions were not satisfactory with respect to coverage, but the sentences themselves were nevertheless quite native-like:
Throwing out more than what you want in a negotiation doesn't take smarts.
Its used for casual diving off New Jersey but is currently at quite a depth.
Again, none of the words in these sentences appear difficult in isolation but the way they are used is. To sum up, our method successfully uncovers linguistically interesting usages distinctive of native speech that would slip through unnoticed by traditional word-based models.

Conclusion
In this paper, we proposed a novel task of native-like expression identification. We showed that even if no expression-level annotation is available, a powerful neural network combined with an explainability method can detect native-like expressions by contrasting sentences written by native and proficient L2 speakers. Our analysis on expressions detected by the proposed method suggests that even basic words may give some pause to L2 speakers if they are arranged in certain ways.
Although the proposed method in its current form is sufficiently useful for mining native-like expressions, further work is needed to mitigate the effect of domain-specific expressions. An interesting future direction is to create a user interface through which the system gets human feedback. Such direct signals are expected to make the detected expressions more closely match human intuitions (Rieger et al., 2019).


Appendix A

We prepared two sets of sentence-expression pairs for human evaluation. 1) We created a set of 10,000 sentence-expression pairs for which the top CD score was greater than 0.5. 2) We created another set of 10,000 sentence-expression pairs for which the top CD score was not necessarily greater than 0.5 (i.e., we selected random expressions with random scores).
To make the data more appropriate for human evaluation, we removed sentence-expression pairs in the following cases:
• when the sentence contained a profanity;
• when the sentence was longer than 30 tokens;
• when the expression contained a named entity;
• when the expression consisted of a single stopword 8 or a non-alphabetical character.
This left us with 10,877 sentences from the original 20,000. From these remaining sentences, we randomly selected 1,000 sentences.
Next, we asked native crowdworkers, hired on Amazon Mechanical Turk, to check the sentences. Workers residing in the six countries listed above were treated as native speakers of English. The task screen for native crowdworkers is shown in Figure 3. The instruction at the top was the same as that given to L2 workers, but this time we provided six options, roughly ranging from least natural to most natural.
As expected, a large majority of the expressions were familiar and natural to the native crowdworkers, as both the readers and the writers were native speakers. The workers chose the most-natural option in four out of five cases (3968 out of 5000 answers total). This indicates that most of the sentences and expressions in the selected subset were very familiar, natural-sounding, and easy to understand for native speakers of English.
Some of the sentences did receive bad scores, however. Because we are not interested in sentences that even native speakers cannot make sense of, we removed these from the subsequent evaluation. Specifically, we removed 26 sentences for which at least two workers out of five chose one of the first three options. We also removed several sentences for which fewer than five answers were given (due to issues with the crowdsourcing interface). As a result, 958 sentences remained.
As described in Section 5.3, we removed sentences for which none of the words in the target expression matched SVL12000. This reduced the number of sentences to 890. Applying the CD score threshold of 0.7, we chose 287 sentences as the final dataset.