Assessment of an Index for Measuring Pronunciation Difficulty

This study assesses an index for measuring the pronunciation difficulty of sentences (henceforth, pronounceability) based on the normalized edit distance from a reference sentence to a transcription of learners' pronunciation. Pronounceability should be examined when language teachers use a computer-assisted language learning system for pronunciation learning, so that the motivation of learners is maintained. However, unlike the evaluation of learners' pronunciation performance, pronounceability has not been a focus of previous research, either for English or for Asian languages. This study found that the normalized edit distance was reliable but not valid. The lack of validity appeared to stem from the English test used for determining the proficiency of learners.


Introduction
Research on computer-assisted language learning (CALL) has addressed the learning of pronunciation in European languages as a foreign language, such as English (Witt & Young 2002, Mak et al. 2004, Ai & Xu 2015, Liu & Hung 2016) and Swedish (Koniaris 2014). CALL research on Asian languages has considered Japanese as a foreign language (Hirata 2004) and Chinese as a foreign language (Zhao et al. 2012). The primary goal of CALL systems for the learning of foreign language pronunciation is to resolve interference from the first language of learners. For instance, a CALL system can analyze the speech in which a learner reads English sentences aloud and present the pronunciation errors, so that the learner can read the sentences aloud again and reduce the errors.
Even though the methods of evaluating learners' pronunciation performance have received considerable attention in previous research, the pronunciation difficulty of sentences (henceforth, pronounceability) has not been examined extensively. Given that readability and the difficulty of listening influence learners' motivation and outcomes (Hwang 2005, Lai 2015, Yoon et al. 2016), we consider that CALL for pronunciation learning should take pronounceability into account when evaluating learners' pronunciation. Pronounceability can be represented as the phonetic edit distance from the reference pronunciation to the pronunciation expected of a learner at a given proficiency. Phonetic edit distance can be measured using a modified version of the Levenshtein edit distance (Wieling et al. 2014) or a deep-neural-network-based classifier (Li et al. 2016).
This study measured normalized edit distance (NED) using the orthographical transcription of learners' pronunciation of reference sentences. An advantage of the NED based on orthographic transcription is the availability of data. This is because language teachers can obtain orthographical transcription without being trained for phonetic transcription.
This study measures pronounceability using multiple regression analysis, with orthographic NED as the dependent variable and the features of a sentence and a learner as independent variables. First, a corpus for multiple regression analysis is developed. This corpus includes the data for NED and the proficiency data on the score-based scale of the Test of English for International Communication (TOEIC). TOEIC is a widely used English test in Asian countries, and its test score ranges from 10 to 990. In previous research (Grahma et al. 2015, Delais-Roussarie 2015, Gósy et al. 2015), proficiency was expressed on a point scale such as the Common European Framework of Reference for Languages (six levels from A1 to C2).
This study assessed our phonetic learner corpus data by answering the following research questions:
1. How stable is NED as a pronounceability index?
2. To what extent does NED classify learners depending on their proficiency?
3. How strongly does NED correlate with a learner's proficiency?
4. How accurately is NED measurable based on linguistic and learner features for pronounceability measurement?
2 Compilation of Phonetic Learner Corpus

Collection of Pronunciation Data
Our phonetic learner corpus was compiled by recording pronunciation data for English texts that learners read aloud sentence by sentence. In addition, after reading a sentence aloud, learners subjectively rated the pronounceability of the sentence on a five-point Likert scale (1: easy; 2: somewhat easy; 3: average; 4: somewhat difficult; 5: difficult) (henceforth, SBJ). The texts for reading aloud (Text I is "The North Wind and the Sun" and Text II is "The Boy who Cried Wolf") were selected from the texts distributed by the International Phonetic Association (International Phonetic Association 1999). Even though these texts contain only 15 sentences, they cover the basic sounds of English (International Phonetic Association 1999, Deterding 2006). This enables us to analyze which types of English sounds influence learners' pronunciation. Deterding (2006) reported that Text I failed to cover certain sounds, such as initial and medial /z/ and syllable-initial /θ/, and then developed material that covered the English pronunciation for these sounds by rewriting a well-known fable by Aesop (Text II).
The corpus data were compiled from 50 learners of English as a foreign language at university (28 males, 22 females; mean age: 20.8 years (standard deviation, SD, 1.3)). The learners were compensated for their participation. In our sample, the mean TOEIC score was 607.7 (SD, 186.2). The minimum and maximum scores were 295 and 900, respectively.

Annotation of Pronunciation Data
Our phonetic learner corpus includes NED, the linguistic features of sentences, and learner features.
NED was derived as the Levenshtein edit distance normalized by sentence length. It reflected the differences from the reference sentences to the transcription of learners' pronunciation due to the substitution, deletion, or insertion of letters. Before measuring the edit distance, symbols such as commas and periods were deleted and expressions were uncapitalized in the transcription and reference data.
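As an illustration of this derivation, the following minimal Python sketch computes a character-level NED after the preprocessing described above. It assumes that "sentence length" means the number of characters in the preprocessed reference sentence, which the text does not state explicitly. The function names normalize, levenshtein, and ned are hypothetical and chosen only for this example.

```python
import string


def normalize(text):
    """Lowercase the text and delete ASCII punctuation such as commas and periods."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def levenshtein(a, b):
    """Character-level edit distance; substitution, deletion, and insertion each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def ned(reference, transcription):
    """Edit distance divided by the reference length (an assumed normalization)."""
    ref = normalize(reference)
    hyp = normalize(transcription)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return levenshtein(ref, hyp) / len(ref)
```

For example, ned("The wind blew.", "the wind brew") counts one substituted letter against a 13-character reference.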
The pronunciation was manually transcribed by a transcriber who was a native speaker of English and trained to replicate interviews and meetings but was unaccustomed to the English spoken by learners. The transcriber examined the texts before starting the transcription task. The transcriber was required to replicate learners' pronunciation without adding, deleting, and substituting any expressions for improving grammaticality and/or acceptability (except the addition of symbols such as commas and periods).
Linguistic features were automatically derived from a sentence as follows: Sentence length was derived as the number of words in a sentence. Word length was derived as the number of syllables in a word. The number of multiple-syllable words in a sentence was derived by calculating $\sum_{i=1}^{n} (S_i - 1)$, where $n$ was the number of words in a sentence and $S_i$ was the number of syllables in the $i$-th word (Fang 1966). This derivation excludes single-syllable words, since each of them contributes zero to the sum. Word difficulty was derived as the rate of words not listed in a basic vocabulary list (Kiyokawa 1990) relative to the total number of words in a sentence.
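The feature derivation above can be sketched in Python as follows. This is only an illustration under several assumptions: count_syllables is a rough vowel-group heuristic and not the syllable counter used in this study, basic_vocab is a hypothetical stand-in for the basic vocabulary list (Kiyokawa 1990), and word length is aggregated per sentence as a mean, which the text does not specify.

```python
import re


def count_syllables(word):
    """Rough heuristic (assumption): one syllable per run of vowel letters, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def sentence_features(sentence, basic_vocab):
    """Return the four linguistic features for one sentence as a dictionary."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        raise ValueError("sentence must contain at least one word")
    syllables = [count_syllables(w) for w in words]
    return {
        # number of words in the sentence
        "sentence_length": len(words),
        # mean syllables per word (the per-sentence aggregation is an assumption)
        "mean_word_length": sum(syllables) / len(words),
        # sum of (S_i - 1): single-syllable words contribute zero
        "multi_syllable": sum(s - 1 for s in syllables),
        # share of words that are not in the basic vocabulary list
        "word_difficulty": sum(w not in basic_vocab for w in words) / len(words),
    }
```

A call such as sentence_features("The north wind blew", {"the", "wind"}) returns one value per feature. A dictionary-based syllable counter would be more accurate than the heuristic used here.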

Properties of Phonetic Learner Corpus
Our phonetic learner corpus was compiled using the method described in Section 2, and this corpus included 750 instances (15 sentences read aloud by 50 learners). Table 2 shows the descriptive statistics for NED and SBJ in the phonetic learner corpus.
The relative frequency distributions of NED and SBJ, in which NED was classified into five levels based on SBJ, are shown in Figure 1. The distributions are dissimilar, as the peak of NED appears at pronounceability level 2 ("somewhat easy") while that of SBJ appears at pronounceability level 3 ("average"). If NED appropriately accounts for learners' pronounceability, learners appear to overvalue pronounceability. On the contrary, if NED fails to explain pronounceability, learners appear to undervalue pronounceability. This discrepancy suggests a direction for the improvement of NED.

Assessment of NED as a Pronounceability Index
In Sections 4.1, 4.2, and 4.3, research questions 1-3 are assessed using classical test theory (Brown 1996). The fourth question is answered in Section 4.4.

Reliability of NED
The reliability of NED was examined through internal consistency in terms of Cronbach's α (Cronbach 1970). Internal consistency refers to whether NED demonstrates similar results for sentences with similar pronounceability. Cronbach's α is a reliability coefficient defined by the following equation: $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)$, where $k$ is the number of items (sentences in this study), $\sigma_i^2$ is the variance associated with item $i$, and $\sigma_X^2$ is the variance associated with the sum of all $k$ item values. Cronbach's α reliability coefficient ranges from 0 (absence of reliability) to 1 (absolute reliability), and empirical satisfaction is achieved with values above 0.8.
As reliability depends on the number of items, the reliability coefficients were derived individually for each text (Text I containing 5 sentences and Text II containing 10 sentences) and jointly for both texts. The reliability coefficients of NED and SBJ are shown in Table 3.
The reliability coefficient of NED exceeded the value required for empirical satisfaction (α = 0.8) in Text II and Texts I & II. Hence, NED is partially reliable as a pronounceability index. However, NED demonstrated lower reliability compared to SBJ. This suggests that NED should be improved through modification.
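For reference, Cronbach's α as used in this section can be computed with a few lines of Python. The sketch below assumes a learner-by-sentence table of values (NED or SBJ); the function name cronbach_alpha and the data layout are hypothetical, and the snippet does not reproduce the values in Table 3.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores[p][i] is the value for learner p on item (sentence) i."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # variance of each item across learners
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # variance of each learner's total over all k items
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("the summed scores have zero variance")
    # Sample variances are used for both terms; the ratio is the same as
    # with population variances because both carry the same scaling factor.
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Computing the coefficient separately for the rows restricted to Text I, to Text II, and to both texts mirrors the three conditions examined here.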

Construct Validity of NED
Construct validity was examined from the viewpoint of distinctiveness. If NED appropriately reflects learners' proficiency, NED should demonstrate a statistically significant difference (p < 0.01) among learners at different proficiency levels. Our phonetic learner corpus data were classified into three levels based on TOEIC scores: below 490 (beginner level, n = 240), from 490 to below 730 (intermediate level, n = 240), and 730 or above (advanced level, n = 270). Table 4 shows the mean (SD) values of NED and SBJ for the three levels. The distinctiveness of NED was investigated using one-way ANOVA. ANOVA showed statistically significant differences between the three levels of learners for SBJ (F (2, 747) = 10.13, p < 0.01) but not for NED (F (2, 747) = 0.55, p > 0.01). NED therefore failed to demonstrate construct validity with respect to TOEIC-based proficiency.
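The grouping and the F statistic described above can be illustrated with the following Python sketch. It is a plain implementation of one-way ANOVA under the stated thresholds, not the software used in this study; the names toeic_level and one_way_anova_f are hypothetical, and the p-value, which requires the F distribution, is omitted.

```python
def toeic_level(score):
    """Map a TOEIC score to the three proficiency levels used in this section."""
    if score < 490:
        return "beginner"
    if score < 730:
        return "intermediate"
    return "advanced"


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # variation of the values around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

With 750 instances in three groups, the degrees of freedom are 2 and 747, which matches the F statistics reported above.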

Criterion-related Validity of NED
Criterion-related validity was examined from the viewpoint of the correlation with learners' proficiency in terms of TOEIC scores. NED should reflect learners' proficiency because pronounceability should depend on learners' proficiency. Then, the correlation between NED and TOEIC scores and between SBJ and TOEIC scores was examined.
NED exhibited weaker correlation with TOEIC scores (r = -0.04) compared to SBJ (r = -0.20). Owing to this, NED failed to demonstrate criterion-related validity depending on TOEIC-based proficiency.
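The correlation coefficient r used here is Pearson's product-moment correlation. A minimal Python sketch is given below for readers who wish to recompute it from paired values (for instance, NED and TOEIC score per instance); the function name pearson_r is hypothetical and the pairing of values is an assumption about how the corpus is laid out.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)
```

A negative r, as observed for both NED and SBJ, means that higher TOEIC scores tend to accompany lower values of the index.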

Pronounceability Measurement
Pronounceability was measured through multiple regression analysis. NED was the dependent variable, and the linguistic and learner features described in Section 2 were the independent variables. However, the number of multiple-syllable words was excluded because its variance inflation factor was high (VIF = 12.3), which indicates multicollinearity (Kutner et al. 2002). A significant regression equation was found (F (4, 745) = 124.15, p < 0.01) with an adjusted coefficient of determination ($R^2$) of 0.40, which indicates that the equation explained approximately 40% of the variance in NED.
The contribution of linguistic and learner features can be observed using standardized partial regression coefficients; the contribution increases with the absolute value of the coefficients. The standardized partial regression coefficients are summarized in Table 5. A significant contribution is observed for word difficulty but not for the other features. This result contradicts the findings of previous research, which reported significant contributions of sentence length and word length in other modes such as readability (Crossley et al. 2017) and listening difficulty (Messerklinger 2006).
The pronounceability measurement method was examined n times (n = 750) using a leave-one-out cross-validation test, considering one instance as test data and n - 1 instances as training data. The measured NED exhibited moderate correlation with the observed NED (r = 0.63). Thus, the regression-based measurement of NED demonstrated a low coefficient of determination and low predictability.
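The leave-one-out procedure can be sketched in Python as follows. This is an illustrative implementation that fits ordinary least squares through the normal equations with Gaussian elimination, which is a substitute for whatever statistical software was used in this study; the names solve, fit_ols, and loocv_predictions are hypothetical, and rows stands for the feature values of each instance.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def loocv_predictions(rows, y):
    """Predict each instance from a model trained on the other n - 1 instances."""
    preds = []
    for i in range(len(y)):
        coef = fit_ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        preds.append(coef[0] + sum(c * v for c, v in zip(coef[1:], rows[i])))
    return preds
```

Correlating the returned predictions with the observed NED values gives the kind of r reported above. The sketch assumes that the features are not perfectly collinear in any training fold; otherwise the elimination step divides by zero.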

Conclusion
This study assessed whether NED appropriately demonstrated pronounceability for learning the pronunciation of English as a foreign language. The assessment suggests that NED is reliable (Section 4.1) but not valid (Sections 4.2 and 4.3). The results of pronounceability measurement (Section 4.4) suggest that NED was appropriately explained by the word difficulty.
In the future, we will work on improving pronounceability measurement in English based on NED and investigate pronounceability measurement for Asian languages as foreign languages.