Developing ASR for Indonesian-English Bilingual Language Teaching

Usage-based analyses of teacher corpora and code-switching (Boztepe, 2003) are an important next stage in understanding language acquisition. Multilingual corpora are difficult to compile, and a classroom setting adds pedagogy to the mix of factors which make this data so rich and so problematic to classify. Using quantitative methods to understand language learning and teaching is difficult work, as the ‘transcription bottleneck’ constrains the size of datasets. We found that using an automatic speech recognition (ASR) toolkit with a small set of training data is likely to speed up data collection in this context (Maxwell-Smith et al., 2020).

For this study we used approximately 150 minutes of data from a project recording a single teacher's speech in a second-year, tertiary Indonesian language program. Our methodological considerations addressed the following: which ASR tool to use, how to prepare training data for this tool, and how to best manage the bias of the training data inherent in all transcription processes.
We chose the Elpis ASR system, which combines user-friendly data processing scripts with a Kaldi HMM/GMM (Hidden Markov Model/Gaussian Mixture Model) recipe. Elpis generates transcripts as time-aligned ELAN files, which was a good fit with the broader project investigating Indonesian language teaching.
A team of transcribers established guidelines which reflexively responded to a range of methodological considerations. Indonesian diglossic variants exist in a highly diverse linguistic ecosystem (Djenar and Ewing, 2015; Sneddon, 2003; Goebel, 2010). This was highlighted by transcriber subjectivity in the teaching context. For example, the task of analysing teacher speech and choosing an orthography to transcribe it into over-simplified, binary L1 versus L2 categories (1st language: English, 2nd language: Indonesian) is influenced by transcriber expectations of language norms in 'high' vs. 'low' varieties of Indonesian. Further, the goal of modifying sociolinguistic norms, which is what brings people to language classrooms, precipitated a level of variance and unpredictability unusual in other speech contexts, as teachers respond to student acquisition processes. We also provided examples of the development of a Community of Practice (Wenger, 1998) as another layer of complexity in the group classroom environment.
The dataset was transcribed using several "tiers" to create parallel structures for storing data. While we worked predominantly from a code-switching paradigm, the data structure allowed us to train multiple models for comparative evaluation. We trained three models, two of which included all training data and multilingual pronunciation lexicons, resonating with work on translanguaging in educational settings (Garcia and Wei, 2014). The third model was trained with Indonesian data only. Our preliminary result of 64% word error rate (WER) is high in comparison to monolingual ASR systems (Maxwell-Smith et al., 2020). However, WERs reported for code-switched bilingual data (Biswas et al., 2019) were more similar to our WER, especially given our small amount of training data.
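For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a machine transcript, divided by the number of reference words. It is an illustration only, not the evaluation code used in this study; the function name `word_error_rate` and the example sentences are assumptions made for the sketch, and tokenisation is a plain whitespace split.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; tokens are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition a single word that the recogniser splits in two (for example a reference "saya makan" transcribed as "saya ma kan") counts as one substitution plus one insertion, so WER can reach or exceed 100% on short utterances.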
By analysing the text spans in the machine transcription, we found a high incidence of resyllabification (word splitting), particularly with omission of initial or middle consonants. The analysis identified which model produced less disruptive errors than the others, and which would be more responsive to the addition of further training data.
The application of ASR tools is limited in this setting given the small set of training data; however, using these tools has the potential to expedite the transcription of teacher corpora. These tools could change workflow and decrease cognitive load for human transcribers by generating a draft transcript for revision. We highlight some of the benefits and risks of using these emerging technologies to analyse the complex work of language teachers, and in education more generally.