Automated Speech Recognition Technology for Dialogue Interaction with Non-Native Interlocutors

Dialogue interaction with remote interlocutors is a difficult application area for speech recognition technology because of the limited duration of acoustic context available for adaptation, the narrow-band and compressed signal encoding used in telecommunications, the high variability of spontaneous speech and the processing time constraints. It is even more difficult in the case of interaction with non-native speakers because of broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choices and a lower probability of producing a grammatically well-formed sentence. We present a comparative study of various approaches to speech recognition in a non-native context. Comparing systems in terms of their accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms the other available methods.


Introduction
Designing automatic speech recognition (ASR) and spoken language understanding (SLU) modules for spoken dialog systems (SDSs) poses more intricate challenges than standalone ASR systems, for many reasons. First, speech recognition latency is extremely important in a spoken dialog system for smooth operation and a good caller experience; one needs to ensure that recognition hypotheses are obtained in near real-time. Second, one needs to deal with the lack of (or minimal) context, since responses in dialogic situations can often be short and succinct. This also means that one might have to deal with minimal data for model adaptation. Third, these responses, being typically spontaneous in nature, often exhibit pauses, hesitations and other disfluencies. Fourth, dialogic applications might have to deal with audio bandwidth limitations that also have important implications for recognizer design. For instance, in telephonic speech, the bandwidth (300-3200 Hz) is narrower than that of high-fidelity audio recorded at 44.1 kHz. All these issues can drive up the word error rate (WER) of the ASR component. In a recent study comparing several popular ASRs, such as Kaldi (Povey et al., 2011), Pocketsphinx (Huggins-Daines et al., 2006) and cloud-based APIs from Apple, Google and AT&T, in terms of their suitability for use in SDSs, Morbini et al. (2013) found no particular consensus on the best ASR, but observed that the open-source Kaldi ASR performed competently in comparison with the other closed-source industry-based APIs. Moreover, in another recent study, Gaida et al. (2014) found that Kaldi significantly outperformed other open-source recognizers on recognition tasks on the German Verbmobil and English Wall Street Journal corpora. The Kaldi online ASR was also shown to outperform the Google ASR API when integrated into the Czech-based ALEX spoken dialog framework (Plátek and Jurčíček, 2014).
The aforementioned issues with automatic speech recognition in SDSs are only exacerbated in the case of non-native speakers. Not only do non-native speakers pause, hesitate and make false starts more often than native speakers of a language, but their speech is also characterized by broader allophonic variation, less canonical prosodic patterns, a higher rate of incomplete words, unusual word choices and a lower probability of producing grammatically well-formed sentences. An important application scenario for non-native dialogic speech recognition is conversation-based Computer-Assisted Language Learning (CALL) systems. For instance, Subarashii is an interactive dialog system for learning Japanese (Bernstein et al., 1999; Ehsani et al., 2000), whose ASR component was built using the HTK speech recognizer (Young et al., 1993) with both native and non-native acoustic models. In general, the performance of the system after SLU was good for in-domain utterances, but not for out-of-domain utterances. As other examples, a Robot Assisted Language Learning system (Dong-Hoon and Chung, 2004) and CALL applications for Korean-speaking learners of English (Lee et al., 2010) showed that acoustic models trained on the Wall Street Journal corpus, adapted with an additional 17 hours of Korean children's transcribed English speech, produced WERs as low as 22.8% across the multiple domains tested. In the present work, we investigate the online and offline performance of a Kaldi Large Vocabulary Continuous Speech Recognition (LVCSR) system in conjunction with the open-source and distributed HALEF spoken dialog system (Mehrez et al., 2013; Suendermann-Oeft et al., 2015).
The HALEF architecture includes, among other components, a web server running Apache Tomcat and a speech server, which consists of an MRCP server (Prylipko et al., 2011) in addition to text-to-speech (TTS) engines, Festival (Taylor et al., 1998) and Mary (Schröder and Trouvain, 2003), as well as support for the Sphinx-4 (Lamere et al., 2003) and Kaldi (Povey et al., 2011) ASRs. In contrast to Sphinx-4, which is tightly integrated into the speech server code base, the Kaldi-based ASR is installed on its own server, which communicates with the speech server via a TCP socket. The advantages of this design decision are (a) the ease of managing the computational resources required by Kaldi when operating in real-time mode (including the potential use of Graphical Processing Units (GPUs)), which could otherwise interfere with the other processes running on the speech server (audio streaming, TTS, Session Initiation Protocol (SIP) and Media Resource Control Protocol (MRCP) communication), and (b) the ability to test the very speech recognizer used in the live SDS in offline mode as well, for example in batch experiments. ASR configurations in live SDSs often differ from batch systems, which may result in different behaviour w.r.t. WER, latency, etc.
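The speech-server-to-recognizer link described above can be sketched as a simple TCP client/server exchange. The snippet below is a minimal illustration, not the actual HALEF or Kaldi wire protocol: the server name, port and message framing (stream raw audio, half-close the socket to mark end of utterance, read back one hypothesis) are all assumptions for demonstration purposes.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 5051  # hypothetical recognizer endpoint

def dummy_asr_server(ready):
    """Stand-in for the recognizer server: drains audio, returns a hypothesis."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((HOST, PORT))
    srv.listen(1)
    ready.set()                      # signal that the server is accepting
    conn, _ = srv.accept()
    while conn.recv(4096):           # read streamed audio until client half-closes
        pass
    conn.sendall(b"HELLO WORLD\n")   # send back the (dummy) final hypothesis
    conn.close()
    srv.close()

def stream_audio(chunks):
    """Stream audio chunks to the recognizer and return its hypothesis."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((HOST, PORT))
    for chunk in chunks:
        sock.sendall(chunk)
    sock.shutdown(socket.SHUT_WR)    # tell the server the utterance has ended
    hyp = b""
    while True:
        data = sock.recv(4096)
        if not data:
            break
        hyp += data
    sock.close()
    return hyp.decode().strip()
```

Decoupling the recognizer behind a socket like this is what allows the same process to serve both the live SDS and offline batch experiments.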

System description
In this paper, we will be focusing specifically on evaluating the performance of the Kaldi ASR system within HALEF (we have already covered the Sphinx version in the papers cited above). We generally follow Kaldi's WSJ standard model generation recipe with a few modifications to accommodate our training data. The most sophisticated acoustic models are obtained with speaker adaptive training (SAT) on the feature Maximum Likelihood Linear Regression (fMLLR)-adapted data.
We use about 780 hours of non-native English speech to train the acoustic model. The speaker population covers a diversity of native languages, geographical locations and age groups. In order to match the audio quality standard of the Public Switched Telephone Network (PSTN), we downsample our recordings to 8 kHz. The language model was estimated on the manual transcriptions of the same training corpus, consisting of ≈ 5.8 million tokens, and was finally represented as a trigram language model with ≈ 525 thousand trigrams and ≈ 605 thousand bigrams over a lexicon of ≈ 23 thousand words, which included entries for the most frequent partially produced words (e.g. ATTR-; ATTRA-; ATTRAC-; ATTRACT; ATTRACT-; ATTRACTABLE). Ultimately, the final decoding graph was compiled with approximately 5.5 million states and 14 million arcs.
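The n-gram statistics quoted above come down to counting token sequences over the training transcriptions. A minimal sketch of that counting step (the toy transcripts, including a partial word, are invented for illustration):

```python
from collections import Counter

def ngram_counts(transcripts, n):
    """Count n-grams over whitespace-tokenized transcript lines."""
    counts = Counter()
    for line in transcripts:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy transcripts; "attrac-" stands in for a partially produced word
# that gets its own lexicon entry, as described above.
transcripts = ["i want to attrac- attract attention",
               "i want to say something"]
bigrams = ngram_counts(transcripts, 2)
trigrams = ngram_counts(transcripts, 3)
lexicon = {tok for line in transcripts for tok in line.split()}
```

In a real recipe these counts would then be smoothed and pruned before being compiled into the decoding graph.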
The default Kaldi speech recognizer use case is oriented towards optimal performance in transcribing large amounts of pre-recorded speech. In these circumstances it is possible to perform several recognition passes and to estimate the adaptation transformation from a substantial body of spoken material. The highest performing Deep Neural Network (DNN) acoustic model ("nnet2" in Kaldi notation) requires a prior processing pass with the highest performing Gaussian Mixture Model (GMM, "tri4b" in Kaldi notation), which in turn requires a prior processing pass with the same GMM in speaker-independent mode.
However, in the dialogue environment, it is essential to be able to produce recognition results with the smallest possible latency and little adaptation material. That is the main reason for us to look for alternatives to the aforementioned approach. One such possibility is to use the DNN acoustic model with unadapted data and constrain its output via a speaker-dependent i-Vector (Dehak et al., 2011). This i-Vector contains information on the centroids of the speaker-dependent GMM. The i-Vector can be continuously re-estimated based on the acoustic evidence available up to the moment ("online" mode) or after presentation of the entire spoken content (the so-called "offline" mode).
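The online/offline distinction can be illustrated with a deliberately simplified stand-in for i-Vector re-estimation: an incrementally updated centroid over acoustic frames. This is not the actual i-Vector extraction (which involves a factor analysis over GMM statistics), only a sketch of how an "online" estimate converges to the "offline" one as more frames arrive.

```python
def online_centroid(frames):
    """Re-estimate the centroid after each incoming frame (a running mean),
    yielding the current 'online' estimate each time."""
    mean, n = None, 0
    for frame in frames:
        n += 1
        if mean is None:
            mean = list(frame)
        else:
            mean = [m + (x - m) / n for m, x in zip(mean, frame)]
        yield list(mean)

def offline_centroid(frames):
    """Estimate the centroid once from the entire utterance ('offline')."""
    return [sum(col) / len(frames) for col in zip(*frames)]
```

Early online estimates rest on very little evidence, which mirrors the accuracy loss at utterance beginnings reported in the experiments below.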

Experiments
The evaluation was performed using vocal productions obtained from language learners in the scope of a large-scale internet-based language assessment. Production length is the major distinction between this data and the data one may expect to find in the spoken dialogue domain. Each individual utterance is a quasi-spontaneous monologue elicited by a particular evaluation setup. The utterances were collected from six different test questions comprising two different speaking tasks: 1) providing an opinion based on personal experience and 2) summarizing or discussing material provided in a reading and/or listening passage. The longest utterances are expected to last up to a minute. The average speaking rate is about 2 words per second. Every speaker produces up to six such utterances. Speakers had a brief time to familiarize themselves with the task and prepare an approximate production plan. Although, strictly speaking, these productions differ from true dialogue behavior, they are suitable for the purpose of developing a dialogic speech recognition system.
The evaluation of the speech recognition system was performed using data obtained in the same fashion as the training material. Two sets are used: the development set (dev), containing 593 utterances (68329 tokens, 3575 singletons, 0% OOV rate) coming from 100 speakers, with the total amount of audio exceeding 9 hours; and the test set (test), which contains 599 utterances (68112 tokens, 3709 singletons, 0.18% OOV rate) coming from 100 speakers (also more than 9 hours of speech in total). We attempted to have an unbiased random speaker sampling, covering a broad range of native languages, English speaking proficiency levels, demographics, etc. However, no extensive effort has been spent to ensure that the frequencies of the stratified sub-populations follow their natural distribution. Comparative results are presented in Table 1.
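The OOV rates quoted for the dev and test sets are simply the share of evaluation tokens absent from the recognizer's lexicon. A minimal sketch of that computation (the toy lexicon and token list are invented):

```python
def oov_rate(tokens, lexicon):
    """Percentage of tokens not covered by the recognizer's lexicon."""
    if not tokens:
        return 0.0
    oov = sum(1 for t in tokens if t not in lexicon)
    return 100.0 * oov / len(tokens)

# Toy example: one of five tokens is out of vocabulary -> 20% OOV rate.
lexicon = {"the", "cat", "sat"}
tokens = ["the", "cat", "sat", "purred", "the"]
```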
As can be seen from Table 1, the "DNN i-Vector" method of speech recognition outperforms Kaldi's default "DNN fMLLR" setup. This can be explained by the higher variability of non-native speech: the reduced complexity of i-Vector speaker adaptation is a better match for the task we attempt to solve. There is only a very minor degradation in accuracy when the i-Vector support data is reduced from the whole interaction to a single utterance. As expected, the "online" scenario loses some accuracy to the "offline" one at the beginning of utterances, as we verified by analyzing multiple recognition results.
It is also important to note that the accuracy of the "DNN i-Vector" system compares favorably with human performance on the same task. In fact, experts achieve an average WER of about 15% (Zechner, 2009), while Turkers in a crowdsourcing environment perform significantly worse, at around 30% WER (Evanini et al., 2010). Our proposed system is therefore already approaching the level of broadly defined average human accuracy in the task of non-native speech transcription.
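For reference, the WER figures compared throughout this section are computed as the word-level edit distance between reference and hypothesis, normalized by the reference length. A standard dynamic-programming sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length, in %."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution,      # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

Production scoring tools additionally report the insertion/deletion/substitution breakdown, which is relevant for the error analysis planned in the conclusions.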
The "DNN i-Vector" ASR method vastly outperforms the baseline in terms of processing speed. Even with the large-vocabulary model, in a typical 10-second spoken turn we expect only about 3 seconds of ASR-specific processing latency. Indeed, to obtain the expected delay one should subtract the duration of the utterance from the total processing time, since the "online" recognizer commences speech processing at the moment speech starts. This 3-second delay approaches the natural inter-turn pause of 0.5-1.5 seconds. Better language modeling is expected to bring the xRT factor below one. The difference in the xRT factor between the "online" and "offline" modes can be explained by the somewhat lower quality of acoustic normalization in the "online" case: more hypotheses fit within the decoder's search beam and thus increase the processing time.
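The latency arithmetic above can be made explicit. For an online recognizer that starts working the moment speech begins, the wait perceived by the caller is the total processing time minus the utterance duration, which is equivalent to (xRT − 1) times the utterance duration:

```python
def xrt_factor(processing_time, audio_duration):
    """Real-time factor: total processing time over audio duration."""
    return processing_time / audio_duration

def perceived_latency(processing_time, audio_duration):
    """Extra wait after the speaker stops, assuming the 'online' recognizer
    starts processing the moment speech starts."""
    return processing_time - audio_duration
```

With the figures from the text, a 10-second turn processed in 13 seconds gives xRT = 1.3 and a perceived latency of 3 seconds; bringing xRT below 1 would make the perceived latency effectively vanish.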

Conclusions
The DNN i-Vector speech recognition method has proven sufficient for the task of supporting dialogue interaction with non-native speakers. With respect to our baseline systems we observe improvements in both accuracy and processing speed. The "online" mode of operation appears particularly attractive because it minimizes processing latency at the cost of a minor performance degradation. Indeed, the "online" recognizer can start processing simultaneously with the start of speech production. Thus, unlike in the "offline" case, the total perceived latency of the "online" recognizer is only (xRT − 1) times the utterance duration.
There are ways to improve our system by performing more targeted language modeling and, possibly, language model adaptation to the specific dialogue turn. Our further efforts will be directed towards reducing processing latency and increasing the system's robustness by incorporating interpretation feedback into the decoding process.
We plan to perform a comparative error analysis to get a better picture of how our automated system compares to average human performance. It is important to separately evaluate WERs for the content vs. functional word subgroups; to determine the balance between insertions, deletions and substitutions at the optimal operating point; and to compare humans and machines in their ability to recover from the context of a misrecognized word (e.g. a filler or false start).
We plan to collect actual spoken dialogue interactions to further refine our system through a crowdsourcing experiment in a language assessment task. Specifically, the ASR sub-system can benefit from sampling the elicited responses, measuring their apparent semantic uncertainty and tailoring the system's lexicon and language model to better handle the acoustic uncertainty of non-native speech.