Towards Understanding ASR Error Correction for Medical Conversations

Domain adaptation for Automatic Speech Recognition (ASR) error correction via machine translation is a useful technique for improving out-of-domain outputs of pre-trained ASR systems to obtain optimal results for specific in-domain tasks. We apply this technique on our dataset of doctor-patient conversations using two off-the-shelf ASR systems: Google ASR (commercial) and the ASPIRE model (open-source). We train a sequence-to-sequence machine translation model and evaluate it on seven specific UMLS Semantic types, including Pharmacological Substance, Sign or Symptom, and Diagnostic Procedure. Lastly, we break down, analyze, and discuss the 7% overall improvement in word error rate in view of each Semantic type.


Introduction
Off-the-shelf ASR systems like Google ASR are becoming increasingly popular due to their ease of use, accessibility, scalability and, most importantly, effectiveness. Trained on large datasets spanning different domains, these services provide accurate speech-to-text capabilities to companies and academics who might not have the option of training and maintaining a sophisticated, state-of-the-art in-house ASR system. However, for all the benefits these cloud-based systems provide, there is an evident need to improve their performance on in-domain data such as medical conversations. Approaching ASR error correction as a machine translation task has proven useful for domain adaptation, resulting in improvements in word error rate and BLEU score when evaluated on Google ASR output (Mani et al., 2020).
However, it is important to analyze and understand how domain-adapted transcripts may vary from ASR outputs. In this work, we investigate how adapting transcription to domain and context can help reduce such errors, especially with respect to medical words categorized under different Semantic types of the UMLS ontology.

Model | Transcript
Reference | you also have a pacemaker because you had sick sinus syndrome and it's under control
Google ASR | you also have a taste maker because you had sick sinus syndrome and it's under control
S2S | you also have a pacemaker because you had sick sinus syndrome and it's under control
Reference | like a heart disease uh atrial fibrillation
Google ASR | like a heart disease asian populations
S2S | like a heart disease atrial fibrillation

Table 1: Examples from the reference, Google ASR transcription, and corresponding S2S model output for two medical terms, "pacemaker" and "atrial fibrillation".

We approach this problem using two different types of metrics: 1) overall transcription quality, and 2) domain-specific medical information. For the first, we use standard speech metrics such as word error rate on the outputs of two ASR systems, namely the Google Cloud Speech API (commercial) and the ASPIRE model (open-source) (Peddinti et al., 2015). For the second type of evaluation, we use the UMLS ontology (Bodenreider, 2004) and analyze the S2S model output for a subset of semantic types in the ontology using a variety of performance metrics to build an understanding of the effect of the sequence-to-sequence transformation.
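The semantic-type evaluation can be sketched as a dictionary lookup over transcript tokens. The tiny term dictionary below is purely illustrative (a real system would query the UMLS Metathesaurus for each semantic type):

```python
# Sketch: count occurrences of terms from selected UMLS semantic types
# in a transcript. TERM_DICT is a hypothetical, hand-built stand-in for
# a real UMLS lookup.
from collections import Counter

TERM_DICT = {
    "Pharmacological Substance": {"aspirin", "metformin"},
    "Sign or Symptom": {"fatigue", "nausea"},
    "Diagnostic Procedure": {"echocardiogram", "biopsy"},
}

def count_semantic_terms(transcript: str) -> Counter:
    """Count how often terms of each semantic type appear in a transcript."""
    tokens = transcript.lower().split()
    counts = Counter()
    for sem_type, terms in TERM_DICT.items():
        counts[sem_type] += sum(1 for tok in tokens if tok in terms)
    return counts

counts = count_semantic_terms("patient takes aspirin daily and reports nausea")
# counts["Pharmacological Substance"] == 1, counts["Sign or Symptom"] == 1
```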

Related Work
While the need for ASR correction has become more and more prevalent in recent years with the successes of large-scale ASR systems, machine translation and domain adaptation for error correction are still relatively unexplored. In this paper, we build upon the work of Mani et al. (2020). The use of machine translation to improve automatic transcription was first explored by D'Haro and Banchs (2016), who applied it to a robot commands dataset and to human-human recordings of tourism queries. ASR error correction has also been performed with ontology-based learning (Anantaram et al., 2018): they investigate the effect of speaker accent and environmental conditions on the output of pre-trained ASR systems, and their proposed approach centers around bio-inspired artificial development. Shivakumar et al. (2019) explore noisy-clean phrase context modeling to correct ASR errors; they aim to recover errors that are unrecoverable due to system pruning in the acoustic, language, and pronunciation models by modeling ASR as a phrase-based noisy transformation channel and restoring longer contexts. Domain adaptation with off-the-shelf ASR has been tried for pure speech recognition tasks in high- and low-resource scenarios with various training strategies (Renals, 2014, 2015; Meng et al., 2017; Sun et al., 2017; Shinohara, 2016; Dalmia et al., 2018), but the goal of these models was to build better ASR systems that are robust to domain change. Domain adaptation of ASR transcription can also help improve the performance of domain-specific downstream tasks such as medication regimen extraction (Selvaraj and Konam, 2019).

Domain Adaptation for Error Correction
Using the reference texts and pre-trained ASR hypothesis, we have access to parallel data that is in-domain (reference text) and out-of-domain (hypothesis from ASR), both of which are transcriptions of the same speech signal. With this parallel data, we now frame the adaptation task as a translation problem.
Sequence-to-Sequence Models: Sequence-to-sequence (S2S) models (Sutskever et al., 2014) have been applied to various sequence learning tasks, including speech recognition and machine translation. An attention mechanism (Bahdanau et al., 2014) is used to align the input with the output sequences in these models. The encoder is a deep stacked Long Short-Term Memory network, and the decoder is a shallower uni-directional Gated Recurrent Unit acting as a language model for decoding the input sequence into either the transcription (ASR) or the translation (MT). Attention-based S2S models do not require alignment information between the source and target data, making them useful for both monotonic and non-monotonic sequence-mapping tasks. In our work, we map ASR output to the reference, a monotonic mapping task for which we use this model.
Data Alignment: To match the outputs of multiple ASR systems at the word level, specific alignment handling techniques are required. This is achieved using utterance-level timing information, i.e., the start and end time of an utterance, and obtaining the corresponding words in each ASR system's output transcript based on word-level timing information (the start and end time of each word). To ensure the same utterance ID is used across all ASR outputs and the ground truth, we first process our primary ASR output transcripts from the Google Cloud Speech API based on the ground truth and create random training, validation, and test splits. For each ground-truth utterance in these splits, we also generate corresponding utterances from ASPIRE output transcripts using the same process. This results in two datasets, corresponding to the Google Cloud Speech and ASPIRE ASR models, where utterance IDs are conserved across datasets. However, this does leave the ASPIRE dataset with fewer utterances, since we process Google ASR outputs first in an effort to maximize the size of our primary ASR model dataset.
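The utterance-level alignment described above can be sketched as follows, assuming each ASR system emits words with start and end timestamps (the field layout here is an illustrative assumption, not the actual pipeline code):

```python
# Sketch: given ground-truth utterance boundaries (start, end in seconds)
# and an ASR output as a list of timed words, recover the words belonging
# to each utterance so utterance IDs match across ASR systems.
def words_in_utterance(utt_start, utt_end, timed_words):
    """timed_words: list of (word, word_start, word_end) tuples."""
    return [w for (w, s, e) in timed_words if s >= utt_start and e <= utt_end]

def align_asr_to_utterances(utterances, timed_words):
    """utterances: dict utt_id -> (start, end). Returns utt_id -> word list."""
    return {
        utt_id: words_in_utterance(start, end, timed_words)
        for utt_id, (start, end) in utterances.items()
    }

utts = {"utt1": (0.0, 2.0), "utt2": (2.0, 4.5)}
words = [("you", 0.1, 0.4), ("also", 0.5, 0.9),
         ("have", 2.1, 2.4), ("a", 2.5, 2.6)]
aligned = align_asr_to_utterances(utts, words)
# aligned["utt1"] == ["you", "also"]; aligned["utt2"] == ["have", "a"]
```

Running the same routine over each ASR system's timed output with the same ground-truth boundaries yields parallel datasets with conserved utterance IDs.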
Pre-trained ASR: We use the Google Cloud Speech API for Google ASR transcription and the JHU ASPIRE model (Peddinti et al., 2015) as the two off-the-shelf ASR systems in this work. The Google Speech API is a commercial service that charges users per minute of speech transcribed, while the ASPIRE model is an open-source ASR model. We explore the trends we observe in both a commercial API and an open-source model.

Transcription Quality
We use WER and BLEU scores to evaluate the improvement of the S2S model over the ASR model outputs. A consistent gain is observed across all metrics, with a 7% absolute improvement in WER and a 4-point absolute improvement in BLEU score on Google ASR. While the Google ASR output could be stripped of punctuation for a fairer comparison, this is an extra post-processing step and breaks the direct output-modeling pipeline. If necessary, punctuation could likewise be inserted into the ASPIRE model output and the references.
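The WER metric used above can be sketched with a standard edit-distance computation over word tokens:

```python
# Sketch: word error rate = minimum edit distance (substitutions,
# insertions, deletions) between reference and hypothesis word
# sequences, normalized by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# "taste maker" vs. "pacemaker": 1 substitution + 1 insertion over a
# 5-word reference -> WER 0.4
score = wer("you also have a pacemaker", "you also have a taste maker")
```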

Qualitative Analysis
In Table 4, we compare S2S-adapted outputs with Google ASR for each semantic type, broken down by precision, recall, and F1 scores. The two outputs are also compared directly by counting utterances where the S2S model made the utterance better with respect to a semantic term (the term was present in the reference and the S2S output but not in the Google ASR output) and utterances where the S2S model made it worse (the term was present in the reference and the Google ASR output but not in the S2S output). We refer to this metric as semantic intersection in this work. As observed, the F1 scores are higher for S2S outputs for all semantic types in the ontology except one (BPOOC), where they tie. In terms of precision and recall too, S2S performs better for most categories. These numbers can be discussed in light of two underlying factors: how common or rare the semantic terms are on average for each semantic type, and how many training examples the model has seen for those terms. This is important to consider, as Google ASR learns on a much larger vocabulary spanning many different domains, whereas S2S is trained on a domain-specific dataset. For example, we see a large gain in precision for DP, which can be attributed to the rarity of the terms under this category, such as 'echocardiogram' and 'pacemaker'. It is also for this reason that we see only a slight improvement in precision for PS, even though it has the most training examples: many medication names are rare, but many others, like 'aspirin', are quite common nowadays even though they are domain-specific. This is further supported by the numbers observed for BPOOC, where terms like 'legs', 'heart' and 'lungs' are the three most frequently occurring words.
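The per-utterance semantic intersection comparison described above can be sketched as set logic over transcript tokens (a bag-of-words simplification of the actual term matching):

```python
# Sketch: for a given semantic term, an utterance counts as "made better"
# if the term appears in the reference and the S2S output but not in the
# Google ASR output, and "made worse" in the opposite case.
def semantic_intersection(reference, asr_output, s2s_output, term):
    ref = set(reference.split())
    asr = set(asr_output.split())
    s2s = set(s2s_output.split())
    better = term in ref and term in s2s and term not in asr
    worse = term in ref and term in asr and term not in s2s
    return better, worse

better, worse = semantic_intersection(
    "you also have a pacemaker",   # reference
    "you also have a taste maker", # Google ASR output
    "you also have a pacemaker",   # S2S output
    "pacemaker",
)
# better == True, worse == False
```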
The number of unique terms in the S2S output is lower than in the Google ASR output and the reference, as observed in Table 4. This might indicate that the S2S model is incorrectly modifying some medical terms in the Google ASR output that have few examples in the training set. However, our semantic intersection metric indicates an overall improvement in all categories except DP. We hypothesize this is largely due to a combination of how rare the words are and the low overall number of training examples for DP. When we calculate the semantic intersection on the full set, we get almost equal results for the S2S and Google ASR outputs, 0.5 and 0.6 respectively. When we look at the 5 most and 5 least frequent terms for each semantic type, almost all the terms overlap between S2S, Google ASR, and the reference, even though the number of unique terms may be lower for S2S. Overall, it is evident from the results that as the number of occurrences of a medical term increases, the S2S model's ability to identify and correct errors for that term improves rapidly, as shown in Table 2 and Table 4.
In a production environment, the S2S model may be confidently used for correcting ASR errors for the top K most frequently occurring medical terms, where the value of K must be decided based on the dataset available for training. Future extensions of this work will also look into the class imbalance problem for more robust performance on different semantic types.
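The top-K deployment rule suggested above could be sketched as a frequency gate that decides which terms the model is trusted to correct (the threshold K and the counts below are illustrative assumptions):

```python
# Sketch: only trust S2S corrections for medical terms that rank among
# the K most frequent terms observed in the training data; for all other
# terms, keep the original ASR token.
from collections import Counter

def trusted_terms(training_term_counts: Counter, k: int) -> set:
    """Return the K most frequently seen medical terms."""
    return {term for term, _ in training_term_counts.most_common(k)}

counts = Counter({"aspirin": 120, "pacemaker": 40, "echocardiogram": 3})
trusted = trusted_terms(counts, 2)
# trusted == {"aspirin", "pacemaker"}
```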

Conclusion
We present an analysis of how ASR error correction using machine translation impacts the different semantic types of the UMLS ontology for medical conversations. We run the S2S model on a dataset of doctor-patient conversations as a post-processing step to optimize the off-the-shelf Google ASR system. We use different input representations and compare the performance of our S2S model using WER and BLEU scores on Google ASR and ASPIRE outputs. We deep-dive into how our adaptation model affects medical WER for each semantic type, and break down the results using precision, recall, F1, and semantic intersection numbers between S2S and Google ASR. We establish the robustness of the S2S model's performance for more frequently occurring medical terms. In the future, we want to explore other representations, such as phonemes, which might capture ASR errors better, and address the class imbalance problem for rarer medical terms in different semantic types.