Multitask Learning for Adaptive Quality Estimation of Automatically Transcribed Utterances

We investigate the problem of predicting the quality of automatic speech recognition (ASR) output under the following rigid constraints: i) reference transcriptions are not available, ii) confidence information about the system that produced the transcriptions is not accessible, and iii) training and test data come from multiple domains. To cope with these constraints (typical of the constantly increasing amount of automatic transcriptions that can be found on the Web), we propose a domain-adaptive approach based on multitask learning. Different algorithms and strategies are evaluated with English data coming from four domains, showing that the proposed approach can cope with the limitations of previously proposed single task learning methods.


Introduction
The variety of applications for large vocabulary continuous speech recognition (LVCSR) technology is rapidly growing. For instance, automatic transcriptions are now used, either as-is or as rough material to be checked and corrected by humans, for captioning and subtitling DVD movies, YouTube videos, TV programs and recordings in noisy environments such as meetings and teleconferences. To enable further integration in these and other scenarios, the improvement of the core automatic speech recognition (ASR) technology should go hand in hand with the development of evaluation methods adequate to address new needs and constraints. Indeed, the standard evaluation protocol, based on computing the word error rate (WER) of transcription hypotheses against reference transcripts, is not always a viable solution.
In terms of needs, the aforementioned applications call for efficient and replicable evaluation methods suitable for real-time processing. While the availability of manually created reference transcripts is a core ingredient for system development, tuning and lab testing, their use for on-field evaluation (i.e. during actual use) is impractical because a quick response is needed.
In terms of constraints, the problem is that ASR technology is often used as a black box, that is, without any knowledge of how the transcriptions are generated. This calls for techniques capable of estimating ASR output quality under the rigid constraint of having, as a basic source of information, only the spoken utterance (the acoustic signal) and the transcription itself. Indeed, the invaluable information provided by current confidence estimation methods (e.g. word posterior probabilities (Evermann and Woodland, 2000; Wessel et al., 2001), consensus decoding (Mangu et al., 2000) and minimum Bayes-risk decoding (Xu et al., 2010)) is not accessible when evaluating the output of an unknown system.
To cope with these issues, previous work proposed a reference-free ASR quality estimation (QE) method capable of operating both in a glass-box fashion (i.e. having access to confidence information) and in a black-box fashion (i.e. without any knowledge about the ASR system's inner workings). According to the authors, despite the promising evaluation results, the supervised learning approach adopted has one main limitation: performance degrades when models are trained on non-homogeneous data coming from different domains, speakers, or systems. However, although empirical evidence of this limitation is provided, the robustness of ASR QE systems to the heterogeneity of training and test data is left as an open issue.
Filling this gap, which is the goal of this paper, would be a significant step towards real-time ASR output evaluation, and its seamless integration in a number of application frameworks. Along this direction, we propose and evaluate a supervised domain adaptation technique based on multitask learning (Caruana, 1997). Our approach aims to exploit training data coming from different "domains" (in a broad sense, e.g. different genres, speakers, topics, styles, etc.) and to obtain ASR QE models that are robust to differences with respect to the test data. Experiments are carried out with English data coming from four domains, and by comparing different algorithms and strategies.
Overall, our contributions can be summarized as follows:
• Multitask learning (MTL) is investigated for the first time in the ASR QE scenario, as a way to cope with the dissimilarity between training and test data coming from multiple domains.
• The QE problem is approached both as a regression (assignment of real-valued quality labels) and as a binary classification task (assignment of 'good'/'bad' labels according to a given, arbitrary WER threshold). The latter task is introduced as a preliminary study.
• Results are thoroughly analyzed, considering both the amount of training data coming from the different domains and the relative distance between their distributions.

Related Work
In the ASR field, most prior works that address the reference-free estimation of output quality fall into the confidence estimation (CE) framework. In this framework, the reliability of a transcription is estimated from the system's standpoint, that is, as a function of the process that generated the transcription (Sukkar and Lee, 1996; Evermann and Woodland, 2000; Wessel et al., 2001; Sanchis et al., 2012; Seigel, 2013, inter alia). In CE, the information available to the estimator covers all the aspects of the decoding process (e.g. word posterior probabilities, n-best lists, hypotheses density, language model scores). Although related to our problem, CE hence builds on a strong assumption (i.e. the ASR system is known), which does not hold in many situations. Quality estimation, instead, operates in the least favorable condition in which, besides the lack of references, the ASR system is regarded as a "black box". To our knowledge, the earlier study on reference-free ASR QE is the most relevant related work along this direction. In their investigation, the authors run a set of experiments aimed at predicting the WER of automatically transcribed utterances in different testing conditions (by varying the distance between training and test data), with different state-of-the-art learning algorithms (all for regression), and with different groups of features (the so-called "black-box" and "glass-box" feature groups). The major problem emphasized in their analysis is the strong dependency between QE models and the degree of homogeneity of training and test data. From the application perspective, this is a severe limitation since (as in any other supervised learning setting) the similarity of training and test sets is a strong requirement that should be bypassed (possibly with minimal loss in performance). This issue, which has not been addressed yet, is the starting point of our investigation.
Another aspect that so far has been disregarded concerns the type of estimates that a model should return. Indeed, while ASR QE has been explored as a regression task (i.e. aiming to return real-valued quality estimates), nothing has been done to approach it as a classification problem (i.e. assigning quality estimates chosen from two or more classes). In classification mode, we return explicit good/bad labels based on a fixed, application-dependent quality criterion defined a priori (a threshold set on training data). Since the way to present the quality estimates can have interesting effects on their practical use, the impact of the aforementioned learning problem on a supervised classification setting is another aspect that deserves investigation and motivates our work.

Multitask Learning for Adaptive ASR Quality Estimation
The problem of dealing with different distributions between training and test data is broadly investigated by the machine learning community. In particular, approaches for dealing with domain drift are proposed within the scope of transfer learning, whose aim is to exploit knowledge from one or more source tasks (henceforth, we use the terms domain and task interchangeably) and apply it to a target task (Pan and Yang, 2010). In this paper we use a transfer learning technique called multitask learning (MTL), which exploits domain-specific training signals of related tasks to improve model generalization (Caruana, 1997). MTL is an inductive transfer method that assumes that the tasks are related and share a certain structure that allows knowledge transfer. In early works, for instance, these shared structures are the hidden layers of a neural network (Caruana, 1997). That work showed that MTL improves over learning each task in isolation (called single task learning, STL henceforth) for different problems. Several approaches to MTL have been proposed and each makes different assumptions about the structure shared among the tasks. In this work we explore three different MTL algorithms that deal with task relatedness in different ways.
Before defining each of the three approaches, we introduce some basic notation used in previous work. In MTL there are $K \in \mathbb{N}$ tasks, and each task $k \in [1, K]$ has $m_k$ training instances $\{(x_{1,(k)}, y_{1,(k)}), \ldots, (x_{m_k,(k)}, y_{m_k,(k)})\}$ with $x_{i,(k)} \in \mathbb{R}^d$, where $d$ is the number of features and $y_{i,(k)} \in \mathbb{R}$ is the output (the response variable or label). For each task, the input features and labels form two different matrices $X^{(k)} = [x_{1,(k)}, \ldots, x_{m_k,(k)}]$ and $Y^{(k)} = [y_{1,(k)}, \ldots, y_{m_k,(k)}]$, respectively. The weights of the features for all tasks are represented by the matrix $W$, where each column corresponds to a task and each row corresponds to a feature; $W_k$ denotes the column of task $k$. The function $L(W, X, Y)$ is the loss function defined for each algorithm. We work with two loss functions:
• Least squares (for regression), defined as $\sum_{k=1}^{K} \| W_k^{\top} X^{(k)} - Y^{(k)} \|_2^2$.
• Logistic loss (for classification), which for labels $y_{i,(k)} \in \{-1, +1\}$ takes the standard form $\sum_{k=1}^{K} \sum_{i=1}^{m_k} \log\left(1 + \exp\left(-y_{i,(k)} W_k^{\top} x_{i,(k)}\right)\right)$.
MTL Lasso. This algorithm extends the idea of the Lasso (Tibshirani, 1996) to the MTL setting. In MTL Lasso the 1-norm (the sum of the absolute values of the weights, given by $\|w\|_1 = \sum_i |w_i|$) is applied to all the tasks at once (the $\|W\|_1$ component in Eq. 1):
$$\min_{W} \; L(W, X, Y) + \lambda \|W\|_1 \tag{1}$$
The $\lambda \in [0, 1]$ parameter controls the level of regularization applied to the model. In other words, the sparsity of the learned model is controlled via $\lambda$, which weights the 1-norm across all tasks.
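As a minimal sketch of the notation above, the following pure-Python snippet evaluates the least-squares loss summed over tasks and adds the 1-norm penalty of MTL Lasso. The function names and the toy numbers are illustrative assumptions, not part of the original experiments; instances are stored one per row for readability, and no optimization is performed.

```python
def predict(w, x):
    """Linear prediction w^T x for a single instance."""
    return sum(wi * xi for wi, xi in zip(w, x))

def least_squares_loss(W_cols, X_tasks, Y_tasks):
    """Sum over tasks of the squared residuals.

    W_cols[k]  : weight vector of task k (the k-th column of W)
    X_tasks[k] : list of feature vectors of task k
    Y_tasks[k] : list of real-valued labels (e.g. WER) of task k
    """
    total = 0.0
    for w, X, Y in zip(W_cols, X_tasks, Y_tasks):
        total += sum((predict(w, x) - y) ** 2 for x, y in zip(X, Y))
    return total

def l1_norm(W_cols):
    """Sum of the absolute values of all the entries of W."""
    return sum(abs(v) for w in W_cols for v in w)

def mtl_lasso_objective(W_cols, X_tasks, Y_tasks, lam):
    """Loss plus lambda times the 1-norm of W (the quantity minimized in Eq. 1)."""
    return least_squares_loss(W_cols, X_tasks, Y_tasks) + lam * l1_norm(W_cols)

# Hypothetical toy problem: 2 tasks, 2 features.
W_cols = [[0.5, 0.0], [0.0, 0.2]]
X_tasks = [[[1.0, 2.0], [2.0, 0.0]], [[0.0, 1.0]]]
Y_tasks = [[0.5, 0.0], [0.2]]
objective = mtl_lasso_objective(W_cols, X_tasks, Y_tasks, lam=0.5)
```

Increasing `lam` makes non-zero weights more expensive, which is what pushes individual entries of W to exactly zero during optimization.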
MTL L21. This algorithm (Argyriou et al., 2007) learns a low-dimensional representation of the features across tasks, and induces sparsity on the feature weights for all the tasks at the same time. This is achieved through the use of a group regularizer that penalizes the weights matrix $W$ with the 2,1-norm (Eq. 2):
$$\min_{W} \; L(W, X, Y) + \lambda \|W\|_{2,1} \tag{2}$$
This norm is defined as $\|W\|_{2,1} = \sum_{i=1}^{d} \|W_i\|_2$, where $d$ is the number of features and $W_i$ is the i-th row of $W$. It is obtained by first computing the 2-norm of each row in $W$ (the features) and then computing the 1-norm over the resulting vector. The 2-norm of a vector is given by $\|w\|_2 = \sqrt{\sum_i w_i^2}$. The parameter $\lambda \in [0, 1]$ controls the regularization applied to the model. MTL L21 assumes that all tasks share the same feature representation.
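To make the row-wise grouping concrete, here is a small sketch of the 2,1-norm computation (2-norm of each feature row, then the sum of the row norms). The function name and the toy matrix are assumptions made for illustration.

```python
import math

def l21_norm(W_rows):
    """2,1-norm of W, where W_rows[i] holds the weights of feature i for all tasks."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W_rows)

# Hypothetical weights: 2 features (rows) and 2 tasks (columns).
W_rows = [
    [3.0, 4.0],  # feature used by both tasks: row norm 5
    [0.0, 0.0],  # feature discarded by every task: row norm 0
]
penalty = l21_norm(W_rows)
```

Because the penalty is paid per row, a feature is cheapest when its whole row is zero, which is why this regularizer selects or discards a feature jointly for all tasks.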
Robust MTL (RMTL). Unlike the previous two algorithms, this algorithm does not assume that all the tasks share the same feature representation. Moreover, RMTL uses two different structures: one for grouping related tasks to share knowledge; the other for identifying irrelevant tasks and keeping them in a different group that does not share information with the first one. This is to cope with situations in which, since tasks are not related, negative transfer of information across tasks might occur, thus harming the generalization of the model. The algorithm approximates task relatedness via a low-rank structure and identifies outlier tasks using a group-sparse structure (column-sparse, at task level). RMTL minimizes the expression in Eq. 3. It employs a non-negative linear combination of the trace norm (the task relatedness component L) and a column-sparse structure induced by the 1,2-norm (the outlier task detection component S). If a task is an outlier it will have non-zero entries in S.
$$\min_{W} \; L(W, X, Y) + \lambda_l \|L\|_{*} + \lambda_s \|S\|_{1,2} \quad \text{subject to } W = L + S \tag{3}$$
In Eq. 3, W is subject to W = L + S, where $\|\cdot\|_{*}$ is the trace norm, given by the sum of the singular values $\sigma_i$ of its argument (here L), and $\|S\|_{1,2}$ is the group regularizer that induces sparsity on the tasks. It is obtained by first computing the 2-norm of each column of S (one column per task) and then summing the resulting values (i.e. taking the 1-norm of the resulting vector). The $\lambda_l$ and $\lambda_s$ parameters control the level of regularization of L and S, respectively.
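The two regularizers of RMTL can be illustrated with a short NumPy sketch. It assumes the column-sparse reading of the 1,2-norm (2-norm of each task column, then summed); the helper names, the toy matrices and the regularization values are hypothetical, and the snippet only evaluates the penalty rather than solving the optimization problem.

```python
import numpy as np

def trace_norm(L):
    """Sum of the singular values of L (low-rank, task-relatedness term)."""
    return float(np.linalg.svd(L, compute_uv=False).sum())

def l12_norm(S):
    """Sum over tasks (columns) of the 2-norm of each column (outlier term)."""
    return float(np.linalg.norm(S, axis=0).sum())

def rmtl_penalty(L, S, lam_l, lam_s):
    """Non-negative combination of the two regularizers."""
    return lam_l * trace_norm(L) + lam_s * l12_norm(S)

def outlier_tasks(S, tol=1e-12):
    """Indices of the tasks whose column in S is non-zero."""
    return [k for k, n in enumerate(np.linalg.norm(S, axis=0)) if n > tol]

# Hypothetical decomposition: 2 features (rows), 3 tasks (columns).
L = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])  # rank 1: tasks 0 and 1 share one direction
S = np.array([[0.0, 0.0, 3.0],
              [0.0, 0.0, 4.0]])  # only task 2 has its own weights
W = L + S
penalty = rmtl_penalty(L, S, lam_l=0.5, lam_s=0.1)
outliers = outlier_tasks(S)
```

In this toy case tasks 0 and 1 are explained by the shared low-rank part, while task 2 is flagged as an outlier because its column in S is non-zero.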
All the MTL algorithms presented in this section are linear, with different regularization terms. While RMTL is only defined for regression, the other algorithms are defined for both regression and classification.
The MTL algorithms described above are compared with the STL baseline, both in regression and in binary classification. Given a set of (signal, transcription, WER) tuples as training instances, our task is to label new unseen (signal, transcription) test pairs with a WER prediction (regression models) or with a good/bad tag (classification models) depending on the quality of the transcription.
In classification, the class boundary is defined a priori, according to an arbitrary threshold τ set on the WER of the instances: those with WER ≤ τ are considered positive examples, while the others are considered negative examples. Different thresholds can be set to experiment with testing conditions that reflect a variety of application-oriented requirements. We work at one extreme, in which a value of τ close to zero (0.05) emphasizes systems' ability to precisely identify high-quality transcriptions (those with WER ≤ τ). Any application that requires precise judgments to isolate high-quality ASR output can potentially benefit from such optimization (e.g. data selection for acoustic modeling using a QE-based active learning model). The investigation of other thresholding schemes, however, is certainly an aspect that we want to explore in the future.
The small value of τ selected produces a skewed distribution of classes, with a ratio of good to bad labels across the four domains of about 25% "good" and 75% "bad". To cope with this issue, we use a sample weighting technique while training the classification models (Veropoulos et al., 1999). We assign a weight w to each of the training instances, computed as the inverse of its class frequency in the training set. In other words, w is obtained by dividing the total number of training samples by the number of training samples belonging to the class of the given utterance.
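The labeling and weighting scheme just described can be sketched as follows. The function names and the WER values are illustrative assumptions; only the threshold τ = 0.05 and the inverse class frequency rule come from the text.

```python
from collections import Counter

def label_by_threshold(wers, tau=0.05):
    """Tag an utterance as 'good' when its WER does not exceed tau."""
    return ["good" if wer <= tau else "bad" for wer in wers]

def inverse_frequency_weights(labels):
    """Weight = total number of samples / number of samples in the same class."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[label] for label in labels]

# Hypothetical utterance-level WER values.
wers = [0.00, 0.04, 0.10, 0.25, 0.30, 0.50, 0.12, 0.08]
labels = label_by_threshold(wers)
weights = inverse_frequency_weights(labels)
```

With this rule the weights of each class sum to the same value (the training set size), so the minority "good" class contributes as much to the loss as the majority "bad" class.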

Data
Our datasets include English audio recordings from four different domains: broadcast news (henceforth News), political speeches (Legal), weather reports (Weather) and talks of single speakers in the context of the TED talks (TED). All datasets (see Table 1 for details) were used in past ASR evaluation campaigns, and are provided with manual reference transcriptions associated with each audio recording.
News. We use the HUB4 corpus (see footnote 5), which contains 104 hours of broadcasts from different television and radio networks. We selected the 1999 test set of the DARPA Hub-4 evaluation, consisting of two recordings acquired in TV studios and containing speech of professional speakers reading news.
Legal. This audio database (see footnote 6) contains recordings of European Parliament members speaking in plenary sessions, as well as recordings of interpreters (non-native speakers). Speech is hence quite spontaneous, and a relevant level of reverberation is present due to the usage of [...]
Weather. This dataset consists of recordings of weather reports broadcast by the BBC English TV channel, and contains both national and local weather forecasts. There are roughly 50 native speakers and the speech is delivered very quickly. Although the speakers are native and the recordings are performed in a controlled environment, there are some hesitations, grammar errors or lengthy formulations in the recordings which are corrected in the captions (which can thus be considered as loose reference transcripts (Mohr et al., 2013)).
TED. This dataset contains audio recordings of English speakers (28 different talks) and was used within the IWSLT 2013 evaluation campaign (Cettolo et al., 2013). This domain presents large variability of topics (hence a large, unconstrained vocabulary), presence of non-native speakers, and a rather informal speaking style.
5 Distributed by the Linguistic Data Consortium and available at https://catalog.ldc.upenn.edu/docs/LDC2000S88/
6 http://catalog.elra.info/product_info.php?products_id=1032
Given their diverse nature, the four domains present a big challenge both for ASR and QE systems. From Table 1 it is possible to grasp several differences among them. One aspect that reflects such differences is the WER of the ASR system we used to transcribe the utterances (described in Section 4.2). The lowest WER is for Weather, a domain in which the speech is planned. This is also the domain with the shortest average utterance duration (5 sec.), the lowest number of speakers (36) and the lowest number of running words (23,722). The higher WER achieved on the other domains is due to the more challenging conditions posed by each of them. TED and News include speeches about unconstrained topics, and their average utterance durations tend to be longer than for the other two domains. News is the shortest domain in duration and the smallest in number of utterances (150 min. for 737 utterances), but has the highest number of speakers. This means that there are very few utterances per speaker, on average, and that both the ASR and the QE system must cope with the differences in speech across all these speakers. Legal presents the second largest number of speakers, both native and non-native, using a specific terminology on a wide range of topics.

ASR System
The ASR engine used in our experiments makes use of Hidden Markov Models (HMMs) of triphone units and of 4-gram back-off language models (LMs). HMMs are trained on domain-specific sets of audio data. The HUB4 training corpus is released with "verbatim" transcriptions of the audio signals while, for the other three domains (i.e. Legal, Weather and TED), training data only have associated captions, which are not always exact transcriptions of the corresponding audio recordings. To extract audio segments with reliable transcriptions we hence applied a lightly supervised training procedure (Lamel et al., 2001). This resulted in 67 hours of recordings for the Weather domain, 144 hours for TED, 164 hours for News and 100 hours for Legal. For LM training, a general-purpose LM is first trained on the Gigaword text corpus (5th ed.) (Parker et al., 2011) and then adapted to each domain using domain-specific text data. Each automatic transcription of the data presented in Table 1 is generated with the corresponding word and time boundaries that are aligned with the reference utterances. This allows us to compute the utterance WER and the features for the various prediction models.

Features
Our models are trained with the same 52 "black-box" features proposed in previous work, which can be categorized into three groups: Signal, Hybrid and Textual. The first group aims to capture the difficulty of transcribing the input and is extracted by looking at the signal segment as a whole. Hybrid features provide a more fine-grained way to capture the transcription difficulty, by linking the signal to the output transcription. Textual features aim to capture the plausibility/fluency of a transcription considering its surface word information.

Evaluation Metrics
Regression. Our regression models are evaluated in terms of mean absolute error (MAE). The MAE, a standard error measure for regression, is the average of the absolute difference between the prediction $\hat{y}_i$ of a model and the gold standard response $y_i$ for all instances in the test set. As it is an error measure, lower values indicate better performance.
Classification. To handle the imbalanced class distribution, and equally reward the correct classification on both classes, our evaluation is carried out in terms of balanced accuracy (BA; the higher the better), which is computed as the average of the accuracies on the two classes (Brodersen et al., 2010). When the distribution of classes is balanced, BA is equal to the accuracy metric.
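Both metrics are simple enough to be written out directly. The sketch below is an illustrative implementation with hypothetical function names and toy values; it assumes that every class occurs at least once in the gold labels.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between gold and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred, classes=("good", "bad")):
    """Average of the per-class accuracies (assumes each class is present)."""
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

# Hypothetical predictions.
mae = mean_absolute_error([0.1, 0.3], [0.2, 0.1])
ba = balanced_accuracy(["good", "bad", "bad", "bad"],
                       ["good", "bad", "bad", "good"])
```

In the classification toy case, plain accuracy would be 3/4, while BA averages an accuracy of 1 on "good" and 2/3 on "bad", so errors on either class are penalized regardless of class size.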

Baselines
Regression. We compare the MTL methods against two baselines. The first one (Mean), simple but often hard to beat for regression models, is computed by labeling all the test instances with the mean WER value calculated on the training set. The second baseline is an STL algorithm trained on data from the target domain. The algorithm that we used (STL Elastic henceforth) is the elastic net (Zou and Hastie, 2005). Parameter estimation is performed with 5-fold cross-validation.
Classification. In this setting we also consider two baselines. The first one (Majority) is computed by labeling all the test instances with the most frequent label in the training set and, by definition, corresponds to a score of 0.5 in terms of balanced accuracy. The second classification baseline is logistic regression (STL LogReg henceforth), also known as the maximum entropy algorithm (Hastie et al., 2009). We perform parameter optimization for LogReg using stratified 5-fold cross-validation in a randomized search process (Bergstra and Bengio, 2012).
For both STL baselines we selected algorithms that induce linear models and use the same loss functions as the MTL methods (least squares for regression and logistic regression for classification).
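The two trivial baselines can be sketched in a few lines. The names and the toy data below are assumptions; the snippet also checks why a constant majority prediction yields a balanced accuracy of 0.5 when both classes occur in the test set.

```python
from collections import Counter

def mean_baseline(train_wers, n_test):
    """Predict the mean training WER for every test instance."""
    mean_wer = sum(train_wers) / len(train_wers)
    return [mean_wer] * n_test

def majority_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

# Hypothetical training and test data.
train_wers = [0.0, 0.1, 0.2, 0.5]
train_labels = ["bad", "bad", "bad", "good"]
test_labels = ["good", "bad", "bad", "bad", "good"]

wer_preds = mean_baseline(train_wers, len(test_labels))
label_preds = majority_baseline(train_labels, len(test_labels))

# A constant prediction is always right on one class and always wrong on the other.
class_accuracies = []
for c in ("good", "bad"):
    idx = [i for i, t in enumerate(test_labels) if t == c]
    class_accuracies.append(sum(label_preds[i] == c for i in idx) / len(idx))
majority_ba = sum(class_accuracies) / len(class_accuracies)
```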

Results and Discussion
To mitigate the effect of having considerably different amounts of training data in the four domains, and equally weight their contribution to the learning task, all our models (STL and MTL) are trained using the same number of instances from all the domains and, at most, half of the data available for the smallest domain, News (i.e. 362 instances). To analyze performance variations with different amounts of data, we create subsets of the 362 instances, for 10 different sizes ranging from 10% to 100% of the instances for each domain. We repeat this process 30 times by randomly shuffling all the data available for each domain. For each of the resulting learning curves, the plots in this section present the confidence intervals (at 95%) for the 30 different train/test splits.
In addition to the STL model trained only on in-domain data, we also experiment with an STL model trained on the concatenation of the training data of all domains. Its results are, on average, statistically comparable to, or worse than, STL in-domain for both regression and classification.
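The sampling protocol can be approximated with the sketch below. Several details are assumptions made only for illustration: the rounding of the subset sizes, the nesting of smaller subsets inside larger ones, the fixed random seed, and the normal-approximation confidence interval (the text does not specify how the intervals are computed).

```python
import random
import statistics

def learning_curve_sizes(n_max=362, steps=10):
    """Subset sizes from 10% to 100% of n_max (rounded to integers)."""
    return [round(n_max * s / steps) for s in range(1, steps + 1)]

def sample_subsets(instances, n_max=362, steps=10, repetitions=30, seed=0):
    """Yield (repetition, size, subset) after reshuffling the domain data."""
    rng = random.Random(seed)
    sizes = learning_curve_sizes(n_max, steps)
    for rep in range(repetitions):
        shuffled = instances[:]
        rng.shuffle(shuffled)
        pool = shuffled[:n_max]
        for size in sizes:
            yield rep, size, pool[:size]

def confidence_interval_95(scores):
    """Normal-approximation 95% interval around the mean of the scores."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean - half_width, mean + half_width

# Hypothetical domain with 737 utterance identifiers.
subsets = list(sample_subsets(list(range(737))))
low, high = confidence_interval_95([0.20, 0.22, 0.19, 0.21])
```

Each point of a learning curve would then be the mean score over the 30 repetitions for one subset size, plotted with its interval.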
Regression. Among the three MTL regression algorithms, RMTL achieves the best results in all our tests. This suggests that its capability to handle domain divergence, thus avoiding negative transfer, is required to increase performance. For the sake of visualization, in the plots in Figure 1 we hence omit the curves of the other MTL methods, keeping only those of RMTL and the two baselines.
As shown in the figure, for the Legal domain, RMTL results are better than those of both the baselines (lower MAE) even with 30% of the data and, except in one case (40% of the data), the improvement over STL (always the stronger baseline) is statistically significant. For Weather and TED, the improvement is less evident: more data are required to outperform the STL baseline (respectively 50% and 60%), the improvements are not always statistically significant and, for TED, the MAE results converge to those of STL with 100% of the data. For the News domain RMTL's performance is always comparable to STL. An interesting behavior can be observed in the Legal domain, in which the Mean baseline degrades as we add training data. This suggests that, even within the domain, training and test labels have very different distributions. A smaller degradation is observed for the STL model, which improves over the Mean baseline as it also uses the information captured by the features. The two baselines, however, assume that both training and test data come from similar distributions. Instead, by also taking advantage of the knowledge transferred from the other domains, RMTL makes it possible to cope with the differences between training and test data.
Classification. In this setting we compare the MTL algorithms (L21 and Lasso) with the STL (LogReg) and Majority baselines. As shown in Figure 2, the two MTL models (which significantly outperform the Majority baseline in all conditions) always achieve a higher balanced accuracy than single task learning in three domains (TED, Legal and Weather). In the Weather domain, the performance improvement over the STL baseline is always statistically significant when using from 20% to 100% of the training data. For TED and Legal, MTL performance tends to converge to the results of STL when the models are trained on 100% of the data (around 65% BA), with an improvement that remains statistically significant only for TED. For the News domain, similar to the regression setting, the improvement of MTL over STL is less evident. Indeed, only L21 outperforms the single task baseline but the difference is not statistically significant.
Our classification results can be explained by taking into consideration the distribution of positive and negative instances in each domain. Weather, for which MTL always outperforms STL, has the most balanced distribution (35% good and 65% bad). In the other three domains, instead, the proportion of negative samples is always above 77%. Although in this penalized condition all algorithms are supported by sample weighting, MTL seems to better exploit this technique when the target domain is more balanced.
The challenging nature of the data we are using (described in Section 4.1) is corroborated by the moderate performance achieved by STL. Although it is trained with in-domain data, the best STL classification model (for the Legal domain) does not exceed a BA of 66%. In this difficult scenario, the usefulness of MTL is demonstrated by its capability of reaching the best performance of STL with smaller amounts of data in most of the cases (e.g. 30% of the data for the Legal domain).
Domain divergence. To further analyze the performance of MTL in regression and classification, following previous works on MTL and domain adaptation in computer vision (Costante et al., 2014; Samanta et al., 2014), we use maximum mean discrepancy (MMD) as a measure of divergence between domains. MMD is an effective way to compare two multivariate distributions p and q by measuring the difference, in a Reproducing Kernel Hilbert Space (RKHS), between the means of the projected distributions (Gretton et al., 2012). It is defined as
$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right)$$
where x and y are points sampled i.i.d. from the two domains (with distributions p and q) and f(·) is a continuous bounded function drawn from the class $\mathcal{F}$ (usually the unit ball of the RKHS). We measure the pairwise divergences among the domains described in Section 4.1 using the features extracted and a radial basis function kernel. The divergences are presented in Figure 3.
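For a kernel-based function class, MMD can be estimated from two samples using only kernel evaluations. The sketch below uses the standard biased empirical estimate with a radial basis function kernel; the function names, the kernel width `gamma` and the toy feature vectors are assumptions, and the estimator is not necessarily the exact variant used in the experiments.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def mean_kernel(A, B, gamma):
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    return sum(rbf_kernel(a, b, gamma) for a in A for b in B) / (len(A) * len(B))

def mmd(X, Y, gamma=1.0):
    """Biased empirical MMD estimate between samples X and Y."""
    squared = (mean_kernel(X, X, gamma) + mean_kernel(Y, Y, gamma)
               - 2.0 * mean_kernel(X, Y, gamma))
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative round-off

# Hypothetical 2-dimensional feature vectors from three "domains".
domain_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
domain_b = [[0.05, 0.05], [0.1, 0.1], [0.0, 0.05]]   # close to domain_a
domain_c = [[10.0, 10.0], [10.1, 10.0], [10.0, 9.9]]  # far from domain_a
near = mmd(domain_a, domain_b)
far = mmd(domain_a, domain_c)
```

A sample compared with itself gives an estimate of zero, and the estimate grows as the two samples move apart; with this kernel it cannot exceed the square root of 2.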
According to the pairwise MMD, the most divergent pair is News-Weather, which is followed by News-Legal. The distance between News and the other domains indicates that, when it is used as target, knowledge transfer from the other domains might be problematic. In fact, looking at the results obtained by classification and regression models for News, we notice that none of the MTL methods achieves significant improvements over the STL baselines. Furthermore, the RMTL regression learning curve (Figure 1) for News shows that RMTL follows the same curve as STL, meaning that it is able to handle the high divergence between News and the other domains and hence learns mostly from in-domain data.
In general, the divergence measurements between the domains are relatively high (the values are closer to 1 than to 0). This is not surprising given the intra- and inter-domain variability of speakers and topics, the different conditions in which speech was recorded, and the WER differences across domains. However, the interesting aspect evidenced by the measurements is that MMD successfully approximates such domain differences (and, likely, other more implicit diversity indicators), which makes it a useful instrument to measure domain relatedness.

Conclusion
We presented a supervised approach to ASR quality estimation aimed at coping with large differences between training and test data. To achieve robustness and adaptability to such differences, we exploited the capability of multitask learning, which allows QE models to make the best use of training data coming from multiple domains by transferring knowledge across them. The MTL learning paradigm was applied both in regression mode (WER prediction) and, in a preliminary investigation, for binary classification (assignment of 'good'/'bad' quality labels). In both settings, we experimented with different amounts of English data coming from four very diverse domains (different genres, speakers, topics, and styles).
Our results indicate that MTL, which we used for the first time in ASR QE, is able to take advantage of data coming from such heterogeneous domains and to significantly improve over single-task learning baselines both in regression and in classification. Although the extent of the improvement depends on the divergence between the domains (a major issue for any supervised learning task), our results show that in the worst case MTL performance converges to the results of single-task learning. Overall, by suggesting a way to overcome the main limitations of previous approaches, our study opens interesting research avenues towards reference-free, system-agnostic and real-time ASR output evaluation.