Learning Robust and Multilingual Speech Representations

Unsupervised speech representation learning has shown remarkable success at finding representations that correlate with phonetic structures and improve downstream speech recognition performance. However, most research has been focused on evaluating the representations in terms of their ability to improve the performance of speech recognition systems on read English (e.g. Wall Street Journal and LibriSpeech). This evaluation methodology overlooks two important desiderata that speech representations should have: robustness to domain shifts and transferability to other languages. In this paper we learn representations from up to 8000 hours of diverse and noisy speech data and evaluate the representations by looking at their robustness to domain shifts and their ability to improve recognition performance in many languages. We find that our representations confer significant robustness advantages to the resulting recognition systems: we see significant improvements in out-of-domain transfer relative to baseline feature sets and the features likewise provide improvements in 25 phonetically diverse languages including tonal languages and low-resource languages.


Introduction
The input representation of a machine learning model strongly determines the difficulty faced by the learning algorithm, how much data the learner will require to find a good solution, and whether the learner generalizes out of sample and out of the domain of the training data. Representations (or features) that encode relevant information about data enable models to achieve good performance on downstream tasks, while representations that are invariant to factors that are not relevant to downstream tasks can further improve generalization. Traditionally, many invariances were hard-coded in feature extraction methods. For example, in image representations, geometric and photometric invariance has been investigated (Mundy et al., 1992; Van De Weijer et al., 2005). For acoustic representations, standard MFCC features are sensitive to additive noise and many modifications have been proposed to overcome those limitations (Dev and Bansal, 2010; Kumar et al., 2011).
Recently, unsupervised representation learning algorithms have shown significant improvements in learning representations that correlate well with phonetic structure (van den Oord et al., 2018; Kahn et al., 2019b) and in improving downstream speech recognition performance. Most of this work focused on learning representations from read English speech (from the LibriSpeech and LibriVox datasets) and evaluating the features when used to recognize speech in a rather similar domain (read English text). However, this approach to evaluation fails to test for the invariances that we would like good speech representations to have: robustness to domain shifts and transferability to other languages.
In our experiments we learn representations from 8000 hours of diverse and noisy speech, using an extended version of the contrastive predictive coding model: bidirectional predictive models with dense residual connections (§2-§4). We evaluate the robustness and transferability of our representations by estimating how invariant they are to domain and language shifts. To do so, an ASR model is trained using our representations on one dataset but evaluated on the test sets of other datasets. In this experiment, we find that the representations derived from the large pretraining dataset make the ASR model much more robust to domain shifts, compared both to log filterbank features and to pretraining just on LibriSpeech. We also train ASR models on 25 languages, including low-resource languages (e.g. Amharic, Fongbe, Swahili, Wolof), and show that our representations significantly outperform both standard features and those pretrained only on clean English data in the language transfer setup.
In summary, we confirm several increasingly common patterns that may be discerned in the literature on unsupervised representation learning, across a variety of modalities. First, scale matters: good representation learning requires a large amount of data. Second, unsupervised representations consistently improve robustness on downstream tasks. And finally, representations learned from multilingual data can transfer across many languages.

Contrastive Predictive Coding: CPC
Unsupervised representation learning methods rely on differentiable objectives which quantify the degree to which representations have succeeded at capturing the relevant characteristics in data. Mutual information measures relationships between random variables (Fano and Hawkins, 1961). Mutual information maximization techniques, which learn representations that describe data by maximizing mutual information between data and representation variables, have been explored for a long time in unsupervised representation learning (Linsker, 1988; Bell and Sejnowski, 1995). However, since the exact computation of mutual information is not tractable for continuous variables, many estimators have recently been proposed to enable unsupervised representation learning with neural networks (Belghazi et al., 2018; van den Oord et al., 2018; Poole et al., 2019).
Contrastive predictive coding (van den Oord et al., 2018, CPC) is a mutual information maximization method that has been successfully applied to many modalities such as images and speech (Hénaff et al., 2019). The objective is designed to extract features that allow the model to make long-term predictions about future observations. This is done by maximizing the mutual information of these features with those extracted from future timesteps. The intuition is that the representations capture different levels of structure depending on how far ahead the model predicts. For example, if the model only predicts a few steps ahead, the resulting representations can capture local structures. On the other hand, if the model predicts further into the future, the representations will need to infer "slow features" (Wiskott and Sejnowski, 2002), that is, more global structures such as phonemes, words and utterances in speech.
The overall unsupervised learning process is visualized in Figure 1. Given a raw audio signal of length $L$ ($x = x_1, x_2, \ldots, x_L$, $x_i \in \mathbb{R}$, where $x_i$ represents the acoustic amplitude at time $i$), a function $g_{enc}$ encodes the audio signal into vector representations ($z = z_1, z_2, \ldots, z_M$, $z_i \in \mathbb{R}^{d_z}$). Next, an autoregressive function $g_{ar}$, such as a recurrent neural network, summarizes the past representations and produces context vectors ($c = c_1, c_2, \ldots, c_M$, $c_i \in \mathbb{R}^{d_c}$). The representations are learned to maximize mutual information between context vectors ($c_t$) and future latent representations ($z_{t+k}$) as follows:
$$I(c_t, z_{t+k}) = \sum_{c_t, z_{t+k}} p(c_t, z_{t+k}) \log \frac{p(z_{t+k} \mid c_t)}{p(z_{t+k})} \qquad (1)$$
Since the mutual information is not tractable for high dimensional data, it is common to use a lower bound on the mutual information such as InfoNCE (van den Oord et al., 2018), which is a loss function based on noise contrastive estimation (Gutmann and Hyvärinen, 2010). Given a set $Z = \{z_1, \ldots, z_N\}$ which contains one positive sample from $p(z_{t+k} \mid c_t)$ and $N - 1$ negative samples from a "noise" distribution $p(z)$, the approximated lower bound is written as:
$$I(c_t, z_{t+k}) \geq \mathbb{E}_Z \left[ \log \frac{f_k(c_t, z_{t+k})}{\frac{1}{N} \sum_{\tilde{z} \in Z} f_k(c_t, \tilde{z})} \right] = \mathcal{L}^{NCE}_{tk} \qquad (2)$$
where $f_k(c_t, z_{t+k})$ is a scoring function. We used the standard log-bilinear model as follows:
$$f_k(c_t, z_{t+k}) = \exp\left(c_t^\top W_k z_{t+k}\right)$$
The loss function we maximize is a sum of the InfoNCE loss for each step, $\mathcal{L}^{NCE} = \sum_t \sum_k \mathcal{L}^{NCE}_{tk}$, and the negatives are uniformly sampled from representations in the same audio signal ($z$).
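As an illustration of the InfoNCE objective with a log-bilinear scorer described above, the following is a minimal pure-Python sketch, not the implementation used in our experiments. The function names, the list-of-lists layout of the weight matrix, and the choice to exclude the positive frame when drawing negatives (and to draw without replacement) are assumptions made for the example.

```python
import math
import random


def log_bilinear_score(c_t, W_k, z):
    """Return log f_k(c_t, z) = c_t^T W_k z for a log-bilinear scorer."""
    return sum(
        c_t[i] * W_k[i][j] * z[j]
        for i in range(len(c_t))
        for j in range(len(z))
    )


def info_nce_term(c_t, W_k, z_pos, z_negs):
    """InfoNCE value for one (t, k) pair.

    Computes log( f(pos) / ((1/N) * sum of f over all N candidates) ),
    where the candidates are the positive plus the N - 1 negatives.
    """
    log_scores = [log_bilinear_score(c_t, W_k, z) for z in [z_pos] + z_negs]
    top = max(log_scores)  # subtract the max before exp for stability
    log_sum = top + math.log(sum(math.exp(s - top) for s in log_scores))
    return log_scores[0] - log_sum + math.log(len(log_scores))


def sample_negatives(z_seq, pos_index, n_neg, rng=random):
    """Uniformly draw negatives from the same utterance.

    Assumption: the positive frame itself is excluded from the pool,
    and negatives are drawn without replacement.
    """
    pool = [i for i in range(len(z_seq)) if i != pos_index]
    return [z_seq[i] for i in rng.sample(pool, n_neg)]
```

Under this formulation each term is at most log N, which is reached when the positive sample dominates the scores, and it equals zero when the scorer cannot distinguish the positive from the negatives.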

Methods
In this section, we describe our models and objectives for unsupervised representation learning and downstream speech recognition. First, an acoustic feature extractor is trained with a bidirectional variant of contrastive predictive coding on an unlabeled audio dataset. Next, the parameters of this model are frozen and its output representations are used as input to train various speech recognition models, potentially on a different or smaller labeled dataset (Figure 1).

Unsupervised learning with bi-directional CPC
Following the success of bidirectional models in representation learning (Peters et al., 2018; Devlin et al., 2019), we extend the original CPC method explained above with bidirectional context networks. The encoder function $g_{enc}$ is shared for both directions, but there are two autoregressive models ($g^{fwd}_{ar}$ and $g^{bwd}_{ar}$) which read encoded observations ($z$) from the forward and backward contexts, respectively. The forward and backward context representations $c^{fwd}_t$ and $c^{bwd}_t$ are learned with separate InfoNCE losses. When they are used for downstream tasks, a concatenation of the two representations, $c_t = [c^{fwd}_t; c^{bwd}_t]$, is used. A similar technique has been used in image representation learning, where representations are learned along different spatial dimensions (Hénaff et al., 2019).
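The way the two directions are combined can be sketched as follows. This is only a toy illustration: a running mean stands in for the autoregressive models (the actual models are convolutional networks), and the function names are hypothetical.

```python
def running_mean(seq):
    """Toy causal summarizer: output t is the mean of inputs 0..t.

    This is a stand-in for a learned autoregressive model, used only
    to make the data flow concrete.
    """
    out = []
    total = [0.0] * len(seq[0])
    for t, z in enumerate(seq, start=1):
        total = [a + b for a, b in zip(total, z)]
        out.append([a / t for a in total])
    return out


def bidirectional_contexts(z):
    """Concatenate forward and backward context vectors per timestep."""
    c_fwd = running_mean(z)              # position t sees z_1 .. z_t
    c_bwd = running_mean(z[::-1])[::-1]  # position t sees z_t .. z_M
    return [f + b for f, b in zip(c_fwd, c_bwd)]  # [c_fwd ; c_bwd]
```

The backward pass is obtained by reversing the encoded sequence, applying a causal model, and reversing the result, so each direction remains strictly one-sided while the concatenated vector covers both contexts.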
All audio signals have a sampling rate of 16 kHz, and we normalize the mean and variance of the input signals over each utterance in order to mitigate volume differences between samples. For architectures, we use encoder and autoregressive models similar to those used in prior work. The encoder function $g_{enc}$ is a stack of causal convolutions with kernel sizes (10, 8, 4, 4, 4, 1, 1) and stride sizes (5, 4, 2, 2, 2, 1, 1), so that it emits one vector per 10 ms of audio (a total stride of 160 samples at 16 kHz). For the autoregressive functions, we use a 13-layer causal convolution architecture with kernel sizes (1, 2, ..., 12, 13) and stride size 1, for both the forward and backward functions. Layer normalization across the temporal and feature dimensions is applied to every layer. Also, each layer has dense skip connections with layers below, as in DenseNet (Huang et al., 2017). The objective function we optimize is the sum of the forward and backward InfoNCE losses (Eq. 2).
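The timing implied by the encoder's kernel and stride sizes can be checked with a short calculation. The sketch below uses the standard recurrence for stacked strided convolutions and also includes a per-utterance normalization helper; the helper names and the small `eps` constant are assumptions for illustration. Under this recurrence the 10 ms figure corresponds to the spacing between consecutive output vectors, while the span of input samples that influences one output vector is longer.

```python
import math


def hop_and_receptive_field(kernels, strides):
    """Return (hop, receptive_field) in input samples for stacked 1-D convs."""
    hop, field = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * hop  # each layer widens the span by (k - 1) hops
        hop *= s                # striding multiplies the spacing of outputs
    return hop, field


def normalize_utterance(samples, eps=1e-8):
    """Zero-mean, unit-variance scaling over one utterance (eps is assumed)."""
    mean = sum(samples) / len(samples)
    var = sum((v - mean) ** 2 for v in samples) / len(samples)
    scale = math.sqrt(var) + eps
    return [(v - mean) / scale for v in samples]


SAMPLE_RATE = 16_000
hop, field = hop_and_receptive_field(
    (10, 8, 4, 4, 4, 1, 1), (5, 4, 2, 2, 2, 1, 1)
)
hop_ms = 1000 * hop / SAMPLE_RATE      # 160 samples -> 10.0 ms
field_ms = 1000 * field / SAMPLE_RATE  # 465 samples -> about 29 ms
```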

Semi-supervised speech recognition
Once the acoustic representations are trained, the resulting context vectors ($c$) are used as inputs to character-level speech recognition models which predict transcriptions of audio signals character by character. The model first predicts frame-level character probabilities with a series of convolution layers, while the CTC forward algorithm (Graves et al., 2006) calculates conditional probabilities of a transcription given an audio signal. The model parameters are trained to maximize the log likelihood of the data. Training terminates when the word error rate on the development set stops improving or the model has trained for more than a certain number of epochs. The models are evaluated on the standard word error rate (WER) metric on held-out test data. During training, the parameters in the speech recognition models are trained with supervision but the parameters of the representation models remain fixed. For decoding, we use greedy CTC decoding. In most experiments, we do not use a language model (LM) in order to isolate the effects of the acoustic representations, but we do include results with a 4-gram LM to facilitate comparisons with published results.
Common practice in unsupervised representation learning is to evaluate learned representations using a linear classifier rather than a more complex nonlinear model. However, we find that a simple linear layer followed by a CTC decoder does not have enough capacity to recognize speech. Thus, for our first set of experiments we use a smaller version of DeepSpeech2 (Amodei et al., 2016) to predict the frame-level character probabilities. The model has two 2d-convolutions with kernel sizes (11, 41) and (11, 21) and stride sizes (2, 2) and (1, 2), and one unidirectional recurrent neural network (GRU) on top of the output from the convolution layers. A linear transformation and a softmax function are applied to predict frame-level character probabilities. See DeepSpeech2 small (Amodei et al., 2016) for the model specifics. In order to further investigate how the representations interact with larger speech recognition models, we use the time-delay neural networks (TDNN) that are commonly used in speech recognition (Collobert et al., 2016; Kuchaiev et al., 2018). These consist of 17 layers of 1d-convolutions followed by 2 fully connected layers. Refer to OpenSeq2Seq for a detailed description. These large models have been designed to perform well with log-filterbank features and purely supervised learning on large datasets, so they represent a challenging and informative test case for the value of learned representations.
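Greedy CTC decoding and the WER metric mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used for the reported numbers; the function names, the convention that index 0 is the CTC blank, and whitespace tokenization for WER are assumptions.

```python
def greedy_ctc_decode(frame_probs, alphabet, blank=0):
    """Pick the best symbol per frame, merge repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A blank between two identical symbols keeps them distinct, which is how a CTC model can emit doubled letters.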

Datasets
We collected publicly available speech datasets which cover a variety of types of speech (e.g. read and spoken), noise conditions and languages. For unsupervised pretraining we use a combination of datasets, using the audio but not any transcriptions, even when they are available. For semi-supervised learning (i.e., evaluation) on top of the representations we use the transcribed datasets following their standard train-test splits. Table 1 summarizes the datasets used for unsupervised learning and English speech recognition tasks.
Unlabeled speech pretraining corpus. For pretraining, we collected a diverse and noisy speech corpus from several existing datasets: the subset of Audio Set (Gemmeke et al., 2017) containing speech examples, the audio part of AVSpeech (Ephrat et al., 2018), and the Common Voice (CV) dataset in all 29 available languages. In addition we used the audio from TIMIT (Garofolo, 1993), ignoring its transcriptions. Finally, we include the audio (again ignoring transcriptions) from the standard training splits of the evaluation datasets below. This collection spans a range of recording conditions, noise levels, speaking styles, and languages and amounts to about 8000 hours of audio.
Transcribed read English. For evaluation, we look at the performance of our representations on a variety of standard English recognition tasks, as well as how well models trained on one dataset perform when tested on another. For read English, we use LibriSpeech (Panayotov et al., 2015) and the Wall Street Journal (Paul and Baker, 1992).

Transcribed spoken English
To explore more extreme domain shifts, we additionally used conversational speech and public speaking datasets. We used Switchboard (Godfrey et al., 1992), a standard conversational speech recognition dataset consisting of two-sided telephone conversations (test only). Since the data was recorded more than 10 years ago and at a lower sampling rate than the other corpora, it presents a noisy and challenging recognition problem. Finally, we also use the Tedlium-3 (Hernandez et al., 2018) corpus, a large spoken English dataset containing 450 hours of speech extracted from TED conference talks. The recordings are clear, but there is some reverberation.
Transcription normalization. Since we are comparing ASR systems trained on one dataset but evaluated on the test set of another, we normalize transcriptions to reduce systematic biases in the transfer condition. To do so, we use the format of the LibriSpeech dataset, which also ensures that our results are comparable with standard speech recognition systems on that task (Kuchaiev et al., 2018). For the other datasets, transcriptions are lowercased and unpronounced symbols (e.g., punctuation, silence markers) are removed. We also remove utterances containing numbers as they are transcribed inconsistently across and within datasets.
Transcribed multilingual speech. In order to evaluate the transferability of the representations, we use speech recognition datasets in 4 African languages collected by the ALFFA project: Amharic (Tachbelie et al., 2014), Fongbe (Laleye et al., 2016), Swahili (Gelas et al., 2012), and Wolof (Gauthier et al., 2016). These languages have unique phonological properties (e.g. height harmony) and phonetic inventories, making them a good contrast to English. These African languages are low-resource, each with 20 hours or less of transcribed speech. We also use 21 phonetically diverse languages from OpenSLR. See Appendix A for more detail.
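The normalization steps described above could look roughly like the following sketch. The marker strings, the decision to keep apostrophes, and the function name are hypothetical choices made for the example; the exact rules applied to each dataset are not specified here.

```python
import re
import string

# Hypothetical non-speech markers; real corpora use their own conventions.
DEFAULT_MARKERS = ("<sil>", "[noise]")

# Strip punctuation but keep apostrophes (an assumption, e.g. for "don't").
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def normalize_transcript(text, markers=DEFAULT_MARKERS):
    """Return the normalized transcript, or None if it should be dropped."""
    if re.search(r"\d", text):
        return None  # utterances containing numbers are removed
    text = text.lower()
    for marker in markers:
        text = text.replace(marker, " ")
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())  # collapse repeated whitespace
```

Markers are removed before punctuation so that a token such as `<sil>` disappears entirely instead of leaving the word `sil` behind.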

Unsupervised Representation Learning
We train the model described above (§3.1) using the datasets described in the previous section (§4.1). Similarly to Schneider et al. (2019), audio signals are randomly cropped with a window size of 149,600 observations (9.35 seconds) and encoded with the model. The bidirectional contrastive predictive coding objective (Eq. 2) with 12 prediction steps ($k$) and 10 negatives ($N$) is optimized with the Adam optimizer with learning rate 0.0001. A batch size of 128 is used, as well as a polynomial learning rate scheduler with power 2 and gradient clipping with maximum norm 5.0. Training was terminated at 4.2 million steps based on speech recognition performance on the dev (= validation) set of the LibriSpeech corpus.
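Two of the training details above, random cropping and the polynomial learning rate schedule, can be sketched as follows. The schedule uses the common form lr = (base - end) * (1 - step/total)^power + end; the decay horizon `total_steps`, the final rate `end_lr`, and the function names are assumptions, since they are not stated in the text.

```python
import random

WINDOW = 149_600  # samples per crop; 149,600 / 16,000 Hz = 9.35 seconds


def random_crop(signal, window=WINDOW, rng=random):
    """Return a random contiguous window, or the whole signal if it is short."""
    if len(signal) <= window:
        return signal
    start = rng.randrange(len(signal) - window + 1)
    return signal[start:start + window]


def polynomial_lr(step, total_steps, base_lr=1e-4, end_lr=0.0, power=2.0):
    """Polynomial decay from base_lr to end_lr over total_steps (assumed)."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr
```

With power 2 the rate falls quickly at first: halfway through the horizon it is already a quarter of the base rate.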

Robustness
Robustness to shifts in domain, recording conditions, and noise levels is an important desideratum for a good ASR system, and we hypothesized that the diversity of our largest pretraining regime would improve robustness along these dimensions. In contrast, standard MFCC features have been tested in terms of noise robustness and it is known that such representations are sensitive to additive noise (Zhao and Wang, 2013). Moreover, speech recognition systems developed on top of such features are not robust when they are evaluated on out-of-domain datasets (Amodei et al., 2016).
To test whether our pretraining approach improves robustness, we evaluate speech recognition models trained on the learned representations on many different datasets so as to investigate the benefit of using the representations learned from large-scale data. We compare ASR systems trained on all of the Wall Street Journal and LibriSpeech corpora, with the same optimization as explained above, and evaluate word error rate on different evaluation sets, such as phone call conversations (Switchboard). Table 2 summarizes the results for models trained on the Wall Street Journal, LibriSpeech or Tedlium corpora and evaluated on different evaluation sets. CPC-LibriSpeech and CPC-8k denote representations learned from LibriSpeech and from the 8000 hours of speech datasets listed above, respectively. The features trained on large-scale data consistently outperform other representations across different evaluation sets. The speech recognition models trained on the Wall Street Journal perform badly on phone call data in general. However, CPC representations learned on large datasets are more robust than those trained only on read English data (LibriSpeech).

Low-resource Languages
Thus far, all our experiments have compared our representations in terms of their impacts on English recognition tasks (although we know that the pretraining dataset contains samples from many languages). We now turn to the question of whether these representations are suitable for driving recognition in different languages with phonetic properties substantially different from those of English. Specifically, we look at the performance on four languages (Amharic, Fongbe, Swahili, and Wolof) which manifest a variety of interesting phonological properties that are quite different from English. Evaluating on such languages will provide insights into the phonetic space learned in the representations. Moreover, our non-English languages are low-resource in terms of speech recognition data, but have 2-20 million native speakers each. It is therefore valuable if the representations learned from large-scale unlabelled data can improve low-resource speech recognition. Although there is a chance that the large-scale pretraining dataset may contain some examples from those languages, we did not add any extra data specifically from those languages.
To test the cross-linguistic value of these features, we trained speech recognition models on the low-resource languages (§4.1) and compare the relative reduction in WER obtained by switching from standard spectrogram features to the learned representations. As these are very small datasets, we trained the same DeepSpeech2 small architecture with the Adam optimizer with a fixed learning rate of 0.0002 and gradient clipping with maximum norm 25.0 for all languages. Figure 2 summarizes the results. Again, we find that the CPC-8k representations outperform other features by a large margin, and that the models trained on representations learned from the audio of (English-only) LibriSpeech do not perform even as well as basic spectrogram features. This suggests that the representations learned on large-scale data capture a phonetic space that generalizes across different languages, but that diversity of linguistic inputs is crucial for developing this universality.
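Gradient clipping with a maximum norm, as used in the optimizer settings above, can be sketched as follows. The sketch assumes that the norm is the global L2 norm over all gradient values, which is one common convention; the text does not say whether clipping is applied globally or per tensor, and the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length is capped.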

Multilingual Transfer
As a final exploration of the transferability of the representations, we evaluate the representations on a diverse set of languages with varying amounts of training data and compare the relative reductions in word error rate we obtain when switching from standard features to the CPC-8k representations. As most of the datasets are small, we trained DeepSpeech2 small models with the Adam optimizer with a fixed learning rate of 0.0002 and applied gradient clipping with maximum norm 25.0, using the same configuration for all languages. Since the experiments above showed that CPC-LibriSpeech features performed badly, we only compare the relative error reduction with CPC-8k features over spectrogram features. In all cases, we find that the CPC-8k representations improve performance relative to spectrogram feature baselines. The largest improvement was obtained on Sundanese, where the WER with spectrogram features was 27.85 but dropped to 11.49 using CPC-8k features.
Discussion. As our pre-training data did not have any language labels, it is unclear how many samples were seen for each language during pretraining. However, it is important to know that uncurated multilingual pre-training can improve speech recognition performance on many languages. These results suggest, in practice, that one could use a universal speech feature extractor for many languages instead of training one for each language individually (Kannan et al., 2019).
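For concreteness, the relative reduction for the Sundanese numbers quoted above can be worked out as follows, assuming the conventional definition (baseline minus new, divided by baseline); the function name is hypothetical.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new features."""
    return (baseline_wer - new_wer) / baseline_wer


# Sundanese: spectrogram WER 27.85, CPC-8k WER 11.49
sundanese = relative_wer_reduction(27.85, 11.49)  # about 0.587, i.e. ~58.7%
```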

Control: English Speech Recognition
Thus far, we have focused on robustness and transferability and seen that CPC-8k features offer considerable benefits in these dimensions compared to traditional features. It remains to demonstrate how well they work in powerful architectures where large amounts of labeled training data are available.
To test this, we used 10% and 100% portions of the LibriSpeech dataset to train speech recognition models, again comparing different features. Our architecture is a standard TDNN. The speech recognition models are trained in a similar way to standard models (Collobert et al., 2016; Kuchaiev et al., 2018): with the Adam optimizer with learning rate 0.0002, gradient clipping with maximum norm 5.0, and polynomial learning rate decay with power 2.0, over 200 epochs. Table 3 summarizes the results with TDNN models trained on different sizes of the LibriSpeech dataset. We see that even if the speech recognition models have a large number of parameters and are trained on plenty of supervised data, the learned representations still provide significant improvements. The pattern continues to hold if we use beam search decoding with a language model. Our + LM decoding results are comparable to the OpenSeq2Seq benchmark, since we used the exact same LM and decoding algorithm as they used (Kuchaiev et al., 2018).
Although better results can be obtained using newer architectures than TDNN (Park et al., 2019; Synnaeve et al., 2019), it still represents a standard and important recognition architecture, and the results show that the representations learned from diverse and noisy data can improve large speech recognition models on English in both low-data and high-data regimes.

Related Work
Unsupervised learning played an important role in the reintroduction of deep networks to speech processing (Hinton et al., 2012), as well as other application areas (Hinton et al., 2006; Bengio et al., 2007; Vincent et al., 2010). After a period of focusing on supervised techniques, unsupervised representation learning has recently seen a resurgence in a variety of modalities (Doersch and Zisserman, 2017; van den Oord et al., 2018; Donahue and Simonyan, 2019; Bachman et al., 2019) and has led to improved results, especially in low-data regimes (Hénaff et al., 2019). In natural language processing, pretrained representations can outperform state-of-the-art systems even in high-data regimes (Mikolov et al., 2013; Devlin et al., 2019).
The last two years have produced a large amount of work on unsupervised speech representation learning. Some of this work has been evaluated in terms of its ability to perform phone recognition and similar audio classification tasks (van den Oord et al., 2018). Like us, Schneider et al. (2019) applied learned representations to speech recognition tasks and evaluated how much in-domain WER was improved. However, as we argue in this paper, such an evaluation misses the opportunity to assess whether these systems become more robust to domain shift and to what extent the learned representations are appropriate for different languages.
Finally, the ZeroSpeech challenges have explicitly looked at correlations between learned representations and phonetic structures that generalize across many languages and adapt to new speakers (Dunbar et al., 2017, 2019). Kahn et al. (2019b) and Rivière et al. (2020) learned representations with contrastive predictive coding on 60,000 hours of English speech and showed that their representations correlate well with the phonetic structure of English and other languages; however, they did not evaluate these representations in a supervised speech recognizer.
Recently, there have been considerable improvements in purely supervised speech recognition systems. Data augmentation (Park et al., 2019) and self-training (Synnaeve et al., 2019; Kahn et al., 2019a) have advanced the state-of-the-art performance on English speech recognition. It is likely that augmentation methods are orthogonal to the proposed improvements on universal speech representation learning, and that one could combine both to improve results even further. Additionally, the impact of data augmentation and self-training on robustness can be further assessed using the methods proposed in this paper.

Conclusion
We have introduced an unsupervised speech representation learning method that discovers acoustic representations from up to 8000 hours of diverse and noisy speech data. We have shown, for the first time, that such pretrained representations lead speech recognition systems to be robust to domain shifts compared to standard acoustic representations, and compared to representations trained on smaller and more domain-narrow pretraining datasets. These representations were evaluated on a standard speech recognition setup where the models are trained and evaluated on in-domain data and also on transfer tasks where the models are evaluated on out-of-domain data. We obtained consistent improvements on 25 phonetically diverse languages including tonal and low-resource languages. This suggests we are making progress toward models that implicitly discover phonetic structure from large-scale unlabelled audio signals.

A Multilingual evaluation datasets
For the multilingual evaluation, we only include (labeled) datasets from OpenSLR that contain more than 1GB of audio. When there is more than one dataset available for one language, we used the largest dataset.
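The selection rule above amounts to a size filter followed by a per-language maximum. The sketch below illustrates it on a hypothetical catalog; the entry fields, the dataset names in the usage example, and the interpretation of 1GB as 1024^3 bytes are assumptions for illustration and do not describe real OpenSLR resources.

```python
ONE_GB = 1024 ** 3  # assumed interpretation of "1GB"


def select_datasets(catalog, min_bytes=ONE_GB):
    """Keep, per language, the largest dataset strictly above min_bytes."""
    best = {}
    for entry in catalog:
        if entry["bytes"] <= min_bytes:
            continue  # too small to be included
        current = best.get(entry["language"])
        if current is None or entry["bytes"] > current["bytes"]:
            best[entry["language"]] = entry
    return best


# Hypothetical catalog entries, for illustration only.
example_catalog = [
    {"name": "set-a", "language": "xx", "bytes": 3 * ONE_GB},
    {"name": "set-b", "language": "xx", "bytes": 5 * ONE_GB},
    {"name": "set-c", "language": "yy", "bytes": ONE_GB // 2},
]
selected = select_datasets(example_catalog)
```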