Improved Speech Representations with Multi-Target Autoregressive Predictive Coding

Training objectives based on predictive coding have recently been shown to be very effective at learning meaningful representations from unlabeled speech. One example is Autoregressive Predictive Coding (Chung et al., 2019), which trains an autoregressive RNN to generate an unseen future frame given a context such as recent past frames. The basic hypothesis of these approaches is that hidden states that can accurately predict future frames are a useful representation for many downstream tasks. In this paper we extend this hypothesis and aim to enrich the information encoded in the hidden states by training the model to make more accurate future predictions. We propose an auxiliary objective that serves as a regularization to improve generalization of the future frame prediction task. Experimental results on phonetic classification, speech recognition, and speech translation not only support the hypothesis, but also demonstrate the effectiveness of our approach in learning representations that contain richer phonetic content.


Introduction
Unsupervised speech representation learning, which aims to learn a function that transforms surface features, such as audio waveforms or spectrograms, to higher-level representations using only unlabeled speech, has received great attention recently (Baevski et al., 2020; Liu et al., 2020; Song et al., 2019; Jiang et al., 2019; Schneider et al., 2019; Chorowski et al., 2019; Pascual et al., 2019; Oord et al., 2018; Kamper, 2019; Chen et al., 2018; Milde and Biemann, 2018; Chung et al., 2016; Hsu et al., 2017). A large portion of these approaches leverage self-supervised training, where the learning target is generated from the input itself, and thus can train a model in a supervised manner. Chung et al. (2019) propose a method called Autoregressive Predictive Coding (APC), which trains an RNN to predict a future frame that is n steps ahead of the current position given a context such as the past frames. The training target can be easily generated by right-shifting the input by n steps. Their intuition is that the model is required to produce a good summarization of the past and encode such knowledge in the hidden states so as to accomplish the objective. After training, the RNN hidden states are taken as the learned representations, and are shown to contain speech information such as phonetic and speaker content that are useful in a variety of speech tasks (Chung and Glass, 2020).
Following their intuition, in this work we aim to improve the generalization of the future frame prediction task by adding an auxiliary objective that serves as a regularization. We empirically demonstrate the effectiveness of our approach in making more accurate future predictions, and confirm such improvement leads to a representation that contains richer phonetic content.
The rest of the paper is organized as follows. We start with a brief review of APC in Section 2. We then introduce our approach in Section 3. Experiments and analysis are presented in Section 4, followed by our conclusions in Section 5.

Autoregressive Predictive Coding
Given a context of a speech signal represented as a sequence of acoustic feature vectors $(x_1, x_2, \ldots, x_t)$, the objective of Autoregressive Predictive Coding (APC) is to use the context to infer a future frame $x_{t+n}$ that is $n$ steps ahead of $x_t$. Let $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ denote a full utterance, where $N$ is the sequence length. APC incorporates an RNN to process each frame $x_t$ sequentially and update its hidden state $h_t$ accordingly. For $t = 1, \ldots, N - n$, the RNN produces an output $y_t = W \cdot h_t$, where $W$ is an affinity matrix that maps $h_t$ back to the dimensionality of $x_t$. The model is trained by minimizing the frame-wise L1 loss between the predicted sequence $(y_1, y_2, \ldots, y_{N-n})$ and the target sequence $(x_{1+n}, x_{2+n}, \ldots, x_N)$:

$$\mathcal{L}_f(\mathbf{x}) = \sum_{t=1}^{N-n} \lvert x_{t+n} - y_t \rvert \quad (1)$$

When $n = 1$, one can view APC as an acoustic version of a neural language model (NLM) (Mikolov et al., 2010) by thinking of each acoustic frame as a token embedding, as they both use a recurrent encoder and aim to predict information about the future. A major difference between NLM and APC is that NLM infers tokens from a closed set, while APC predicts frames of real values.

[Figure 1: Overview of our method. $\mathcal{L}_f$ is the original APC objective that aims to predict $x_{t+n}$ given a context $(x_1, x_2, \ldots, x_t)$ with an autoregressive RNN. Our method first samples an anchor position, assuming it is time step $t$. Next, we build an auxiliary loss $\mathcal{L}_r$ that computes $\mathcal{L}_f$ of a past sequence $(x_{t-s}, x_{t-s+1}, \ldots, x_{t-s+\ell-1})$ (see Section 3.1 for definitions of $s$ and $\ell$), using an auxiliary RNN (dotted line area). In this example, $(n, s, \ell) = (1, 4, 3)$. In practice, we can sample multiple anchor positions, and averaging over all of them gives us the final $\mathcal{L}_r$.]
Once an APC model is trained, given an utterance $(x_1, x_2, \ldots, x_N)$, we follow Chung et al. (2019) and take the output of the last RNN layer, $(h_1, h_2, \ldots, h_N)$, as its extracted features.
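To make the objective concrete, the following PyTorch-style sketch shows one way Equation 1 could be computed. The class name `APCSketch`, the helper `apc_loss`, and the default hyperparameters are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class APCSketch(nn.Module):
    """Minimal sketch of an APC model: a unidirectional RNN whose hidden state
    at each step is projected back to the input dimensionality."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)   # the affinity matrix W

    def forward(self, x):                 # x: (batch, N, feat_dim)
        h, _ = self.rnn(x)                # h_t for every frame (last layer's output)
        return self.proj(h), h            # y_t = W . h_t, plus the representations

def apc_loss(model, x, n):
    """L_f: frame-wise L1 loss between (y_1, ..., y_{N-n}) and (x_{1+n}, ..., x_N),
    where the target is simply the input right-shifted by n steps."""
    y, _ = model(x)
    return (y[:, :-n] - x[:, n:]).abs().sum(-1).mean()
```

After training, `h` from the forward pass (the last RNN layer's output) would serve as the extracted features.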

Proposed Methodology
Our goal is to make APC's prediction of $x_{t+n}$ given $h_t$ more accurate. In Section 4 we will show that this leads to a representation that contains richer phonetic content.

Remembering more from the past
An overview of our method is depicted in Figure 1. We propose an auxiliary loss $\mathcal{L}_r$ to improve the generalization of the main objective $\mathcal{L}_f$ (Equation 1).
The idea of $\mathcal{L}_r$ is to refresh the current hidden state $h_t$ with the knowledge learned in the past. At time step $t$, we first sample a past sequence $p_t = (x_{t-s}, x_{t-s+1}, \ldots, x_{t-s+\ell-1})$, where $s$ is how far the start of this sequence is from $t$ and $\ell$ is the length of $p_t$. We then employ an auxiliary RNN, denoted as $\mathrm{RNN}_{aux}$, to perform the predictive coding defined in Equation 1 conditioning on $h_t$. Specifically, we initialize the hidden state of $\mathrm{RNN}_{aux}$ with $h_t$, and optimize it along with the corresponding $W_{aux}$ using $\mathcal{L}_f(p_t)$. For a training utterance $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, we select each frame with probability $P$ as an anchor position.
Assume we end up with $M$ anchor positions: $a_1, a_2, \ldots, a_M$. Each $a_m$ defines a sequence $p_{a_m} = (x_{a_m-s}, x_{a_m-s+1}, \ldots, x_{a_m-s+\ell-1})$ before $x_{a_m}$, which we use to compute $\mathcal{L}_f(p_{a_m})$. Averaging over all anchor positions gives the final auxiliary loss $\mathcal{L}_r$:

$$\mathcal{L}_r(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_f(p_{a_m}) \quad (2)$$

The final APC objective $\mathcal{L}_m$ combines Equations 1 and 2 with a balancing coefficient $\lambda$:

$$\mathcal{L}_m(\mathbf{x}) = \mathcal{L}_f(\mathbf{x}) + \lambda \mathcal{L}_r(\mathbf{x}) \quad (3)$$

We re-sample the anchor positions for each $\mathbf{x}$ during each training iteration, while they all share the same $\mathrm{RNN}_{aux}$ and $W_{aux}$.
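A rough sketch of how $\mathcal{L}_r$ and the combined objective $\mathcal{L}_m$ might be computed is given below, reusing `APCSketch` from the earlier sketch. We assume here that the auxiliary targets are the frames $n$ steps ahead of each position in the past sequence (drawn from the same utterance) and that $h_t$ initializes every layer of $\mathrm{RNN}_{aux}$; these details, like all names in the snippet, are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def multi_target_loss(model, rnn_aux, w_aux, x, n, s, ell, P=0.15, lam=0.1):
    """Sketch of L_m = L_f + lambda * L_r (Equations 1-3)."""
    y, h = model(x)                                        # main APC forward pass
    l_f = (y[:, :-n] - x[:, n:]).abs().sum(-1).mean()      # Equation 1

    N = x.size(1)
    # Anchors need a full past sequence behind them and valid targets ahead of it.
    hi = min(N, N - ell - n + s + 1)
    candidates = torch.arange(s, max(s, hi))
    anchors = candidates[torch.rand(len(candidates)) < P]  # each frame kept with prob. P

    terms = []
    for t in anchors.tolist():
        p = x[:, t - s : t - s + ell]                      # past sequence p_t of length ell
        target = x[:, t - s + n : t - s + ell + n]         # frames n steps ahead of p_t
        h0 = h[:, t].unsqueeze(0).repeat(rnn_aux.num_layers, 1, 1)  # init RNN_aux with h_t
        h_aux, _ = rnn_aux(p, h0)
        terms.append((w_aux(h_aux) - target).abs().sum(-1).mean())

    l_r = torch.stack(terms).mean() if terms else torch.zeros((), device=x.device)
    return l_f + lam * l_r                                 # Equation 3
```

Here `rnn_aux` would be another `nn.GRU` and `w_aux` another `nn.Linear(hidden_dim, feat_dim)`, trained jointly with the main model but discarded once features are extracted.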

Experiments
We demonstrate the effectiveness of $\mathcal{L}_r$ in helping optimize $\mathcal{L}_f$, and investigate how the improvement is reflected in the learned representations.

Setup
We follow Chung et al. (2019) and use the audio portion of the LibriSpeech (Panayotov et al., 2015) train-clean-360 subset, which contains 360 hours of read speech produced by 921 speakers, for training APC. The input features are 80-dimensional log Mel spectrograms, i.e., $x_t \in \mathbb{R}^{80}$. Both RNN and $\mathrm{RNN}_{aux}$ are 3-layer, 512-dimensional unidirectional GRU (Cho et al., 2014) networks with residual connections between two consecutive layers. Therefore, $W, W_{aux} \in \mathbb{R}^{512 \times 80}$. $\lambda$ is set to 0.1 and the sampling probability $P$ is set to 0.15, that is, each frame has a 15% chance of being selected as an anchor position. $P$ and $\lambda$ are selected based on the validation loss of $\mathcal{L}_f$ on a small data split. All models are trained for 100 epochs using Adam (Kingma and Ba, 2015) with a batch size of 32 and a learning rate of $10^{-3}$.
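As an illustration of the encoder described above, a residual GRU stack could look like the following; the residual wiring (skipping only where input and output sizes match) and the class name are our assumptions, not code from the paper.

```python
import torch.nn as nn

class ResidualGRUEncoder(nn.Module):
    """3 unidirectional GRU layers, 512 units each, with a residual connection
    between consecutive layers whenever input and output dimensions match."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        dims = [feat_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.GRU(dims[i], dims[i + 1], batch_first=True) for i in range(num_layers)
        )

    def forward(self, x):                      # x: (batch, N, 80) log Mel frames
        h = x
        for gru in self.layers:
            out, _ = gru(h)
            h = out + h if out.size(-1) == h.size(-1) else out
        return h                               # last layer's outputs: the representations
```

With $\lambda = 0.1$ and $P = 0.15$ plugged into the earlier `multi_target_loss` sketch, training would then proceed with `torch.optim.Adam(params, lr=1e-3)` for 100 epochs, matching the setup above.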

Effect of $\mathcal{L}_r$
We first validate whether augmenting $\mathcal{L}_r$ improves $\mathcal{L}_f$. As a recap, $n$ is the number of time steps ahead of the current position $t$ in $\mathcal{L}_f$, and $s$ and $\ell$ denote the start and length, respectively, of the past sequence before $t$ used to build $\mathcal{L}_r$. We consider $(n, s, \ell) \in \{1, 3, 5, 7, 9\} \times \{7, 14, 20\} \times \{3, 7\}$. Note that each phone has an average duration of about 7 frames. Figures 2a and 2b present $\mathcal{L}_r$ (before multiplying by $\lambda$) and $\mathcal{L}_f$, respectively, of the considered APC variants on the LibriSpeech dev-clean subset. Each bar of the same color represents one $(s, \ell)$ combination. We use $(-, -)$ to denote an APC optimized only with $\mathcal{L}_f$. Bars are grouped by their $n$'s, with different $(s, \ell)$ combinations within each group.
We start by analyzing Figure 2a. Note that $\mathcal{L}_r$ does not exist for $(-, -)$ and is set to 0 in the figure. We see that under the same $n$, the value of $\mathcal{L}_r$ is mainly decided by how far ($s$) the past sequence is from the current position rather than by its length ($\ell$): when we keep $\ell$ fixed and increase $s$ from 7 (red) and 14 (green) to 20 (blue), the loss increases accordingly. From Figure 2b, we have the following findings.
For a small $n$, the improvement in $\mathcal{L}_f$ brought by $\mathcal{L}_r$ is minor. By comparing $(-, -)$ with the other bars, we see that when $n \leq 3$, which is smaller than half of the average phone duration (7 frames), adding $\mathcal{L}_r$ does not lower $\mathcal{L}_f$ by much. We speculate that when $n \leq 3$, the target $x_{t+n}$ is usually within the same phone as $x_t$, making the task not challenging enough to force the model to leverage more past information.
$\mathcal{L}_r$ becomes useful when $n$ gets larger. When $n$ is close to or exceeds the average phone duration ($n \geq 5$), we observe an evident reduction in $\mathcal{L}_f$ after adding $\mathcal{L}_r$, which validates the effectiveness of $\mathcal{L}_r$ in assisting with the optimization of $\mathcal{L}_f$. When $n = 9$, the improvement is not as large as when $n = 5$ or $7$. One possible explanation is that $x_{t+9}$ has become almost independent of the previous context $h_t$ and hence is less predictable. By observing the validation loss, we have shown that $\mathcal{L}_r$ indeed helps generalize $\mathcal{L}_f$.

Learned representation analysis
Next, we want to examine whether an improvement in $\mathcal{L}_f$ leads to a representation that encodes more useful information. Speech signals encompass a rich set of acoustic and linguistic properties. Here we focus on analyzing the phonetic content contained in a representation, and leave other properties, such as speaker, for future work. We use phonetic classification on TIMIT (Garofolo et al., 1993) as the probing task to analyze the learned representations. The corpus contains 3696, 400, and 192 utterances in the train, validation, and test sets, respectively. For each $n \in \{1, 3, 5, 7, 9\}$, we pick the $(s, \ell)$ combination that has the lowest validation loss. As described in Section 2, we take the output of the last RNN layer as the extracted features, and provide them to a linear logistic regression classifier that aims to correctly classify each frame into one of the 48 phone categories. During evaluation, we follow the common protocol (Lee and Hon, 1989) and collapse the predictions into 39 categories. We report the frame error rate (FER) on the test set, which indicates how much phonetic content is contained in the representations.

We also conduct experiments on predicting the phone identities of $x_{t-w}$ and $x_{t+w}$ given the feature at time $t$, for $w \in \{5, 10, 15\}$. This examines how contextualized $h_t$ is, that is, how much information about the past and future is encoded in the current feature $h_t$. We simply shift the labels in the dataset by $\{\pm 5, \pm 10, \pm 15\}$ and retrain the classifier. We keep the pre-trained APC RNN fixed for all runs. Results are shown in Table 1.
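For concreteness, the frame-level linear probe and the label-shifting trick could be sketched as follows; the probe's own training details (e.g., its learning rate) are not specified in the paper, so the values and names below are placeholders.

```python
import torch
import torch.nn as nn

def shift_pairs(feats, labels, w):
    """Pair the frozen APC feature h_t with the phone label at t + w.
    w = 0 is standard phonetic classification; w < 0 probes the past, w > 0 the future."""
    if w == 0:
        return feats, labels
    if w > 0:
        return feats[:-w], labels[w:]
    return feats[-w:], labels[:w]

# A linear logistic-regression probe over 48 phone classes (collapsed to 39 at evaluation).
probe = nn.Linear(512, 48)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)   # placeholder probe settings

def probe_step(feats, labels, w):
    """One training step; `feats` are pre-extracted, frozen APC features (num_frames, 512)."""
    f, y = shift_pairs(feats, labels, w)
    loss = nn.functional.cross_entropy(probe(f), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```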
We emphasize that our hyperparameters are chosen based on $\mathcal{L}_f$ and are never selected based on their performance on any downstream task, including the phonetic classification, speech recognition, and speech translation tasks presented next. Tuning hyperparameters towards a downstream task defeats the purpose of unsupervised learning.
Phonetic classification We first study the standard phonetic classification results, shown in the column where the time shift is 0. We see that APC features, regardless of the objective ($\mathcal{L}_f$ or $\mathcal{L}_m$), achieve lower FER than log Mel features, showing that the phonetic information contained in the surface features has been transformed into a more accessible form, i.e., one that is more linearly separable. Additionally, we see that APC features learned by $\mathcal{L}_m$ outperform those learned by $\mathcal{L}_f$ across all $n$. For $n \geq 5$, where there is a noticeable improvement in future prediction after adding $\mathcal{L}_r$ (Figure 2b), the improvement in phonetic classification is also larger than when $n \leq 3$. Such an outcome suggests that APC models that are better at predicting the future do learn representations that contain richer phonetic content. It is also interesting that with $\mathcal{L}_f$ the best result occurs at $n = 5$ (31.9), while with $\mathcal{L}_m$ it is $n = 7$ that achieves the lowest FER (27.8).
Predicting the past or future We see that, from a log Mel frame, it is harder to predict the phone identities of nearby frames than that of the frame itself, and the FER gets higher the further the target is from the center frame. An APC feature $h_t$ contains more information about its past than its future. This matches our intuition: the RNN generates $h_t$ conditioning on $h_i$ for $i < t$, and thus their information is naturally encoded in $h_t$. Furthermore, we observe a consistent improvement in both directions by changing $\mathcal{L}_f$ to $\mathcal{L}_m$, across all $n$ and time shifts. This confirms the benefit of $\mathcal{L}_r$, which requires the current hidden state $h_t$ to recall what has been learned in previous hidden states, so that more information about the past is encoded in $h_t$. The improvement also suggests that an RNN can forget past information when trained only with $\mathcal{L}_f$, and that adding $\mathcal{L}_r$ alleviates this problem.

Speech recognition and translation
The above phonetic classification experiments are meant to analyze the phonetic properties of a representation. Finally, we apply the representations learned by $\mathcal{L}_m$ to automatic speech recognition (ASR) and speech translation (ST) and show their superiority over those learned by $\mathcal{L}_f$.
We follow the exact setup of Chung and Glass (2020). For ASR, we use the Wall Street Journal corpus (Paul and Baker, 1992), use si284 for training, and report the word error rate (WER) on dev93. For ST, we use the LibriSpeech En-Fr corpus (Kocabiyikoglu et al., 2018), where the goal is to translate English speech into French text, and report the BLEU score (Papineni et al., 2002). For both tasks, the downstream model is an end-to-end, sequence-to-sequence RNN with attention (Chorowski et al., 2015); we compare different input features to the same model. Results, shown in Table 2, demonstrate that the improvement in predictive coding brought by $\mathcal{L}_r$ not only yields representations that contain richer phonetic content, but also ones that are more useful in real-world speech applications. [1]

[1] According to Chung and Glass (2020), when using a Transformer architecture (Vaswani et al., 2017; Liu et al., 2018) as the autoregressive model, representations learned with $\mathcal{L}_f$ can achieve a WER of 13.7 on ASR and a BLEU score of 14.3 on ST.

Conclusions
We improve the generalization of Autoregressive Predictive Coding by multi-target training of future prediction $\mathcal{L}_f$ and past memory reconstruction $\mathcal{L}_r$, where the latter serves as a regularization. Through phonetic classification, we find that the representations learned with our approach contain richer phonetic content than the original representations, and they achieve better performance on speech recognition and speech translation.