Adapting End-to-End Speech Recognition for Readable Subtitles

Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that with an amount of data far less than what is needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.


Introduction
Automatic speech recognition (ASR) has become ubiquitous in human interaction with digital devices, such as keyboard voice input (He et al., 2019) and virtual home assistants (Li et al., 2017). While transcription accuracy is often the primary goal when designing ASR systems, in some use cases the readability of the output is crucial to the user experience. A prominent example is subtitling for TV. In this case, the audience needs to multitask, i.e. simultaneously watch the video content, listen to the speech utterances, and read the subtitles. To avoid a visual overload, not every spoken word needs to be displayed. Meanwhile, the shortened subtitles must still retain the meaning of the spoken content. Moreover, large deviations from the original utterance are also undesirable, as the disagreement with the auditory input would create a distraction.
To compress the subtitles, one straightforward approach is to post-process ASR transcriptions. The task of sentence compression has been well-studied (Knight and Marcu, 2002; Clarke and Lapata, 2006; Rush et al., 2015). In extractive compression (Filippova et al., 2015; Angerbauer et al., 2019), only deletion operations are performed on the input. Despite its simplicity, this approach tends to produce outputs that are less grammatical (Knight and Marcu, 2002). On the other hand, abstractive compression (Cohn and Lapata, 2008; Rush et al., 2015; Chopra et al., 2016; Yu et al., 2018) involves more sophisticated input reformulation, such as word reordering and paraphrasing (Clarke and Lapata, 2006). For the task of compressing subtitles, however, the extent of rewriting must be controlled in order to retain consistency with the spoken utterances.
From a practical point of view, building a sentence compression system typically requires training corpora where the target sequences are summarized. For most languages and domains, there is scarcely any resource suitable for supervised training. This low-resource condition is even more severe for audio inputs. To the best of our knowledge, there are currently no publicly available spoken language compression corpora.
Given the challenges outlined above, this work investigates ASR with output compression. We test our approaches on German TV subtitles. The combination of this task and use case is, to the best of our knowledge, previously unexplored.
The first contribution of this work is a comparison of cascaded and end-to-end approaches to generating compressed ASR transcriptions, where the former consists of separate ASR and compression modules, and the latter integrates transcription and compression. The experiments show that our sentence compression module trained in an unsupervised fashion tends to paraphrase excessively, whereas the end-to-end model can be better adapted to the task of interest. Secondly, we show that, after fine-tuning on a small adaptation corpus, an ASR model can perform transcription and compression simultaneously. Without being given explicit length constraints, the adapted model shows increased recognition accuracy on rare words as well as paraphrasing capabilities to produce shorter outputs. Furthermore, by explicitly encoding the length constraints, we achieve further performance gains in addition to those brought by adaptation.

Task
The task of creating readable subtitles for video content has several unique properties. First, due to limited screen size and reading time, not every spoken word needs to be transcribed, especially when utterances are spoken fast. A full transcription could even hamper the user experience due to poor readability. Second, although output shortening is typically realized by deleting non-essential words, the output is not only deletion-based. A real-life example from the German TV program Tagesschau 1 shown in Table 1 contains rephrasing (from "freed from" to "without") in addition to word removal (dropping the word "ethically"). A further requirement is that the subtitles should stay reasonably authentic to the spoken content, modifying it only when necessary. Otherwise, the disagreement with the audio could become distracting to the audience.
Within the framework of common NLP tasks, the task of generating readable subtitles combines ASR and abstractive compression, while being subject to the additional requirements outlined above. For the baseline ASR model, we use the Transformer architecture (Vaswani et al., 2017) similar to that of Pham et al. (2019). As there is no spoken language compression corpus available to us that is large enough for training an end-to-end model from scratch, we first train an ASR model without output compression, and then adapt it to our task of interest using a small web-scraped corpus. In the first training stage, the model is solely trained to transcribe speech. In the fine-tuning stage, we let the model continue training at a reduced learning rate on the adaptation corpus with shortened transcriptions. The intended goal of adaptation is to let the model learn the compression task on the basis of the transcription capability acquired before.

End-to-End Length-Constrained ASR
With the baseline introduced above, the ASR model is not aware of the compression task until the adaptation step. If the model already had a sense of output length constraints earlier, i.e. when training for the ASR task, it could better utilize the abundance of training data. Motivated by this hypothesis, we inject information about the allowable output length using a count-down at each decoding step, as illustrated in Figure 1. In a vanilla decoder, the hidden state at position i would ingest an embedding of the previously generated token. With the count-down for target length t, decoder state y_i additionally ingests a representation of the number of allowed output tokens, t − i. We explore two ways to represent the length count-down. The first utilizes length embeddings learned during training, motivated by the approach proposed by Kikuchi et al. (2016). Given a target sequence of length t, at decoding time step i, the input to decoder hidden state y_i is based on a concatenation of the previous state y_{i−1} and an embedding of the remaining length:

[y_{i−1}; emb(t − i)],    (1)

where emb(t − i) is an embedding of the number of allowed tokens. To keep the same dimensionality as that of the original word embedding, the output of Equation (1) further undergoes a linear transformation followed by the ReLU activation.
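As a toy sketch of this count-down input (randomly initialized, untrained parameters; the dimensions and tables are illustrative, not the actual model's), the concatenation and projection could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8               # embedding dimension (illustrative)
VOCAB = 100         # vocabulary size (illustrative)
MAX_REMAINING = 50  # largest remaining-length value with its own embedding

# Randomly initialized here; in the real model these are learned.
word_emb = rng.normal(size=(VOCAB, D))
len_emb = rng.normal(size=(MAX_REMAINING + 1, D))
W = rng.normal(size=(2 * D, D))  # projects the concatenation back to D
b = np.zeros(D)

def decoder_input(prev_token, remaining):
    """Input to decoder state y_i: the previous token embedding concatenated
    with an embedding of the remaining length t - i, then a linear layer
    followed by ReLU to restore the original dimensionality."""
    x = np.concatenate([word_emb[prev_token], len_emb[remaining]])
    return np.maximum(x @ W + b, 0.0)

# At decoding step i of a target of length t, feed decoder_input(y_{i-1}, t - i).
vec = decoder_input(prev_token=42, remaining=7)
print(vec.shape)  # (8,)
```

Note that the remaining-length table has a fixed size, which is the root of the extrapolation problem discussed next.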
With the length embedding approach, the model learns representations of different length values during training. Therefore, learning to represent rarely-encountered lengths may be difficult.
The second method modifies the trigonometric positional encoding from the Transformer (Vaswani et al., 2017) to represent the remaining length rather than the current position. This method has been applied in summarization (Takase and Okazaki, 2019) and machine translation (Lakew et al., 2019; Niehues, 2020) to limit output lengths. Motivated by these examples from related sequence generation tasks, we explore the "backward" positional encoding in ASR models.
With the original positional encoding, for input dimension d ∈ {0, 1, . . . , D − 1}, the encoding at position i is defined as:

PE(i, d) = sin(i / 10000^(d/D)) if d is even; PE(i, d) = cos(i / 10000^((d−1)/D)) if d is odd.    (2)

The backward positional encoding is the same as Equation (2), except that the current position i is replaced by the remaining length t − i. Given a target sequence of length t, the length encoding at decoding step i becomes:

LE(i, d) = PE(t − i, d).    (3)

Like the positional encoding, the length encoding is summed with the input embedding to the decoder hidden states. Moreover, since the encoding is based on sinusoids, it can easily be extrapolated to lengths unseen during training. This is a potential advantage over the learned length embeddings.
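A minimal numpy sketch of this backward encoding, following the standard Transformer sinusoid (sin on even dimensions, cos on odd dimensions):

```python
import numpy as np

def sinusoid(pos, D):
    """Trigonometric encoding of a scalar position, as in Vaswani et al. (2017)."""
    d = np.arange(D)
    angle = pos / np.power(10000.0, (2 * (d // 2)) / D)
    return np.where(d % 2 == 0, np.sin(angle), np.cos(angle))

def length_encoding(t, i, D):
    """Backward positional encoding: encode the remaining length t - i
    instead of the current position i (summed with the input embedding)."""
    return sinusoid(t - i, D)

# Because the encoding is a fixed sinusoid rather than a learned table,
# it extrapolates to remaining lengths never seen during training.
enc = length_encoding(t=20, i=3, D=8)  # same vector as sinusoid(17, 8)
```

The parameter-free nature of the sinusoid is what allows arbitrary count-down values at test time.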

Unsupervised Sentence Compression
Compared to training ASR models to jointly perform transcription and compression, a more straightforward approach is to post-edit ASR outputs using a compression model. However, training such a model in a supervised fashion requires reliable target sequences. Due to the scarcity of suitable training corpora, we choose an unsupervised approach inspired by multilingual translation (Ha et al., 2017;Johnson et al., 2017).
Similar to Niehues (2020), this approach relies on a multilingual translation system that is trained on several language pairs. At training time, language tokens are embedded together with the source and target sentences. At test time, the model is given the same source and target language token, a translation direction unseen in training. Since multilingual training enables zero-shot translation, the model is able to reformulate the input in the same language. To achieve output compression, the length constraints introduced in Section 3.2 are applied in the decoder.

Datasets

Table 2 provides an overview of the audio corpora we use. The baseline ASR model is trained on the German part of LibriVoxDeEn (Beilharz et al., 2020), a recently released corpus of open-domain German audio books. Since the corpus creators did not suggest a train-dev-test partition, we split the dataset ourselves. The test set contains the following books: Jonathan Frock 2 , Jolanthes Hochzeit and Kammmacher.

For the spoken language compression adaptation corpus, we collect spoken utterances and subtitles from the German news program Tagesschau from 1 January to 15 August 2019. To control for recording conditions and disfluencies, we exclude interviews and press conferences and only keep utterances from the news anchors. The utterances are segmented based on the start and end times of the subtitles. Since the timestamps do not always precisely correspond to utterance boundaries, we manually verify the test set and edit it where necessary.
For the unsupervised compression system, we use the multilingual translation corpus from the IWSLT 2017 evaluation campaign (Cettolo et al., 2017). It consists of English, German, Dutch, Italian and Romanian parallel sentences based on TED talks. All 10 × 2 translation directions are used in training. At test time, the model is given the same source and target language tag (German in our case) in order to generate a compressed version in the same language. We use the length encoding introduced in Section 3.2 for length control.
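The language-tagging scheme can be illustrated as follows (the tag format is a sketch; the actual tokens depend on the toolkit):

```python
def tag_pair(src, tgt, src_lang, tgt_lang):
    """Prefix source and target sentences with language tokens,
    as in multilingual NMT (Ha et al., 2017; Johnson et al., 2017)."""
    return f"<{src_lang}> {src}", f"<{tgt_lang}> {tgt}"

# Training: genuine translation directions, e.g. German -> English.
train_src, train_tgt = tag_pair("guten Morgen", "good morning", "de", "en")

# Test time: identical source and target tags (de -> de), a direction unseen
# in training. Zero-shot translation then reformulates the input in German,
# and the decoder-side length constraints push the output to be shorter.
test_src, _ = tag_pair("guten Morgen allerseits", "", "de", "de")
print(test_src)  # <de> guten Morgen allerseits
```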

Preprocessing
We use the Kaldi toolkit (Povey et al., 2011) to preprocess the raw audio utterances into 23-dimensional filter banks. We choose not to apply any utterance-level normalization to allow for future work towards online processing. For the text materials, i.e. audio transcriptions and the translation source and target sentences, we use byte-pair encoding (BPE) (Sennrich et al., 2016) to create subword-based dictionaries.

Hyperparameters
For the ASR model, we adopt many of the values reported in the work of Pham et al. (2019), including the optimizer choice, learning rate, warmup steps, dropout rate, label smoothing rate, and embedding dimension. There are several parameters that we choose differently. The size of the inner feed-forward layer is 2048. Moreover, we use 32 encoder and 12 decoder layers, and a BPE size of 10,000. For the compression model, we use a Transformer with 8 encoder and 8 decoder layers. 3
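For quick reference, the deviating hyperparameters can be summarized as follows (the field names are illustrative, not the toolkit's actual option names):

```python
# Hyperparameters stated in the text; remaining values follow Pham et al. (2019).
asr_config = {
    "ffn_inner_size": 2048,  # inner feed-forward layer size
    "encoder_layers": 32,
    "decoder_layers": 12,
    "bpe_size": 10_000,
}
compression_config = {
    "encoder_layers": 8,
    "decoder_layers": 8,
}
```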

Post-Editing with Compression Model
To gain an initial understanding of the task, we start with a more controlled setup, where the test utterances are transcribed by a commercial off-the-shelf ASR system. The transcriptions are then post-processed with our compression model. First, we analyze the level of desired output compression by contrasting the lengths of the off-the-shelf transcriptions against those of the references.

3 The code is available at https://github.com/quanpn90/NMTGMinor/tree/DbMajor.

Table 3: Two examples of various levels of compression, where the first shortens from 17 to 6 words, and the second only removes one word. (Table contents not shown here.)
In Figure 2, we plot the distribution of the ratio between transcription lengths and target lengths over the test set. The first observation is that most of the transcriptions require shortening, as shown by the high frequencies of ratios over 1.
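The ratio analysis behind Figure 2 amounts to a simple computation over paired transcriptions and reference subtitles. A sketch using word-level lengths (the paper's analysis may instead count BPE tokens):

```python
def length_ratios(transcriptions, references):
    """Transcription-to-reference length ratio per utterance, in words.
    Ratios above 1 indicate subtitles that require shortening."""
    return [len(t.split()) / len(r.split())
            for t, r in zip(transcriptions, references)]

# Illustrative sentence pair: a 12-word transcription vs. a 9-word subtitle.
ratios = length_ratios(
    ["es ist kurz nach Mitternacht als ein Auto in eine Gruppe steuert"],
    ["kurz nach Mitternacht steuert ein Auto in eine Gruppe"],
)
print(ratios)  # [1.333...] -> this utterance needs shortening
```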
Moreover, the compression ratio varies across different utterances. Table 3 shows two examples, where the first is compressed from 17 to 6 words, while the second only deletes one word. When inspecting the original videos, we notice that the first example contains much other visual content, some also in text form, whereas the second example only involves the news anchorwoman speaking. The examples show that the level of desired compression depends on various factors, such as the amount of visual information simultaneously shown on the screen. Therefore, a globally fixed compression rate would not be suitable.

Table 4: Output quality before and after compression (columns: Model | Ratio of output to desired length | WER | R-1 | R-2 | R-L; numeric results not recoverable here).

Table 5: Examples of paraphrasing by the unsupervised compression model (English glosses in parentheses):

Example 1
Ground-truth: Es ist kurz nach Mitternacht, als plötzlich ein Auto in eine Gruppe von Menschen steuert, die ausgelassen ins neue Jahr feiern. (It is just after midnight, when a car suddenly drives into a group of people who joyfully celebrate the new year.)
Reference: Kurz nach Mitternacht steuert ein Auto in eine Gruppe von Menschen, die ins neue Jahr feiern. (Just after midnight a car drives into a group of people who celebrate the new year.)
Unsupervised compression: Kurz nach Mitternacht fährt ein Auto plötzlich in eine Gruppe von Leuten, die das nächste Jahr feiern. (Just after midnight a car suddenly drives into a group of people who celebrate the next year.)

Example 2
Ground-truth: Unter dem Eindruck der Massenproteste hatten sich zuletzt auch hochrangige Militärs von ihm abgewandt. (Under the impression of mass protests, senior military officials have finally also turned away from him.)
Reference: Unter dem Eindruck der Proteste wandten sich zuletzt auch hochrangige Militärs ab. (Under the impression of protests, senior military officials finally also turned away.)
Unsupervised compression: Unter dem Eindruck von Massenprotestieren waren auch hochrangige Militärs von ihm entfernt. (Under the impression of mass protesting, senior military officials were also distanced from him.)
To comply with length constraints, the ASR outputs are shortened by the unsupervised sentence compression model. As the system is trained based on subwords, we use the number of BPE-tokens in the reference as target length. The first two rows in Table 4 contrast the output quality before and after compression, as measured in case-insensitive word error rate (WER), and ROUGE scores (Lin, 2004). 4 To our surprise, compression has a large negative impact on the outputs in all four metrics, creating a gap of over 20% absolute. Via an exhaustive manual inspection over the test set, we find that the unsupervised compression model tends to paraphrase much more frequently than the references. While the paraphrased output is often valid both grammatically and semantically, the deviation from the references leads to higher WER and lower ROUGE scores. Two examples are given in Table 5, where several synonym replacements appear in the compression outputs, e.g. "fährt" for "steuert" (both "drives"), "Leute" for "Menschen" (both "people"), "nächste Jahr" for "neue Jahr" ("next year" for "new year"). In all these places, the references keep the original spoken words unchanged. Given the nature of our task, it is indeed undesirable to paraphrase excessively, as subtitles that are too different from the original spoken utterances could create a cognitive overload to users.
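For clarity on the main accuracy metric, a minimal case-insensitive WER (word-level Levenshtein distance divided by reference length) can be sketched as:

```python
def wer(reference, hypothesis):
    """Case-insensitive word error rate: word-level edit distance
    (substitutions + insertions + deletions) over the reference length."""
    r = reference.lower().split()
    h = hypothesis.lower().split()
    # One-row dynamic program over the edit-distance table.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cost = 0 if rw == hw else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(r)

# One substituted word out of three: the synonym "fährt" for "steuert"
# counts as a full error, which is why paraphrasing inflates WER.
print(wer("kurz nach Mitternacht steuert", "kurz nach Mitternacht fährt"))
```

This also illustrates why valid paraphrases are penalized: WER has no notion of synonymy.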
Considering these downsides, we train an ASR system from scratch to allow more flexibility for structural modification. On our self-partitioned LibriVoxDeEn test set, we achieve a WER of 9.2%. The compression performance is reported in the lower section of Table 4. As this model is only trained on LibriVoxDeEn audio books, its performance suffers from the train-test domain mismatch. This is exhibited by the gap of nearly 10% to the off-the-shelf system, which is trained on a larger volume of data from various domains. Moreover, the transcription errors of the ASR system carry over as input to the compression model, which further degrades the final output quality. Lastly, similar to the previous observations, post-processing by the compression model has a negative effect in terms of the evaluation metrics.

Table 6: Performance before and after fine-tuning on the adaptation corpus (columns: Model | Ratio of output to desired length | WER | R-1 | R-2 | R-L; numeric results not recoverable here).

Table 7: Example outputs before and after adaptation (English glosses in parentheses):

Reference: In Brasilien ist Präsident Bolsonaro vereidigt worden. (In Brazil new president Bolsonaro has been inaugurated.)
Before adaptation: In Brasilien ist der neue Präsident voll zu Narro vereidigt worden. (In Brazil the new president "voll zu Narro" has been inaugurated.)
After adaptation: In Brasilien wurde der neue Präsident Bolsonaro vereidigt. (In Brazil the new president Bolsonaro was inaugurated.)
Overall, the performance of the unsupervised compression model suffers from paraphrasing. As an anonymous reviewer suggested, the aggressive paraphrasing could be remedied during decoding. For example, given a large beam size, we could select candidates that contain less paraphrasing. Alternatively, training with a paraphrasing penalty could also alleviate the problem. While we have not explored these methods here, they would indeed provide a more complete picture when comparing the cascaded and end-to-end approaches.

Fine-Tuning ASR Model for Compression
Our baseline ASR model is trained on a different domain than the test set, and only for the transcription task. To improve performance on our task of interest, we fine-tune on the adaptation corpus. The results are shown in Table 6. For easy visual comparison, the pre-adaptation performance in the first row is repeated from Table 4. Contrasting the performance before and after adaptation, we see noticeable gains brought by adaptation in terms of all four quality metrics. Moreover, as evidenced by the reduced ratio of output to desired length, the model already performs shortening, despite not having received explicit length constraints. Table 7 shows an example where the adapted model changes the verb tense to reduce output length. Specifically, by using the simple past ("wurde vereidigt" / "was inaugurated") instead of the present perfect ("ist vereidigt worden" / "has been inaugurated"), the output becomes shorter. Meanwhile, we also observe the correct transcription of the proper noun "Bolsonaro", which was mistakenly transcribed as the phonetically similar "voll zu Narro" before adaptation. This illustrates that the adaptation step enables the model to improve recognition quality and compress its outputs simultaneously.
Despite these positive observations, the desired length constraints are not yet fully satisfied, as shown by the ratio of 1.05 between output and desired lengths. To examine the scenario of fully obeying the length constraints, we stop decoding once the number of allowed tokens runs out. The result is reported in the last row of Table 6. The ratio of 0.96 is lower than 1 because of a few instances where decoding stops before the count-down reaches zero. As the same output sequence can be constructed from different BPE-units, choosing longer subword units early on can lead to reaching the end-of-sequence token before depleting the number of allowed tokens. Lastly, from the quality metrics in the last row of Table 6, we see that the forced termination of decoding comes with higher WER and lower ROUGE scores, indicating reduced output quality when fully satisfying the target length constraints.

Table 8: Performance of models with explicit length constraints (columns: Model | Ratio of output to desired length | WER | R-1 | R-2 | R-L; rows (1)-(6); numeric results not recoverable here).
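The forced-termination behaviour can be sketched as a decoding loop with a token budget (the `step_fn` interface is hypothetical, standing in for one decoder step):

```python
def decode_with_budget(step_fn, budget, eos=-1):
    """Greedy decoding under a hard length constraint: each step receives
    the count-down (remaining tokens) and decoding is force-stopped once
    the budget is spent, even if EOS was not produced."""
    output, prev = [], None
    for i in range(budget):
        token = step_fn(prev, budget - i)  # count-down fed to the decoder
        if token == eos:                   # model may also stop early
            break                          # (hence ratios below 1)
        output.append(token)
        prev = token
    return output

# Dummy decoder that would emit tokens 5, 6, 7 and then EOS;
# with a budget of 2 the output is truncated to [5, 6].
script = iter([5, 6, 7, -1])
print(decode_with_budget(lambda prev, remaining: next(script), budget=2))
```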

Models with Explicit Length Constraints
While the baseline ASR model achieves some degree of compression after adaptation, it cannot fully comply with the length constraints. Therefore, the following experiments examine the effects of training with explicit length count-downs. In Table 8, we report the performance of the ASR models with length embedding or length encoding, as introduced in Section 3.2. For a complete comparison, the first two rows of Table 8 also include the baseline performance with forced termination of decoding. Rows (3) and (5) show the performance of the two length count-down methods before adaptation. As the ratios of output to desired length are equal to 1, the models are always faithful to the given length constraints. This shows the effectiveness of injecting the allowable length during training. However, we also observe that there is no quality gain over the baseline in row (1). To investigate the reason, we experimented with decoding one sample sequence under different allowable lengths. As we gradually reduce the target length, the models first shorten their outputs by removing punctuation marks. Afterwards, instead of shortening the outputs by summarization, they stop decoding when the allowed number of tokens runs out. Indeed, during training, the models are only incentivized to accurately transcribe the spoken utterance and to stop decoding when the count-down reaches zero. The same behavior therefore carries over to test time. Contrary to abstractive summarization (Takase and Okazaki, 2019) and machine translation (Lakew et al., 2019), in ASR an input sequence has one single ground-truth transcription rather than multiple viable outputs. This could lead to a different level of abstraction than required in summarization or translation models. These observations with the unadapted models also highlight the importance of the subsequent fine-tuning step.
The results after adaptation are reported in rows (4) and (6). The large improvement in WER and ROUGE scores is in line with the previous finding when adapting the baseline. When decoding with the length count-down, we define the target output length as the minimum of the baseline output length and the reference length. This ensures the same output-to-desired-length ratio as the baseline in row (1). Here, the first observation is that the adapted length encoding model in row (6) outperforms the baseline in three of the four evaluation metrics. This suggests that it is beneficial to represent the constraints explicitly during training. Moreover, the length encoding model also consistently outperforms its embedding-based counterpart. This could be because the length encoding can extrapolate to arbitrary length values, and is equipped with a sense of the relative differences between numerical values at initialization. The length embedding, on the other hand, needs to learn the representations of different lengths during training. When inspecting the outputs for long utterances, we found that the embedding model is more likely to stop abruptly, as in the example shown in Table 9.
Related Work

Length-Controlled Text Generation
Controlling output length of natural language generation systems has been studied for several tasks.
For abstractive summarization, Kikuchi et al. (2016) proposed two methods to incorporate length constraints into LSTM-based encoder-decoder models. The first method uses a length embedding at every decoding step, while the second adds the desired length to the first decoder state. For convolutional models, Fan et al. (2018) used special tokens to represent quantized length ranges, and provided the desired token to the decoder before output generation. Liu et al. (2018) adopted a more general approach, where the decoder directly ingests the desired length. More recently, Takase and Okazaki (2019) modified the positional encoding of the Transformer (Vaswani et al., 2017) to encode allowable lengths. Makino et al. (2019) proposed a loss function that encourages summaries within the desired lengths. Saito et al. (2020) introduced a model that controls both output length and informativeness.
For machine translation, Lakew et al. (2019) used both the length range token and reverse length encoding. Niehues (2020) used the length embedding, encoding, as well as a combination of the original positional encoding and length count-down.

Sentence Compression
Our task of length-controlled ASR output is related to sentence compression, as the transcriptions can be compressed in post-processing. An early approach to supervised extractive sentence compression is by Filippova et al. (2015), who proposed to predict a delete-or-keep choice for each output symbol. Angerbauer et al. (2019) extended this approach by integrating the desired compression ratio into the prediction label. Yu et al. (2018) proposed to combine the merits of extractive and abstractive approaches by first deleting non-essential words and then generating new words. For unsupervised compression, Fevry and Phang (2018) trained a denoising auto-encoder to reconstruct original sentences, thereby circumventing the need for supervised corpora.

Conclusion
In this work, we explored the task of compressing ASR outputs to enhance subtitle readability. This task has several unique properties. First, the compression is not solely deletion-based. Moreover, unnecessary paraphrasing must be limited to maintain a consistent user experience between hearing and reading.
We first investigated cascading an ASR module with a sentence compression model. Due to the absence of supervised corpora, the compression model is trained in an unsupervised fashion. Experiments showed that the outputs generated this way do not suit our task requirements because of unnecessary paraphrasing. We then adapted an end-to-end ASR model on a small corpus with compressed transcriptions. Via adaptation, the model learned to both shorten its outputs and improve transcription quality. Nevertheless, the given length constraints were not fully satisfied. Lastly, by explicitly injecting length constraints via a reverse positional encoding, we achieved further performance gains while completely adhering to the length constraints.
A direction for future work is to incorporate more diverse measurements of output length as well as complexity. In this work, we measured length by the number of BPE-tokens. While this typically corresponds to the output length perceived visually, a more direct metric would be the number of characters. Moreover, output complexity, such as the proportion of long words, is also important for readability and therefore worth exploring. In a broader scope, as an anonymous reviewer suggested, a way to alleviate the resource scarcity for end-to-end ASR compression is to augment the training data with synthesized utterances from summarization corpora. We expect the augmentation to be complementary to our approaches in this work.