SRPOL’s System for the IWSLT 2020 End-to-End Speech Translation Task

We took part in the offline End-to-End English-to-German TED lectures translation task. We based our solution on our last year's submission. We used a slightly altered Transformer architecture with a ResNet-like convolutional layer preparing the audio input for the Transformer encoder. To improve the model's translation quality we introduced two regularization techniques and trained on a machine-translated Librispeech corpus in addition to the iwslt-corpus, TEDLIUM2 and MuST-C corpora. Our best model scored almost 3 BLEU higher than last year's model. To segment the 2020 test set we used exactly the same procedure as last year.


Introduction
This paper describes the submission to IWSLT 2020 End-to-End Speech Translation task by Samsung R&D Institute, Poland.
We propose a few improvements to our previous system. Introducing additional training data gave us a 0.7 BLEU improvement. Spectrogram augmentation techniques increased quality by 0.2 BLEU. Increasing the encoder depth to 12 layers improves BLEU by 1 point, and by 1.8 points when combined with the additional training data and the two spectrogram augmentation techniques. Replacing simpler convolutions with ResNet-like convolutional layers gave around 0.5 BLEU improvement. Combining all of these and increasing the embedding size to 512 resulted in an almost 3 BLEU improvement compared to our last year's model. The document is structured as follows. First, we describe data preparation and augmentation. Then we provide the system specification and the training procedure used in our experiments. We describe the segmentation algorithm used to segment the TED 2019 and TED 2020 test sets. We show the results of our experiments. Finally, we draw conclusions.

Training Data
To train our system we used only the audio corpora permissible for IWSLT 2020: iwslt-corpus, TEDLIUM2 (Rousseau et al., 2014), the MuST-C corpus (Di Gangi et al., 2019) and a machine-translated Librispeech (Panayotov et al., 2015) corpus. We did no further data preparation for iwslt-corpus and TEDLIUM2; we used the same data as in 2019. For the MuST-C corpus, this year we ran a training with half of the translations being synthetic, which improved the score by 0.35 BLEU. We used the same text translation models as last year to generate the synthetic translations. The two models scored around 33.8 BLEU on tst2010 and 31.1 BLEU on tst2015.

Data filtration
We trained an English ASR system that was used to filter the iwslt-corpus and TEDLIUM2 corpora. We removed examples where the WER between the ASR output and the English reference exceeded 75%. We decided that the MuST-C corpus did not need filtering. Additionally, we filtered iwslt-corpus with regard to translation quality using statistical dictionary-based methods. The sizes of the corpora before and after filtering are shown in Table 1. We did not filter the Librispeech corpus.
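The WER-based filter above can be sketched as follows. This is a minimal, self-contained illustration of the criterion (word-level edit distance over reference length, threshold 0.75), not our actual filtering pipeline; the function names are ours.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance on word lists / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

def keep_example(reference, asr_output, threshold=0.75):
    """Keep the audio/transcript pair only if the ASR output is close enough."""
    return wer(reference, asr_output) <= threshold
```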

Synthetic target data
The TEDLIUM2 corpus does not provide any German translations, so we generated synthetic targets using two Transformer Big MT systems trained with different hyperparameters on WMT data: Paracrawl, Europarl and OpenSubtitles. Training data for these systems was prepared with our in-house data preparation pipeline. We also used synthetic translations as alternative translations in iwslt-corpus when augmenting it. To diversify the target data as much as possible, for each example created in the augmentation process we generated 4 translations, 2 per MT model. Such a technique was described in (Jia et al., 2019). The numbers of training examples with synthetic data are shown in Table 6. Last year we did not examine the effectiveness of synthetic translations on our models. This year we altered our corpus and included synthetic translations of the MuST-C corpus: this corpus was augmented 3 times, and 2 of the resulting 4 versions used synthetic translations.

Data Augmentation
We augmented the data by processing the audio files with three of Sox's effects: tempo, speed and echo. We sampled the parameters from a uniform distribution within the ranges presented in Table 3. For each file we repeated the process four times. Librispeech was augmented only once because it is the largest corpus and it is out of domain. As a result we obtained a nearly five times larger audio corpus. The range of the speed option is very small because we did not want our model to train on unnaturally altered speech.
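The sampling-and-sox step can be sketched as below. The numeric ranges are illustrative placeholders only (the actual values are in Table 3 and are not reproduced here), and the fixed echo gain/decay values are likewise assumptions; only the overall pattern (uniform sampling, one sox invocation per augmented copy) reflects the procedure.

```python
import random

# Placeholder ranges -- the real values come from Table 3 of the paper.
RANGES = {"tempo": (0.85, 1.15), "speed": (0.97, 1.03), "echo_delay_ms": (20, 100)}

def augment_command(src, dst, rng=random):
    """Build one sox command with uniformly sampled effect parameters."""
    tempo = rng.uniform(*RANGES["tempo"])
    speed = rng.uniform(*RANGES["speed"])       # deliberately narrow range
    delay = rng.uniform(*RANGES["echo_delay_ms"])
    # sox <in> <out> tempo <f> speed <f> echo <gain-in> <gain-out> <delay> <decay>
    return ["sox", src, dst,
            "tempo", f"{tempo:.3f}",
            "speed", f"{speed:.3f}",
            "echo", "0.8", "0.9", f"{delay:.0f}", "0.3"]
```

In practice each command would be run with `subprocess.run`, four times per source file.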

E2E Speech Translation System
In this section we will describe the architecture and training techniques of our end-to-end spoken language translation system. Some of these were used in our 2019 system.

ASR Transformer for SLT
As a baseline system we used our last year's Transformer architecture implemented in TensorFlow.
The baseline Transformer has a hidden layer of size 384, a convolutional (kernel size 9) feed-forward layer of size 1536, 8-head self-attention, 6 encoder layers and 4 decoder layers. Audio data is turned into a log mel spectrogram with a frame size of 25 ms, a frame step of 10 ms and 80 filters. To the log mel spectrograms we twice apply a 2D 3x3 convolution with stride 2x2 and 256 filters, and then a 3x20 convolution to reduce the spectrogram to a 384-dimensional vector, exactly as in the ASR case. On top of the baseline, we propose changes to the convolutional layer and increase the number of encoder layers and the embedding size. Our best system has a ResNet-like convolutional layer, 12 encoder layers and an embedding size of 512.
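The shape arithmetic behind this frontend can be traced as follows. This is a sketch of the dimension bookkeeping only (assuming 'same' padding for the stride-2 convolutions), not the actual TensorFlow model; the helper names are ours.

```python
import math

def n_frames(duration_s, frame_ms=25, step_ms=10):
    """Number of spectrogram frames for an utterance (25 ms window, 10 ms step)."""
    return 1 + int((duration_s * 1000 - frame_ms) // step_ms)

def frontend_shape(frames, n_mels=80):
    """Trace (time, frequency) dims through the two 3x3 stride-2x2 convolutions.
    After them the frequency axis is down to 20 bins, which is exactly what the
    final 3x20 convolution collapses into one 384-dimensional vector per step."""
    t, f = frames, n_mels
    for _ in range(2):                      # two stride-2 convolutions
        t, f = math.ceil(t / 2), math.ceil(f / 2)
    return t, f
```

For a one-second utterance this gives 98 frames, reduced to 25 time steps of 20 frequency bins, so the encoder sees a 4x shorter sequence than the raw spectrogram.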

Dual learning: ASR and SLT tasks
This year we also introduced a second decoder with an ASR task, making it a multitask setup similar to (Anastasopoulos and Chiang, 2018). A separate dictionary of size 32k was used for this task. In such a setup the loss is calculated with two targets: one in English and one in German. Two decoders with different weights are trained simultaneously on these targets; the convolutional layers and the encoder are shared. An early experiment on non-augmented data showed an almost 2 BLEU increase (15.23 vs 17.15 on tst2010) compared to the same model trained on a single task. All our trainings this year used dual learning.
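The combined objective can be written as a weighted sum of the two decoder losses. The paper does not state the weighting, so the equal 0.5/0.5 mix below is an assumption, shown only to make the setup concrete.

```python
def dual_task_loss(asr_loss, slt_loss, asr_weight=0.5):
    """Combine the English (ASR) and German (SLT) decoder losses.
    Both decoders see the same shared encoder output; only the loss
    weighting is assumed here, as it is not given in the paper."""
    return asr_weight * asr_loss + (1.0 - asr_weight) * slt_loss
```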

Spectrogram augmentation
To augment the data we implemented the spectrogram masking technique described in (Park et al., 2019). This technique involves masking the spectrogram for a range of frequencies and periods of time. In our implementation we chose to introduce three such masks for frequency. The width of each frequency range is selected randomly between 5 and 10, which means that out of 80 filters, 15 to 30 are masked. In time, we chose one mask for every 300 time steps. Again, the length of such a mask is random, between 10 and 20 time steps. In addition to the data augmentation techniques used in 2019, we introduced two more regularization techniques: warp (Figure 2) and spectrogram noise. We implemented a warping technique similar to (Park et al., 2019): for each 10 time steps in the spectrogram we delete one random time step and insert a step which is the average of its two neighboring steps. The result is very similar to warp distortion: some parts of the spectrogram are shifted to the right and some to the left. We also introduced multiplicative noise on the spectrogram with a value of ±1%. Table 5 shows the results of these techniques.
We also experimented with randomly varying the step and window size of the spectrograms during training: the step size was varied between 8 and 12 ms and the window size between 23 and 27 ms. This, however, gave mixed results, and we did not include it in the final model. Figure 5 shows a clear advantage of the models with warp and spectrogram noise; the advantage of the model with varied step/window size is dubious.
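The three spectrogram-level steps (frequency/time masking, warp, multiplicative noise) can be sketched together as below. This is an illustrative reimplementation from the description above, not our training code; masked regions are set to zero here, and the exact mask-value convention is an assumption.

```python
import numpy as np

def augment_spectrogram(spec, rng):
    """spec: (time, 80) log mel spectrogram. Applies frequency masks,
    time masks, warp, and +-1% multiplicative noise, as described above."""
    spec = spec.copy()
    t, f = spec.shape
    # 1) three frequency masks, each 5-10 bins wide
    for _ in range(3):
        width = int(rng.integers(5, 11))
        start = int(rng.integers(0, f - width + 1))
        spec[:, start:start + width] = 0.0
    # 2) one time mask of 10-20 steps per 300 time steps
    for block in range(0, t, 300):
        length = int(rng.integers(10, 21))
        start = block + int(rng.integers(0, max(min(300, t - block) - length, 1)))
        spec[start:start + length, :] = 0.0
    # 3) warp: per block of 10 steps, delete one random step and re-insert,
    #    at another random position, the average of its two neighbours
    frames = list(spec)
    for block in range(0, len(frames) - 10, 10):
        del_at = block + int(rng.integers(1, 9))
        ins_at = block + int(rng.integers(1, 9))
        del frames[del_at]
        frames.insert(ins_at, (frames[ins_at - 1] + frames[ins_at]) / 2.0)
    spec = np.stack(frames)
    # 4) multiplicative noise of +-1%
    spec *= rng.uniform(0.99, 1.01, size=spec.shape)
    return spec
```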

Synthetic Must C and Librispeech data
Adding synthetic MuST-C translations improved the BLEU score by over 0.35. Adding the Librispeech corpus further improved BLEU by 0.35. Table 6 shows the results.

12 layer encoder
Our experience with text translation suggests that it is more efficient to increase the number of layers in the encoder rather than in the decoder, so we increased the number of encoder layers to 12. The number of decoder layers stayed the same at 4. This increased the BLEU score by 1 point. Introducing warping, noise and the Librispeech corpus increased BLEU by another 0.8.

ResNet-like convolutional layers
Another improvement to our model is ResNet-like convolutional layers processing the spectrogram input. The idea is to make the convolutional layers deeper instead of using a large number of channels. The spectrogram input is shrunk gradually along both axes using 2x2 pooling. As the spectrogram shrinks, the channel count increases: we start with a smaller channel count than in the previous solution, 64 channels instead of 256, and end the convolutional processing with 256 channels. Figure 4 shows an architectural diagram of our solution. Replacing the previous architecture with the ResNet-like one improved the model by around 0.5 BLEU. Table 8 shows the results. Note that the strictly maximal scores do not show the improvement well; Figure 5 shows a plot of the results, which proves a stable advantage (around 0.5 BLEU) of the ResNet solution. We experimented with a version without the residual connections but decided not to include it in the final model.
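The shrinking schedule can be traced as below. This only follows the shape and channel progression stated above (halve both axes with 2x2 pooling, double the channels, 64 up to 256); the number of residual blocks per stage is not given in the paper and is not modelled here.

```python
def resnet_frontend_plan(time, freq, start_channels=64, final_channels=256):
    """List the (time, freq, channels) shape at each stage of the
    ResNet-like frontend: 2x2 pooling halves both axes while the
    channel count doubles, from 64 up to 256."""
    stages, c = [], start_channels
    while True:
        stages.append((time, freq, c))
        if c == final_channels:
            return stages
        time, freq, c = time // 2, freq // 2, c * 2
```

For a one-second utterance (98 frames, 80 mel bins) this gives three stages ending at 256 channels on a 24x20 grid.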

Training process
We trained our models on 4 GTX 1080 Ti GPUs for about one and a half weeks, which resulted in 1.2M steps. The 12-layer models were trained slightly longer, for 3 weeks, because the training was slower; they were also trained on two NVIDIA Quadro 8000 GPUs because of their larger memory. The batch size was 400,000 time steps for trainings on the 1080 Ti and 3,000,000 time steps on the Quadro 8000. In all trainings except one, 10% dropout was applied; the final model with 12 layers and embedding size 512 used 20% dropout. The Adam Multistep optimizer was used, increasing the effective batch size 32 times.
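The multistep mechanism amounts to gradient accumulation: summing gradients over 32 micro-batches before one optimizer update. A minimal framework-agnostic sketch (the function is ours, not the optimizer's actual API):

```python
def accumulate_gradients(grad_batches, accumulation=32):
    """Yield one averaged gradient per `accumulation` micro-batches,
    multiplying the effective batch size without extra GPU memory.
    Each micro-batch gradient is a flat list of floats here for simplicity."""
    buf, count = None, 0
    for g in grad_batches:
        buf = list(g) if buf is None else [a + b for a, b in zip(buf, g)]
        count += 1
        if count == accumulation:
            yield [a / accumulation for a in buf]   # one optimizer step
            buf, count = None, 0
```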

Model averaging
For the final validation we averaged the last 7 checkpoints of the training. Averaging checkpoints almost always resulted in higher BLEU scores. We experimented with continuing training after averaging, but it did not give any better results.
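Checkpoint averaging is a simple parameter-wise mean over the saved models. A framework-agnostic sketch, with each checkpoint represented as a dict from variable name to array:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average the last N checkpoints parameter-wise.
    checkpoints: list of dicts mapping variable name -> np.ndarray."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}
```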

Final model
In the preceding sections we presented improvements evaluated on simpler models. The final model combines all of them: a ResNet-like convolutional layer, 12 encoder layers, an embedding size of 512, dual learning, and the full augmented training data including synthetic translations.

Segmentation
This year we used the same segmentation technique as last year. It relies on dividing the audio input densely using a silence detection tool. These small fragments are then joined together, up to a certain maximal length, depending on the length of the silence between them: fragments separated by shorter silences are joined earlier. This procedure is repeated until any further joining would result in segments longer than the maximal length. Last year we determined that this length should be 11 s; for our current best model, however, it turned out to be 15 s. We used tst2015 to optimize the process. We present the results in Table 10.
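The greedy joining procedure can be sketched as follows. This is an illustrative reconstruction of the described algorithm, not the in-house tool itself; fragments are represented by their durations and gaps by the silence lengths between neighbours.

```python
def join_segments(segments, gaps, max_len=15.0):
    """Greedily merge neighbouring fragments, shortest silence first,
    while the merged segment stays within max_len seconds.
    segments: fragment durations (s); gaps[i]: silence between i and i+1."""
    segs, gps = list(segments), list(gaps)
    while True:
        # gaps whose merge would still respect the length limit
        candidates = [i for i in range(len(gps))
                      if segs[i] + gps[i] + segs[i + 1] <= max_len]
        if not candidates:
            return segs
        i = min(candidates, key=gps.__getitem__)   # shortest silence first
        segs[i:i + 2] = [segs[i] + gps[i] + segs[i + 1]]
        del gps[i]
```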

Evaluation
We have improved our score on tst2019 by 4 BLEU compared to our last year's submission. It is important to note the difference between the given segmentation and our custom one: our method produces longer segments than those in the given segmentation, and our models seem to work much better on these longer segments, giving around 3.9 BLEU higher scores.

Conclusions
In this paper we have presented a significant improvement in the translation quality of our end-to-end model. We have shown that despite limited parallel training data, end-to-end systems can compete with traditional pipeline systems. Using a longer segmentation, our model outscored the best IWSLT 2019 pipeline system on tst2019 (iws, 2019).