NAIST’s Machine Translation Systems for IWSLT 2020 Conversational Speech Translation Task

This paper describes NAIST’s NMT system submitted to the IWSLT 2020 conversational speech translation task. We focus on the translation of disfluent speech transcripts that include ASR errors and non-grammatical utterances. We tried a domain adaptation method that transfers the style of out-of-domain data (United Nations Parallel Corpus) to resemble in-domain data (Fisher transcripts). Our results showed that the NMT model with domain adaptation outperformed a baseline. In addition, a slight improvement from the style transfer was observed.


Introduction
Neural Machine Translation (NMT) has significantly improved the quality of Machine Translation (MT) (Bahdanau et al., 2014; Sutskever et al., 2014; Luong et al., 2015). However, domain-specific translation is still difficult in low-resource scenarios, although high performance can be achieved in resource-rich scenarios (Chu and Wang, 2018). Another major problem is the difficulty of translating noisy input sentences that include fillers, hesitations, etc. Belinkov and Bisk (2017) suggest that learning to translate noisy sentences is more difficult than learning to translate clean ones. The translation of noisy sentences is very important for spoken language translation. In the IWSLT 2020 Conversational Speech Translation Task, we tackle these two problems.
The task includes speech-to-text and text-to-text translation from disfluent Spanish speech or transcripts to fluent English text. We chose the text-to-text subtask for our participation. The data for this task consists of about 130K bilingual pairs, which would not be enough to learn a highly accurate NMT model (Koehn and Knowles, 2017). In such a low-resource scenario, one promising approach is domain adaptation using out-of-domain parallel corpora and in-domain monolingual corpora (Wang et al., 2016; Chu et al., 2017).
In domain adaptation, the "similarity" between in-domain and out-of-domain data affects the translation accuracy significantly (Koehn and Knowles, 2017). A domain can be defined by any property of the training data such as topic and style. We expect that the domain similarity comes from these properties.
Let us return to the task description. In the task, the inputs are conversational speech transcripts produced by Automatic Speech Recognition (ASR). They can include ASR errors as well as the disfluent and non-grammatical utterances of spontaneous speech. In contrast, the outputs are fluent sequences. In other words, the purpose of this task is to translate disfluent transcripts into fluent sentences. As mentioned before, domain adaptation is a common practice in a low-resource scenario. However, it is difficult to prepare external parallel data with a disfluent source language and a fluent target language. Although fluent written parallel data are widely available, the effects of training with them are limited because the style of their input sentences differs from that of the in-domain data. We need a new training strategy that can effectively use out-of-domain data with low similarity to in-domain data.
In this paper, we propose a novel domain adaptation method through style transfer of out-of-domain data using unsupervised machine translation. We increase the similarity between out-of-domain and in-domain data by transferring out-of-domain fluent input sentences into disfluent styles. This enables effective domain adaptive training and provides a robust NMT system for noisy input sentences.

System Details
Our method consists of two components, as illustrated in Figure 1: (1) a style transfer model from fluent to disfluent Spanish, and (2) a translation model from disfluent Spanish to fluent English. First, we transferred fluent Spanish in out-of-domain data into disfluent Spanish (Section 2.1). Then we trained the NMT model leveraging both out-of-domain parallel data and in-domain parallel data (Section 2.2).

Unsupervised Style Transfer
We employed an unsupervised learning method for the style transfer of the Spanish side of the out-of-domain data. This is because there is no parallel corpus of fluent and disfluent Spanish, so supervised learning methods cannot be applied. Artetxe et al. (2018) and Lample et al. (2018a,b) proposed Unsupervised Neural Machine Translation (UNMT), which learns to translate using monolingual corpora of two languages. In this system, we built a fluent-to-disfluent style transfer model based on UNMT with out-of-domain fluent data and in-domain disfluent data.

Domain Adaptation
For the challenge task, we applied fine-tuning, which is one of the conventional domain adaptation methods for MT (Sennrich et al., 2016a). Fine-tuning can result in significant improvements compared to training on in-domain data only or on out-of-domain data only (Dakwale and Monz, 2017). In this method, an NMT model is pre-trained on resource-rich out-of-domain data until convergence, and then its parameters are fine-tuned on low-resource in-domain data.
In this study, we pre-trained the NMT model on the pseudo in-domain data generated in Section 2.1, and then fine-tuned it on the in-domain data.

Results

Datasets
We used the LDC Fisher Spanish speech (disfluent) with new English translations (fluent) (Post et al., 2013; Salesky et al., 2018) as parallel in-domain data and the United Nations Parallel Corpus (UNCorpus) (Ziemski et al., 2016) as parallel out-of-domain data. Fisher has the following multi-way parallel data distributed by the task organizer:  (Sennrich et al., 2016b) to split sentences into subwords. The vocabulary size was set to 32,000, and sentences longer than 175 subwords were excluded from training. We applied lowercasing and punctuation removal to UNCorpus, as was done for the Fisher corpus.
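The normalization and length filtering described above can be sketched as follows. This is a minimal illustration, not the scripts actually used: the function names are hypothetical, and the exact punctuation set (ASCII punctuation plus the Spanish inverted marks) is an assumption, since the text does not specify it.

```python
import string
from typing import Sequence

MAX_SUBWORDS = 175  # length limit stated in the text

# Assumption: "punctuation" means ASCII punctuation plus Spanish inverted marks.
_PUNCT = string.punctuation + "¿¡"
_STRIP_TABLE = str.maketrans("", "", _PUNCT)


def normalize(sentence: str) -> str:
    """Lowercase, delete punctuation characters, and collapse whitespace."""
    return " ".join(sentence.lower().translate(_STRIP_TABLE).split())


def within_length_limit(subwords: Sequence[str], limit: int = MAX_SUBWORDS) -> bool:
    """Return True if a subword-segmented sentence is short enough to keep."""
    return len(subwords) <= limit
```

Subword segmentation itself is left out of the sketch; `within_length_limit` expects a sentence that has already been split into subwords.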
Model We used the implementation of UNMT by Lample et al. (2018b). The UNMT model was based on the Transformer (Vaswani et al., 2017). Our models follow the parameters suggested by the UNMT implementation. We used a three-layer shared encoder and shared decoder. We set the word embedding dimensions, hidden state dimensions, and feed-forward dimensions to 512, 512, and 2048, respectively. We employed eight attention heads for both the encoder and the decoder. We chose Adam (Kingma and Ba, 2014) with a learning rate of 0.0001, β1 = 0.9, and β2 = 0.999 as the optimizer. Each mini-batch contained 16 sentences.
In order to gain robustness to the content of the sentence, we first pre-trained the model using only UNCorpus/Train. During pre-training, early stopping was applied on the BLEU score between source sentences and back-translated sentences of UNCorpus/Dev with a patience of 10 iterations, and the model with the highest score was stored. After that, one additional iteration of training was performed using Fisher/Train.
Table 2 shows the perplexity of the language model for Fisher/Train. Transferring the fluent UNCorpus into the disfluent Fisher style (Fisher-like UNCorpus) reduced the perplexity and the number of unknown words.
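The early-stopping rule used during pre-training (patience of 10 on a development-set BLEU score, keeping the best model) can be sketched as below. This is an illustration under assumptions, not the UNMT implementation's own logic: `select_best_checkpoint` is a hypothetical helper, and it assumes that a score which does not strictly improve on the best one counts against the patience budget.

```python
def select_best_checkpoint(scores, patience=10):
    """Return (best_iteration, best_score) under patience-based early stopping.

    `scores` is an iterable of validation scores, one per iteration,
    where higher is better (e.g. round-trip BLEU on the development set).
    """
    best_iter, best_score = None, float("-inf")
    bad_iters = 0
    for i, score in enumerate(scores):
        if score > best_score:
            best_iter, best_score = i, score
            bad_iters = 0  # improvement: reset the counter
        else:
            bad_iters += 1
            if bad_iters >= patience:
                break  # no improvement for `patience` consecutive iterations
    return best_iter, best_score
```

Because `scores` may be a generator, training can be driven lazily and stops as soon as the patience budget is exhausted.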

NMT with Domain Adaptation
We trained NMT models that translate from disfluent Spanish to fluent English.

Experimental Settings
Data For training data, we used Fisher/Train as in-domain data and UNCorpus/Train and Fisher-like UNCorpus/Train as out-of-domain data. Fisher-like UNCorpus is the same size as UNCorpus. During training, we used Fisher/Dev as a validation set. Fisher/Test was used for evaluation. We preprocessed the data in the same way as in the previous experiment. However, for practical use, lowercasing and punctuation removal were applied only to the source language.
Model We used OpenNMT-py. The NMT model was based on the Transformer. The hyperparameters of the model mostly follow the Transformer base settings (Vaswani et al., 2017). Note that in the Fisher-only experiment without domain adaptation, the batch size was halved to 2048 tokens. The model was trained for 20,000 iterations using out-of-domain data, and then fine-tuned for 1,000 iterations using in-domain data. The model parameters were saved every 100 iterations.
Evaluation To evaluate the performance, we calculated BLEU scores (Papineni et al., 2002) with sacreBLEU.
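The two-stage schedule described above (20,000 pre-training iterations, 1,000 fine-tuning iterations, a checkpoint every 100 iterations) can be laid out as in the sketch below. It is only a bookkeeping illustration and does not call OpenNMT-py; the `Stage` class and `checkpoint_steps` function are hypothetical names, and counting checkpoints on a single global step counter across both stages is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    iterations: int


# Iteration counts and the save interval are taken from the text.
SCHEDULE = [
    Stage("pretrain", "Fisher-like UNCorpus/Train", 20_000),
    Stage("finetune", "Fisher/Train", 1_000),
]
SAVE_EVERY = 100


def checkpoint_steps(schedule, save_every=SAVE_EVERY):
    """Yield (global_step, stage_name) for every saved checkpoint."""
    step = 0
    for stage in schedule:
        for _ in range(stage.iterations):
            step += 1
            if step % save_every == 0:
                yield step, stage.name
```

Under these assumptions the schedule produces 210 checkpoints, the last 10 of which belong to the fine-tuning stage.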
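For reference, the corpus-level BLEU score of Papineni et al. (2002) is usually written as below, where p_n is the modified n-gram precision, c is the total length of the system output, and r is the effective reference length. The choice N = 4 with uniform weights w_n = 1/4 is the common default and is assumed here; the text does not state the sacreBLEU settings that were used.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r, \\
\exp\!\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r.
\end{cases}
```

The brevity penalty BP lowers the score when the output is shorter than the reference, which matters here because removing disfluencies changes output length.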

Results
Tables 3 and 4 show the BLEU scores of the systems evaluated with single fluent references. In Table 3, "Fisher", "UNCorpus" and "Fisher-like UNCorpus" are models trained on a single training dataset. "UNCorpus + Fisher" and "Fisher-like UNCorpus + Fisher" are models that were pre-trained on UNCorpus and Fisher-like UNCorpus, respectively, and then fine-tuned on Fisher/Train. The models in Table 4 did not use Fisher's fluent references for training but instead used disfluent references. Both with and without Fisher's fluent references, domain adaptation training outperformed the baseline. Furthermore, when the pseudo-disfluent Spanish generated by the style transfer was used for training, the score was better than with the original UNCorpus without the style transfer. We submitted six systems in total: "Fisher", "UNCorpus + Fisher" and "Fisher-like UNCorpus + Fisher" in Table 3, and all of the systems in Table 4.

Discussion
Effect of Style Transfer In domain adaptation training, the accuracy was slightly improved by transferring the style of out-of-domain data to be like in-domain data. This shows that there is some value in increasing the similarity between domains through style transfer.
However, when we did not perform domain adaptation and only trained with out-of-domain data, the accuracy for in-domain data was reduced by style transfer. The following is an example of a style-transferred sentence:
nueva york 1 a 12 de junio de 2015 (original)
nueva york oh a mi eh de de de de (generated)
As shown above, some generated sentences lost the meaning of the original sentence because phrases were missing. As a result, the quality of the parallel data decreased and the final translation performance was also degraded. One of the causes of this problem is that the style transfer constraints are too strong. Thus, it may be mitigated by a model that can control the trade-off between style transfer and content preservation (Niu et al., 2017; Agrawal and Carpuat, 2019; Lample et al., 2019).
Further improvement can be expected by preventing changes in the meaning of sentences and converting only the style.
Fluent vs Disfluent references The model trained using Fisher's original disfluent data had a BLEU score about three points lower than the model trained using the fluent data. In other words, in this task, we found that removing the disfluency of reference sentences improves BLEU by about three points for all the learning strategies we tried. In domain adaptation, we expected this problem to be mitigated by training on large out-of-domain data with fluent reference sentences, but the desired results were not obtained.

Conclusion
In this paper, we presented NAIST's submission to the IWSLT 2020 Conversational Speech Translation task. We showed experimentally that domain adaptation can improve the translation accuracy of disfluent sentences. Moreover, the translation accuracy was improved by increasing the similarity between domains through style transfer, but the effect was limited due to the degradation of parallel data quality.
Furthermore, the loss of accuracy caused by not using clean reference sentences of in-domain data could not be resolved by domain adaptation either.
In future work, we will pursue a style transfer system that does not reduce the quality of the parallel data and use it to improve the translation accuracy of NMT. High-quality style transfer may allow us to acquire robustness to the disfluency of input sentences and to learn fluent outputs by removing the disfluency of output sentences.