Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation

Recently, neural machine translation (NMT) has been widely adopted for its high translation accuracy, but it is also known to perform poorly on long sentences, and this tendency is especially pronounced for low-resource languages. We assume that these problems are caused by the scarcity of long sentences in the training data. Therefore, we propose a data augmentation method for handling long sentences. Our method is simple: we use only the given parallel corpora as training data and generate long sentences by concatenating two sentences. Our experiments confirm that, despite its simplicity, the proposed data augmentation improves long sentence translation. Moreover, the proposed method improves translation quality further when combined with back-translation.


Introduction
Neural machine translation (NMT) achieves high translation quality. However, it has certain drawbacks, such as degraded translation quality for long sentences. Koehn and Knowles (2017) reported that the translation quality of NMT is superior to that of statistical machine translation (SMT) for input sentences within a certain length, but that once the sentence length exceeds a particular value, NMT becomes inferior to SMT, and the longer the sentence, the lower the translation quality.
Additionally, they presented the correlation between the size of the training data and the translation quality (Koehn and Knowles, 2017): the less training data available, the lower the translation accuracy. This issue is prevalent in low-resource languages. Therefore, various data augmentation methods for low-resource parallel corpora have been studied. For instance, generating pseudo data by back-translating monolingual corpora or paraphrasing parallel corpora has been proposed to obtain additional training data (Wang et al., 2018; Sennrich et al., 2016; Li et al., 2019). Hence, this study proposes a data augmentation method that is effective for long sentence translation. The proposed method is illustrated in Figure 1. Long sentences are obtained by concatenating two sentences at random and adding them to the original data. Since we attribute the low translation quality of long sentences to the insufficient number of long sentences in the training data, the proposed method is expected to improve translation quality by alleviating this shortage.
* Current affiliation: Recruit Co., Ltd. † Current affiliation: Tokyo Institute of Technology
This study presents an improved BLEU score and higher quality in long sentence translation on an English-Japanese corpus. Moreover, the BLEU score increases further by incorporating back-translation. In addition, human evaluation shows that fluency improves more than adequacy.
In summary, the main contributions of this paper are as follows:
• We propose a simple yet effective data augmentation method, involving sentence concatenation, for long sentence translation.
• We show that the translation quality can be further improved by combining back-translation and sentence concatenation.

Related Work
NMT exhibits a significant decrease in translation quality for very long sentences. Koehn and Knowles (2017) analyzed the correlation between translation quality and sentence length by comparing NMT with SMT. They showed that the overall quality of NMT is better than that of SMT, but that SMT outperforms NMT on sentences of 60 words and longer, and attributed this degradation to the short length of the generated translations. Additionally, Neishi and Yoshinaga (2019) proposed using relative position information instead of absolute position information to mitigate the performance drop of NMT models on long sentences. They analyzed translation quality against sentence length on length-controlled English-to-Japanese parallel data and showed that absolute positional information sharply drops the BLEU score of the Transformer model (Vaswani et al., 2017) when translating sentences longer than those in the training data.
Several data augmentation methods have been proposed for NMT, such as back-translation, which translates target-side monolingual data to create a pseudo dataset (Sennrich et al., 2016). In this method, a back-translation model is first trained on the parallel corpus in the target-to-source direction. Once converged, this model generates pseudo data by translating the target-side monolingual corpus into the source language. A translation model is then trained on both the pseudo-parallel and the original parallel data. Li et al. (2019) analyzed multiple data augmentation methods, applying self-training and back-translation in their experiments. In self-training, they fixed the source side and used a forward translation model to generate the target side; in back-translation, they fixed the target side and used a backward translation model to generate the source side. They observed that these methods can effectively improve translation accuracy for infrequent tokens. These methods can be combined with the sentence concatenation method proposed in this study.
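The back-translation procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reverse_model` stands in for any sentence-to-sentence callable (in practice, a trained target-to-source NMT model), and the function name is ours.

```python
def back_translate(tgt_mono, reverse_model):
    """Create pseudo parallel data: a reverse (target->source) model
    translates target-side monolingual sentences, and each translation
    is paired with its original target sentence."""
    pseudo_src = [reverse_model(t) for t in tgt_mono]
    # The pseudo sentence goes on the source side; the real sentence
    # stays on the target side, so target-side fluency is preserved.
    return list(zip(pseudo_src, tgt_mono))
```

The pseudo pairs are then simply appended to the original parallel data before training the forward model.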
In multi-source neural machine translation, Dabre et al. (2017) proposed concatenating source sentences in different languages corresponding to a target sentence in training. However, they did not aim to improve the translation accuracy of long sentences. Our method concatenates two source sentences in the same language at random.

Data Augmentation by Sentence Concatenation
The proposed method augments the parallel data by back-translation and concatenation. A schematic overview of the proposed method is shown in Figure 1. First, we back-translate the target side of the parallel corpus (Li et al., 2019; Sennrich et al., 2016) to create pseudo data as additional training data. Note that we do not use external data in back-translation, so the diversity of target sentences does not change.
Then, we randomly select two sentences exclusively from the original or the pseudo data and concatenate them to create additional training data. Technically, we concatenate two source sentences and insert a special token, "<sep>," between them. The corresponding target sentences are concatenated in the same way. Afterwards, we remove sentences consisting of fewer than 25 words from the pseudo data. Finally, we obtain augmented training data comprising the original, pseudo, and concatenated sentences, which is quadruple the size of the original training data.
Table 1: BLEU scores for each sentence length breakdown on the test data set: "vanilla + BT + concat" consists of data from vanilla, BT, and concatenation of both.
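The concatenation step can be sketched as follows. This is an illustrative implementation under our own assumptions: the function name and signature are ours, and sampling sentence pairs uniformly at random with replacement is an assumption rather than a detail stated above; the 25-word filter is applied to the concatenated source side.

```python
import random

SEP = "<sep>"

def concat_augment(src_sents, tgt_sents, num_pairs=None, min_len=25, seed=0):
    """Build long pseudo sentences by joining two randomly chosen
    sentence pairs with the special <sep> token on both the source and
    target side.  Concatenations whose source side has fewer than
    `min_len` words are dropped."""
    rng = random.Random(seed)
    n = len(src_sents)
    if num_pairs is None:
        num_pairs = n  # one concatenated pair per original sentence
    aug_src, aug_tgt = [], []
    for _ in range(num_pairs):
        a, b = rng.randrange(n), rng.randrange(n)
        src = f"{src_sents[a]} {SEP} {src_sents[b]}"
        if len(src.split()) < min_len:
            continue  # keep only long concatenations
        aug_src.append(src)
        aug_tgt.append(f"{tgt_sents[a]} {SEP} {tgt_sents[b]}")
    return aug_src, aug_tgt
```

Because the same indices `a` and `b` are used on both sides, the source and target halves of each concatenated pair stay aligned.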
We train our models on both single and concatenated sentences so that they can learn to translate single sentences as well. We also expect the models to acquire better absolute position encodings and thus translate long sentences with higher quality, without generating the special token (i.e., <sep>) contained in the concatenated sentences during inference.
During the testing process, a single sentence is fed as the input, even though the training data contains concatenated sentences. 1

Models
To investigate the effectiveness of the proposed method when combined with previous data augmentation methods, five types of training data were prepared from the original training data. Figure 2 shows the number of training sentences used in this experiment. Note that the total numbers of sentences in "vanilla + concat," "vanilla + ST," and "vanilla + BT" are nearly equal. In the source language, the average sentence length of "vanilla" is 30.39 words, and that of "vanilla + concat" is 46.18 words.
1 We also conducted an experiment with two sentences as input during testing, but the BLEU score was worse than that of the proposed method.
We train the forward translation models using the training data and compare the BLEU scores obtained for the output of the test data.
vanilla. Original data.
vanilla + concat. Original data plus data augmented by sentence concatenation. Concatenated sentences with fewer than 25 words were removed to improve the translation quality of long sentences.
vanilla + ST. Original data and augmented data by self-training.
vanilla + BT. Original data and augmented data by back-translation.
vanilla + BT + concat. The composite data of the original data, the back-translated data, and their sentence concatenation. 2
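The five configurations above can be assembled as follows. This is a sketch under our own assumptions: each corpus is represented as a list of (source, target) pairs, `concat_fn` is any pairing-and-joining function such as the one in the previous section, and all names are ours.

```python
def build_training_sets(orig, bt, st, concat_fn):
    """Assemble the five training configurations compared here.
    `orig` is the original parallel data, `bt` the back-translated
    pseudo data, `st` the self-training pseudo data."""
    return {
        "vanilla": orig,
        "vanilla + concat": orig + concat_fn(orig),
        "vanilla + ST": orig + st,
        "vanilla + BT": orig + bt,
        # Concatenation is applied to the original and the
        # back-translated data separately, and both are added.
        "vanilla + BT + concat": orig + bt + concat_fn(orig) + concat_fn(bt),
    }
```

Keeping the configurations in one dictionary makes it easy to train one forward model per setting and compare BLEU scores across them.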

Setup
We used ASPEC 3 from WAT17 (Nakazawa et al., 2017) to perform English-to-Japanese translation. This dataset contains 2M sentences as training data, 1,790 sentences as validation data, and 1,812 sentences as test data. We also followed the official segmentation using SentencePiece (Kudo and Richardson, 2018). Sentences were extracted from the training data and selected as the training data used in this experiment. For the self-training and back-translation models, we used only the training corpus, following Li et al. (2019). The Transformer models from Fairseq were used in the experiment (Ott et al., 2019) 4. Adam was used as the optimizer, with a dropout of 0.3, a maximum of 300,000 steps in the training process, and a total batch size of approximately 65,536 tokens per step. The same architecture was also used to train the self-training and back-translation models.
Table 2: Human evaluation: pairwise comparison of "vanilla + BT" and "vanilla + BT + concat." "win" denotes that the sentence generated by our proposed method, "vanilla + BT + concat," is superior to that of "vanilla + BT," and "lose" denotes the opposite.
Figure 3: Effectiveness of the proposed method for each data size by sentence length: the vertical axis represents the BLEU score of "vanilla + concat + BT" minus the BLEU score of "vanilla + BT."
The BLEU score (Papineni et al., 2002) was used for automatic evaluation; we report the average of the BLEU scores of three runs with different seeds. Human evaluation was also conducted: for each of three native Japanese evaluators, 100 sentences were randomly selected from the test set. The evaluators performed pairwise comparison between "vanilla + BT" and "vanilla + BT + concat" from two perspectives, adequacy and fluency.
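The length-bucketed scoring used in the evaluation can be sketched as below. This is illustrative only: the bucket edges are placeholders (the paper's Table 1 uses its own length bins), and `score_fn` stands in for any corpus-level metric such as a wrapper around sacrebleu's `corpus_bleu`.

```python
def bleu_by_length(refs, hyps, score_fn, edges=(10, 20, 30, 40, 50, 60)):
    """Split a test set into buckets by reference length (in words)
    and score each bucket separately."""
    buckets = {}
    for r, h in zip(refs, hyps):
        n = len(r.split())
        for e in edges:
            if n <= e:
                key = f"<={e}"
                break
        else:  # longer than the last edge
            key = f">{edges[-1]}"
        buckets.setdefault(key, ([], []))
        buckets[key][0].append(r)
        buckets[key][1].append(h)
    return {k: score_fn(rs, hs) for k, (rs, hs) in buckets.items()}
```

Scoring per bucket, rather than filtering the corpus repeatedly, keeps every sentence in exactly one length bin.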

Results
Automatic evaluation. The results of this experiment are presented in Table 1, which reports BLEU scores for the test data classified by sentence length. The BLEU score of "vanilla + concat" is more stable for sentences longer than 51 words, which constitute the majority of the data augmented by sentence concatenation, although the score for sentences in the 61-70 range is slightly lower than that of "vanilla." Conversely, the translation quality for short sentences is greatly reduced.
Additionally, the overall score of "vanilla + BT + concat" is higher than that of "vanilla + BT" by 0.6 BLEU. In particular, the scores for sentence lengths longer than 41 words improve significantly, which indicates that the proposed method is more effective for long sentence translation. In addition, the score of "vanilla + BT + concat" is much higher than that of "vanilla + concat." Consequently, back-translation and concatenation are shown to be independent factors that improve translation accuracy.
src: Myanma is behind in market economization together with Laos, Canbodia, Vietnam, and the GDP per one person is the lowest in the 4 countries, and it remains $ 180, but Myanma is thought to remarkably develop if political problems are solved, because flatland occupies 7 × 10% of the land and natural resources are rich, and because personnel expenses are extremely cheap.
Human evaluation. Table 2 presents the results of the human evaluation. The outputs of the proposed method improved or were comparable under almost all conditions, except for "11-20" on adequacy and "41-50" on fluency. The proposed method adds sentences of more than 25 words and is effective in improving the translation of such sentences.

Discussion
Test set. Figure 3 depicts the difference in BLEU scores between the proposed method and the baseline for each training data size, broken down by sentence length. Notably, for sentences of 51 words or more, translation quality improves when the data size is between 400K and 800K sentences. However, translation quality degrades with more than 1M sentences; the proposed method is therefore not suitable when a large amount of training data is available.
In the human evaluation, we observe that the proposed method is more effective in terms of fluency than adequacy. We conjecture that the proposed method enables the translation model to better handle absolute positional encoding for long sentences.
Pseudo test set. In this experiment, the number of bilingual sentences in the test set was small, especially for long sentences. For this reason, additional experiments were carried out to confirm the validity of the results. For evaluation, we extracted 1M sentences that were not used for training from the training data and used them as a pseudo test set. Table 3 shows the average of the BLEU scores of three runs with 400K training sentences and different seeds. The overall BLEU score is lower than when using the test data, probably because the quality of the training data is lower than that of the test data.
Comparing "vanilla + BT" with the proposed method, the proposed method shows a slightly better overall score. Examining the scores by sentence length, there is a significant increase for longer sentences, especially those of 101-200 words. This indicates that the proposed method is effective in improving the translation accuracy of long sentences.
Also, a comparison similar to the one on the test set was conducted using this 1M pseudo test set. The results are shown in Figure 4. In this setting, it is even more evident that for sentences of 51 words or more, translation accuracy improves when the data size is 800K or less and decreases when it exceeds 1M.

Case Study
Tables 4 and 5 show the cases in which the proposed method worked effectively in this experiment, whereas Table 6 shows the cases in which the translation quality deteriorated.
The example in Table 4 shows that the sentence output by "vanilla" is shorter than expected, indicating that information necessary for the translation is missing. In contrast, the output of "vanilla + concat" is a longer sentence, which reduces the missing information.
The example in Table 5 shows a translation improved by the proposed method. Similar to the previous example, "vanilla + BT" completely loses the information in the first half of the sentence, while "vanilla + BT + concat" produces a translation that includes the information of the entire sentence. However, as shown in the example in Table 6, there were cases where the model trained with concatenated data produced repetitive outputs that were not seen in the output of the model trained on the original data. This type of output occurs more frequently for short sentences, which suggests that the ability to output long sentences may lead to unnatural repetition because the model attempts to generate long sentences.
src: Results of the analysis shows high accuracy properties, such as the reproducibility of relative standard deviation 0.3~0.9% varified by repetitive analyses of ten times, the clibration curves with correlation coefficient of 1 verified by tests of standard materials in using six kinds of acetonitrile dilute solutions, and the formaldehyde detection limit of 0.0018µg/mL.

Conclusion
This study proposes a data augmentation method to improve the translation quality of long sentences. The experimental results confirmed that the data augmentation method is straightforward but useful, especially for the translation of very long sentences. However, the quality of the translation of short sentences is reduced.
In the future, we would like to develop a method that works well when there is a large amount of available parallel data. Moreover, since the adequacy of the translation of short sentences is considerably low in the proposed method, we would like to compensate for this weakness by considering the reconstruction loss (Tu et al., 2017). Also, it would be interesting to explore the use of interpolation of hidden space for data augmentation considering long sentences (Chen et al., 2020).