Incorporating Noisy Length Constraints into Transformer with Length-aware Positional Encodings

Neural Machine Translation (NMT) often suffers from an under-translation problem due to its limited modeling of output sequence lengths. In this work, we propose a novel approach to training a Transformer model using length constraints based on length-aware positional encoding (PE). Since length constraints with exact target sentence lengths degrade translation performance, we add random noise within a certain window size to the length constraints in the PE during training. At inference time, we predict the output lengths from the input sequences using a BERT-based length prediction model. Experimental results on ASPEC English-to-Japanese translation show that the proposed method produced translations with lengths close to the references and outperformed a vanilla Transformer, especially on short sentences, by up to 3.22 BLEU points. The average translation results using our length prediction model were also better than those of a baseline method that uses input lengths as the length constraints. The proposed noise injection improved robustness to length prediction errors, especially within the window size.


Introduction
In autoregressive Neural Machine Translation (NMT), a decoder generates one token at a time, and each output token depends on the output tokens generated so far. The decoder's prediction of the end of the sentence determines the length of the output sentence. This prediction is sometimes made too early, before all of the input information has been translated, causing the so-called under-translation problem.
The Transformer uses sinusoidal positional encoding to incorporate token position information into its encoder and decoder (Vaswani et al., 2017). Several previous studies have addressed output length control in the Transformer. Takase and Okazaki (2019) proposed two variants of length-aware positional encodings, called length-ratio positional encoding (LRPE) and length-difference positional encoding (LDPE), to control the output length based on given length constraints in automatic summarization. Lakew et al. (2019) applied LDPE and LRPE to NMT. They trained an NMT model using output length constraints based on LDPE and LRPE, along with special tokens representing length-ratio classes between input and output sentences, while using the input sentence length at inference time. However, the length of an input sentence is not a reliable estimator of the output length, because the actual output length varies with the content of the input.
Using length constraints in the decoder is a promising approach to the under-translation problem. We propose an NMT method based on LRPE and LDPE with a BERT-based output length prediction. The proposed method adds noise to the output length constraints during training to improve its robustness against possible length variations in translation. In our experiments with an English-to-Japanese dataset, the BERT-based output length prediction outperformed the use of the input length, and the proposed method, including noise injection into the training-time length constraints, improved the translation performance in BLEU for short sentences.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Positional encoding for controlling output length
The Transformer injects token position information through the sinusoidal positional encoding (PE) proposed by Vaswani et al. (2017):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),  (1)

where pos is the token position, i is the dimension index, and d is the model dimension. LRPE replaces the base 10000 with the given output length len:

PE(pos, 2i) = sin(pos / len^(2i/d)),  PE(pos, 2i+1) = cos(pos / len^(2i/d)),  (2)

while LDPE encodes the number of remaining tokens by counting down from len:

PE(pos, 2i) = sin((len - pos) / 10000^(2i/d)),  PE(pos, 2i+1) = cos((len - pos) / 10000^(2i/d)),  (3)

where len is the given output sequence length. LRPE and LDPE are expected to generate sentences of any length, even if sentences of that exact length are not included in the training data. Takase and Okazaki (2019) used character-based lengths to constrain summaries to a given number of characters.
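As a concrete illustration, the three encodings differ only in which quantity is scaled by which base. A minimal per-element sketch in Python (function names are ours, not from the paper):

```python
import math

def _sinusoid(pos, i, d, base):
    # Shared sinusoid: sin on even dimensions, cos on odd dimensions.
    angle = pos / base ** (2 * (i // 2) / d)
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

def sinusoidal_pe(pos, i, d):
    # Standard PE (Vaswani et al., 2017): fixed base 10000.
    return _sinusoid(pos, i, d, 10000)

def lrpe(pos, i, d, length):
    # Length-ratio PE: the base 10000 is replaced by the target length.
    return _sinusoid(pos, i, d, length)

def ldpe(pos, i, d, length):
    # Length-difference PE: positions count down from the target length,
    # so the encoding reflects the number of remaining tokens.
    return _sinusoid(length - pos, i, d, 10000)
```

In a real model these scalars are assembled into a (sequence length x d) matrix and added to the token embeddings; the sketch keeps only the per-element formulas.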

Proposed method
The Transformer-based model with LRPE and LDPE generates a sequence that almost exactly matches the given length. This characteristic is not always appropriate for machine translation, because valid translation variants can have different lengths. In this paper, we incorporate random noise into the length constraints used by LRPE and LDPE during training to improve the robustness to such length variations. We also propose an output length prediction based on BERT. The noise injection is expected to improve the robustness against possible length prediction errors.

Random noise injection to LRPE/LDPE length constraints
The existing studies that used LRPE and LDPE used the exact output lengths as the length constraints (len in Eqs. 2 and 3) during training. We instead introduce random noise into the output lengths, measured in the number of tokens. The noise is a random integer drawn from a uniform distribution within a window such as [-2, 2]; that is, we randomly choose an integer from {-2, -1, 0, 1, 2}. Although the same positional encoding vectors might appear at different positions when a negative noise value is applied, we ignore such cases in this work for simplicity.
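A minimal sketch of this noise injection, assuming a uniform integer window; the clamp to keep the constraint positive is our own safeguard for very short sentences, not something the paper specifies:

```python
import random

def noisy_length(target_len, window=2):
    # Sample an integer offset uniformly from [-window, window]
    # (e.g. window=2 gives {-2, -1, 0, 1, 2}) and perturb the
    # reference length used as the LRPE/LDPE constraint.
    noise = random.randint(-window, window)
    # Clamp so the length constraint stays positive (our assumption).
    return max(1, target_len + noise)
```

During training, `noisy_length` would be applied per sentence and per epoch, so the model sees slightly different length constraints for the same target sequence.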

BERT-based output length prediction
We need length estimates when using LRPE and LDPE at inference time. Instead of using the input lengths as Lakew et al. (2019) did, we propose an output length prediction based on a pre-trained BERT model in the source language. We use the [CLS] vector in the last layer of the BERT encoder to predict the output length through an output layer, treating the task as a regression problem.
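The prediction head can be sketched as a single linear layer over the [CLS] vector. The class below is a hypothetical illustration with randomly initialized (untrained) weights and plain Python lists standing in for the encoder output; it is not the paper's implementation:

```python
import random

class LengthRegressionHead:
    """Regression output layer on the BERT [CLS] vector (hypothetical sketch)."""

    def __init__(self, hidden_size, seed=0):
        rng = random.Random(seed)
        # Randomly initialized weights; in practice these would be learned by
        # minimizing a regression loss (e.g. MSE) against reference lengths.
        self.w = [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        self.b = 0.0

    def predict(self, cls_vector):
        # Predicted length = w . h_[CLS] + b, rounded to a positive token count.
        score = sum(wi * hi for wi, hi in zip(self.w, cls_vector)) + self.b
        return max(1, round(score))
```

At inference time, the rounded prediction would be plugged in as `len` in the LRPE/LDPE constraint in place of the reference length.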

Experiments
To investigate the performance of our proposed method, we conducted English-to-Japanese translation experiments comparing a vanilla Transformer with its variants using LRPE and LDPE, implemented with OpenNMT (Klein et al., 2017).

Setup
Datasets We used the Japanese-English portion of the ASPEC corpus (Nakazawa et al., 2016), which consists of 3 million parallel sentences for training, 1,790 sentences for development, 1,784 sentences for the devtest, and 1,812 sentences for the test. All the sentences were tokenized into subwords using a SentencePiece model (Kudo and Richardson, 2018) with a shared subword vocabulary of 16,000 entries, trained on 2M English and Japanese sentences including the first set of training sentence pairs (train-1). Throughout the experiments, we used subword-based lengths.
Hyperparameters Our hyperparameter settings followed the OpenNMT-py FAQ and were used commonly for all the compared methods described later in this section. We conducted five independent training runs with different random seeds and chose the best runs and training epochs on the devtest set to determine the models for the final evaluation.
Evaluation We used BLEU (Papineni et al., 2002) as our evaluation metric, given by multi-bleu.perl, and also investigated the length ratio between the translation outputs and the references (LR = tgt len / ref len). BLEU was calculated on translation results re-tokenized by MeCab (Kudo, 2005) after merging the subwords. We also calculated the variance of the length differences between the translation results and the references (VAR) to investigate the effects of the output length constraints, following Takase and Okazaki (2019). For a test set of n sentences, letting d_i be the length difference between the i-th translation and its reference and d_bar the mean of the d_i, the variance is given by:

VAR = (1/n) sum_{i=1}^{n} (d_i - d_bar)^2.
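The two length-based metrics can be sketched as follows; we assume a corpus-level LR (ratio of summed lengths) and the standard population variance of per-sentence length differences, since the paper does not spell out these details:

```python
def length_ratio(tgt_lens, ref_lens):
    # Corpus-level LR = total target length / total reference length.
    return sum(tgt_lens) / sum(ref_lens)

def length_variance(tgt_lens, ref_lens):
    # Variance of per-sentence length differences d_i = tgt_i - ref_i.
    diffs = [t - r for t, r in zip(tgt_lens, ref_lens)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)
```

Both functions take parallel lists of subword counts for the system outputs and the references.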

Compared methods
In addition to a vanilla Transformer (Vaswani et al., 2017), we compared three noise settings for the length constraints applied to LRPE and LDPE during training: exact target lengths without noise, and random integer noise drawn from a uniform distribution over two windows ([-2, 2] and [-4, 4]). We also compared two inference-time length constraints: the proposed BERT-based length prediction (BERT pred) and the input length (src len). We additionally tested the reference lengths (ref len) to investigate the upper-bound performance of the proposed method. Table 1 shows the BLEU, length ratio, and variance results of the compared methods.

Results
BLEU The proposed method with the BERT-based length prediction (BERT pred) and an LDPE with a length noise window of [−4, 4] resulted in a slightly better BLEU score (38.80) than the baseline Transformer (38.42), but the difference was not statistically significant by the bootstrap resampling test.
On the other hand, the simple application of LDPE and LRPE without noise resulted in a much lower BLEU score even with the correct reference lengths. This result suggests that the proposed training framework with some noise in the length constraints improved the robustness to length variations. The use of the input length (src len) instead of the output length prediction significantly decreased BLEU in most cases, although the random noise injection provided some improvements and even competitive performance, as shown in the bottom row. The oracle results with the reference lengths (ref len) were better than those of both the baseline and the proposed method, but the BLEU differences were not significant.
Length ratio The length ratios of LDPE and LRPE with the reference lengths clearly show that the length constraints induced longer outputs than the vanilla Transformer, and that the noise injection slightly shortened the outputs. Using the BERT-based output length prediction resulted in outputs shorter than the reference lengths due to length prediction errors, as discussed below. On the other hand, the input lengths (src len) induced outputs 20% longer than the references without noise injection. Such over-translation was reduced by the noise injection, although the outputs remained longer than the references in all cases.
Variance The length error variances showed that LDPE and LRPE induced output lengths closer to the references than the vanilla Transformer. A wider noise window increased the variances, as shown in the rightmost column (ref len), although such differences became smaller when we used length prediction (BERT pred), possibly due to the length prediction errors. Using the input lengths resulted in much larger variances due to relatively weak correlations between the input and output sentence lengths.
Output length prediction We compared our length prediction (BERT pred) and the input lengths (src len) against the reference lengths (ref len) in terms of mean absolute error and error variance. The mean absolute error of the proposed method was 3.00 and the variance was 19.92, suggesting that the BERT-based length prediction occasionally made serious errors, although it worked well in most cases. The proposed noise injection covered the relatively small errors observed in the experiments. On the other hand, the mean absolute error and variance of the input lengths were much larger: 6.55 and 72.45, respectively. Such differences negatively affect translation results. Table 2 shows the average error, variance, and Pearson correlation coefficient between the input and reference lengths, and between the predicted and reference lengths (in the number of tokens) on the ASPEC dataset. These results clearly show that the input length was not a suitable proxy for the output length constraints in this experiment.
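The error statistics reported in Table 2 can be computed from paired length lists. A plain-Python sketch of the mean absolute error and the Pearson correlation coefficient (the variance is as defined in the Evaluation section):

```python
import math

def mae(pred_lens, ref_lens):
    # Mean absolute error between predicted and reference lengths.
    return sum(abs(p - r) for p, r in zip(pred_lens, ref_lens)) / len(ref_lens)

def pearson(xs, ys):
    # Pearson correlation coefficient between two length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applying `mae` and `pearson` to (predicted, reference) and (input, reference) length pairs reproduces the kind of comparison summarized in Table 2.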
Analysis in length groups We scrutinized the results in different length groups to investigate the effects of the length noise injection, because the length constraints probably have a larger relative impact on shorter sentences. Note that we excluded the longest length group, exceeding 80 tokens, because it contains only three sentences and exhibits serious length errors. In Table 3, the proposed method with LDPE and a noise window of [-2, 2] significantly outperformed the vanilla Transformer by 3.22 points (50.81 vs. 47.59) in BLEU in the shortest length group of one to ten tokens. The other setups also showed better BLEU results than the vanilla Transformer, although the differences were not statistically significant. Another clear finding is that the vanilla Transformer generated very short translations for long sentences, as shown in the rightmost column, whereas LDPE and LRPE produced longer outputs. This finding is helpful for avoiding under-translation problems in NMT.
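A sketch of the bucketing we assume for this analysis (10-token groups starting at 1-10, with sentences above 80 tokens excluded); the exact grouping used in the paper may differ:

```python
def length_group(ref_len, width=10, max_len=80):
    # Assign a sentence to a length bucket: (1, 10), (11, 20), ..., (71, 80).
    # Sentences longer than max_len are excluded from the analysis.
    if ref_len > max_len:
        return None
    lo = ((ref_len - 1) // width) * width + 1
    return (lo, lo + width - 1)
```

Grouping test sentences with this function and scoring each bucket separately yields a per-length-group BLEU breakdown like Table 3.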

Conclusion
We proposed an NMT method using length-aware positional encodings with a training-time length noise injection and a BERT-based inference-time length prediction. The length noise injection improved the robustness to translation length variations, including length prediction errors, especially those within the noise window size. The experimental results show the effectiveness of the proposed method on short sentences. Our future work will pursue a more effective output length prediction to suppress over-translation, since we also found some over-translated outputs in the test set, possibly caused by the length constraints.