Bilingual Subword Segmentation for Neural Machine Translation

This paper proposes a new subword segmentation method for neural machine translation, “Bilingual Subword Segmentation,” which tokenizes sentences so as to minimize the difference between the number of subword units in a sentence and that of its translation. While existing subword segmentation methods tokenize a sentence without considering its translation, the proposed method tokenizes a sentence by using subword units induced from bilingual sentences, which could be more favorable to machine translation. Evaluations on the WAT Asian Scientific Paper Excerpt Corpus (ASPEC) English-to-Japanese and Japanese-to-English translation tasks and the WMT14 English-to-German and German-to-English translation tasks show that our bilingual subword segmentation improves the performance of Transformer neural machine translation (up to +0.81 BLEU).


Introduction
Subword units have recently been widely used in neural machine translation (NMT) to solve open vocabulary problems. Byte Pair Encoding (BPE) (Sennrich et al., 2016) is a dominant subword segmentation method for NMT, but it is designed for segmented languages in which words are divided by spaces. Kudo (2018) has proposed a subword segmentation method based on a unigram language model, which can be applied to non-segmented languages such as Chinese and Japanese. Both BPE and the unigram language model tokenize sentences by minimizing the number of segments under a limitation on subword vocabulary size, which relies on a data compression principle. In these existing segmentations, a sentence is segmented without considering its translation, and therefore the segmented sentence might not be optimal for NMT.
This paper proposes a new subword segmentation method for NMT, "Bilingual Subword Segmentation," which tokenizes sentences by using subword units induced from bilingual sentences. The proposed method is based on a unigram language model like Kudo (2018) because we aim to improve translation performance for non-segmented languages. (Although we focus on translation for non-segmented languages, we also examined and confirmed our proposed segmentation's effectiveness on translation for segmented languages, i.e., English-German, which is described in Section 5.6.) In particular, the proposed segmentation tokenizes bilingual sentences (i.e., training data for NMT) by selecting subword sequence pairs with similar numbers of segments from segmentation candidates of the source and target language sentences obtained by a unigram language model. For segmentation of monolingual source language sentences (i.e., test data for NMT), an LSTM-based subword segmenter for the source language is learned beforehand from the source side of segmented bilingual sentences, and monolingual source language sentences are tokenized by the learned subword segmenter.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Our bilingual segmentation encourages one-to-one mappings between segments across languages because it minimizes the difference between the number of segments in a sentence and that of its translation. As a result, subword units segmented by our bilingual segmentation can be expected to be more helpful for NMT than conventional subword units. For example, consider the situation in which two Japanese compound words meaning "design method" and "measurement instrument" occur many times in training data. When a conventional subword segmentation method is used, each would be merged into one subword unit because the conventional method minimizes the number of segments according to the data compression principle. As a result, these segments in training data are useless for translating the compound word meaning "measurement method". On the other hand, when the proposed segmentation method is used, these compounds would be decomposed into the units for "design" and "method" and the units for "measurement" and "instrument", respectively, because our method brings the number of Japanese subword units close to the number of English subword units (i.e., 2). Thus, the segmented training data would be useful for translating "measurement method" because an NMT model can learn translations of the units for "measurement" and "method" from the constituents of "measurement instrument" and "design method", respectively. We evaluate the proposed subword segmentation method on WAT Japanese-to-English (Ja-En) and English-to-Japanese (En-Ja) translation tasks with the ASPEC (Nakazawa et al., 2016).
We also evaluate our proposed method on WMT14 English-to-German (En-De) and German-to-English (De-En) translation tasks. These experiments show that the proposed subword segmentation method improves the performance of Transformer NMT (Vaswani et al., 2017) on all translation tasks (up to 0.81 point improvement in BLEU).

Subword Segmentation Based on the Unigram Language Model
This section describes the subword segmentation method based on the unigram language model (Kudo, 2018), which is the basis of our proposed segmentation method. The unigram language model assumes that each subword occurs independently and that the occurrence probability of a subword sequence $P(x)$ is formulated as follows:
\[ P(x) = \prod_{i=1}^{N} p(x_i), \qquad \forall i \;\; x_i \in V, \qquad \sum_{x \in V} p(x) = 1, \]
where $x = (x_1, x_2, \ldots, x_N)$ is a subword sequence and $V$ is a vocabulary set. Each subword occurrence probability $p(x_i)$ is estimated by an EM algorithm that maximizes the following marginal likelihood $L_{lm}$:
\[ L_{lm} = \sum_{s=1}^{|D|} \log P(X^{(s)}) = \sum_{s=1}^{|D|} \log \Bigl( \sum_{x \in S(X^{(s)})} P(x) \Bigr), \]
where $D$ is a parallel corpus, $X^{(s)}$ is the $s$-th source or target language sentence of $D$, and $S(X^{(s)})$ is a set of subword candidates built from $X^{(s)}$. The subword sequence with the highest occurrence probability is obtained by the following equation:
\[ x^* = \operatorname*{argmax}_{x \in S(X)} P(x), \]
where $X$ is the input sequence. Note that k-best subword sequences can be obtained on the basis of the probability $P(x|X)$ calculated by the unigram model, and a sequence with higher probability tends to be shorter because the probability of a subword sequence is the product of each subword's likelihood. The unigram language model's advantages are that it can be learned from raw sentences and that it can tokenize sentences in a non-segmented language such as Chinese and Japanese without requiring a word segmenter.
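As a minimal sketch of the definitions above, the following Python code scores each segmentation by the product of its subword probabilities and returns the k most probable ones. The vocabulary `toy_p` and its probabilities are made-up values, and the brute-force enumeration in `segmentations` is an assumption made for brevity: it is exponential in sentence length and only stands in for the dynamic-programming search that a practical implementation would use.

```python
import math
from itertools import combinations


def segmentations(text):
    """Yield every split of `text` into contiguous, non-empty pieces."""
    n = len(text)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = (0, *cuts, n)
            yield tuple(text[a:b] for a, b in zip(bounds, bounds[1:]))


def log_prob(seq, p):
    """log P(x) = sum_i log p(x_i); -inf if a piece is out of vocabulary."""
    if any(piece not in p for piece in seq):
        return -math.inf
    return sum(math.log(p[piece]) for piece in seq)


def k_best(text, p, k):
    """Return the k most probable (log_prob, segmentation) pairs."""
    scored = [(log_prob(s, p), s) for s in segmentations(text)]
    scored = [(lp, s) for lp, s in scored if lp > -math.inf]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical toy vocabulary; the probabilities sum to 1.
toy_p = {"a": 0.2, "b": 0.2, "c": 0.2, "ab": 0.15, "bc": 0.15, "abc": 0.1}
best = k_best("abc", toy_p, k=4)
for lp, seg in best:
    print(seg, round(math.exp(lp), 4))
```

On this toy vocabulary the single-piece segmentation ranks first and the three-piece segmentation ranks last, which illustrates why sequences with higher probability tend to be shorter.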

Proposed Model: Bilingual Subword Segmentation
This section proposes "bilingual subword segmentation," which tokenizes sentences by using subword units induced from bilingual sentences. In particular, our proposed segmentation tokenizes sentences so as to minimize the difference between the number of a sentence's subword units and that of its translation. In segmentation of training data for NMT, a sentence can be tokenized while referring to its translation (i.e., a sentence in the other language). On the other hand, in segmentation of test data for NMT, translations cannot be given. To address the different situations, we propose a bilingual subword segmentation method for training data and test data, as illustrated in Figures 1(a) and (b), respectively. Note that our proposed method does not depend on an NMT model or a training method; therefore, it can be applied simply by replacing a conventional subword segmentation with our proposed subword segmentation.
[Figure 1: Overview of bilingual subword segmentation. Panel (b), bilingual subword segmentation for test data, is labeled with the source sentence f, the k-best subword segmentations of f, B_k(f), and the character-based BiLSTM.]

Segmentation for Training Data
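In segmentation of training data $D$, the proposed bilingual subword segmentation tokenizes a bilingual sentence $(f, e) \in D$ by selecting a subword sequence pair with similar numbers of segments from segmentation candidates obtained by the unigram language model. The proposed subword segmentation first obtains k-best segmentation candidates of a source language sentence and its target language sentence, $B_k(f)$ and $B_k(e)$, by using the unigram language model described in Section 2, and then finds the following subword segmentation pair $(\hat{f}, \hat{e})$ and outputs it as the subword sequences of the bilingual sentence $(f, e)$:
\[ (\hat{f}, \hat{e}) = \begin{cases} (f^*, \hat{u}) & \text{if } \mathrm{len}(f^*) \ge \mathrm{len}(e^*), \\ (\hat{u}, e^*) & \text{otherwise,} \end{cases} \]
where len() is the function that returns the number of subword tokens, and $f^*$/$e^*$ is the subword sequence with the highest probability (i.e., the best subword sequence by the unigram language model) of the source/target language sentence. Let $v^*$ denote the longer one of $f^*$ and $e^*$, and let $u$ denote the sentence on the other side ($u = e$ if $v^* = f^*$, and $u = f$ otherwise). $\hat{u}$ is obtained by searching for the subword sequence with the highest probability among the subword candidates whose lengths are closest to that of $v^*$ as follows:
\[ \hat{u} = \operatorname*{argmax}_{u' \in U} P(u'), \qquad U = \operatorname*{argmin}_{u' \in B_k(u)} \bigl| \mathrm{len}(u') - \mathrm{len}(v^*) \bigr|, \]
where $U$ is the set of all candidates that attain the minimum length difference. An NMT model with our bilingual subword segmentation is trained from the segmented training data $\hat{D} = \{(\hat{f}^{(s)}, \hat{e}^{(s)})\}_{s=1}^{|D|}$, where each bilingual sentence is segmented by the bilingual segmentation method.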
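The selection rule described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the authors' implementation: the function names, the `(tokens, probability)` candidate format, and the toy candidates (with placeholder source tokens such as `"XY"`) are all hypothetical.

```python
def closest_best(candidates, target_len):
    """Keep the candidates whose length is closest to `target_len`,
    then return the tokens of the most probable one among them."""
    min_gap = min(abs(len(toks) - target_len) for toks, _ in candidates)
    closest = [c for c in candidates if abs(len(c[0]) - target_len) == min_gap]
    return max(closest, key=lambda c: c[1])[0]


def bilingual_segment(src_kbest, tgt_kbest):
    """Pick a (source, target) segmentation pair from two k-best lists.

    Each list holds (tokens, probability) pairs from a unigram model.
    """
    f_star = max(src_kbest, key=lambda c: c[1])[0]  # best source segmentation
    e_star = max(tgt_kbest, key=lambda c: c[1])[0]  # best target segmentation
    if len(f_star) >= len(e_star):
        # v* is the source side: keep f*, re-select the target side.
        return f_star, closest_best(tgt_kbest, len(f_star))
    # v* is the target side: keep e*, re-select the source side.
    return closest_best(src_kbest, len(e_star)), e_star


# Hypothetical k-best lists for a compound and its two-word translation.
src_kbest = [(["XY"], 0.6), (["X", "Y"], 0.3), (["X", "y1", "y2"], 0.1)]
tgt_kbest = [(["design", "method"], 0.8), (["de", "sign", "method"], 0.2)]
pair = bilingual_segment(src_kbest, tgt_kbest)
print(pair)
```

In this toy case the most probable source segmentation has one unit and the most probable target segmentation has two, so the source side is re-selected to the two-unit candidate. When both best segmentations have the same length, either branch returns the pair of best segmentations.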

Segmentation for Test Data
In segmentation of a source language sentence $f$ of test data, the sentence's translation (i.e., $e$) is unknown. To tokenize a sentence without its translation, our proposed method trains, in advance, a character-based bidirectional LSTM (BiLSTM) segmenter for the source language from the source side of the training data $\hat{D}$ segmented by our bilingual segmentation method (i.e., $\{\hat{f}^{(s)}\}_{s=1}^{|D|}$) (see Section 3.1). Then, a monolingual source language sentence $f$ is tokenized by using the trained BiLSTM-based segmenter.
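One way to derive character-level training targets for such a segmenter from already segmented sentences is to mark, for every character, whether it begins a subword. The sketch below illustrates this conversion and its inverse; the function names and the boolean encoding are assumptions made for illustration, not the paper's data format.

```python
def subwords_to_tags(subwords):
    """Flatten subwords into characters plus 'begins a subword' flags."""
    chars, begins = [], []
    for piece in subwords:
        for i, ch in enumerate(piece):
            chars.append(ch)
            begins.append(i == 0)  # True only for the first character
    return chars, begins


def tags_to_subwords(chars, begins):
    """Rebuild the subword sequence from characters and boundary flags."""
    pieces = []
    for ch, is_begin in zip(chars, begins):
        if is_begin or not pieces:
            pieces.append(ch)      # start a new subword
        else:
            pieces[-1] += ch       # extend the current subword
    return pieces


chars, begins = subwords_to_tags(["un", "igram"])
print(chars, begins)
print(tags_to_subwords(chars, begins))
```

Because the conversion is lossless, any boundary-tag sequence predicted for a sentence maps back to exactly one subword segmentation of that sentence.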
The character-based BiLSTM segmenter identifies subword boundaries of an n-character sequence $c = (c_1, c_2, \ldots, c_n)$. The structure of the segmenter is as follows:
\[ z = \mathrm{Embedding}(c), \qquad h = \mathrm{BiLSTM}(z), \qquad b = \mathrm{softmax}(hW), \]
where Embedding() is a character embedding layer, $z$ is the d-dimensional character embedded representation of $c$, BiLSTM() is a character-based BiLSTM layer, $h$ is the hidden vectors of the BiLSTM, softmax() is a softmax function, $b$ is the output of the segmenter, and $W \in \mathbb{R}^{d \times \{0,1\}}$ is a parameter matrix that projects $h$ into the boundary tag dimension. Note that the vector $b_t = (b_{t,0}, b_{t,1})$ represents the probability distribution of whether $c_t$ is a subword's beginning point ($b_{t,0}$) or not ($b_{t,1}$). The character-based BiLSTM is trained by maximizing an objective $L_{segment}$ on the segmented training data.
In a source language's sentence segmentation, the k-best subword segmentations $B_k(f)$ of the input source language sentence $f$ are first obtained by using the unigram language model; then, the segmentation score $\mathrm{score}(f')$ of each segmentation candidate $f' \in B_k(f)$ is calculated by the learned character-based BiLSTM. Finally, the subword sequence with the highest score is selected:
\[ \hat{f}^* = \operatorname*{argmax}_{f' \in B_k(f)} \mathrm{score}(f'). \]
An NMT model with our bilingual subword segmentation translates the segmented source language sentence $\hat{f}^*$.

Experiments
In our experiments, we compared the proposed "bilingual subword segmentation" with the "unigram language model" (Kudo, 2018). We also compared it with "subword regularization" (Kudo, 2018), which is a training method based on multiple subword candidates obtained by the unigram language model. We used SentencePiece (Kudo and Richardson, 2018) as the unigram language model implementation to obtain multiple subword candidates. We used the Transformer base (Vaswani et al., 2017) model as the NMT system for all experiments.
Data: We evaluated translation performance on WAT ASPEC Ja-En and En-Ja translation tasks (Nakazawa et al., 2016). We set the vocabulary size to 16,000 separately for the source and target languages. We set the batch size to 10,000 tokens. We used the first 1.5 million translation pairs of the training data in training and preprocessed the dataset according to the data preparation process for the WAT baseline system. The numbers of parallel sentence pairs in the development and test sets were 1,790 and 1,812, respectively.
Hyperparameters: For all NMT models, we used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.98. The learning rate was warmed up over the first 4,000 steps to a peak value of 5e-4; then, it was decreased proportionally to the inverse square root of the step number (Vaswani et al., 2017). All NMT models were trained for 100k updates. The dropout probability was set to 0.1. We used label smoothed cross entropy (Szegedy et al., 2016) for NMT and set label smoothing ϵ to 0.1. In decoding, we averaged the last 5 checkpoints, which were saved every 1,000 updates before the end of training. We used beam search with a beam size of 4 and length penalty α = 0.6 (Wu et al., 2016). In the proposed model, the hyperparameter k, the number of candidates obtained by a unigram language model, was tuned on development data and set to 5 (i.e., k = 5). We used the character-based BiLSTM with an embedding size d = 256 and 2 encoder layers. All parameters of the character embedding layer, BiLSTM encoder layers, and output layer were initialized uniformly in [−0.1, 0.1]. We trained the character-based BiLSTM for 10 epochs using the Adam optimizer with β1 = 0.9 and β2 = 0.98. The learning rate was set to 5e-4, the dropout probability was set to 0.1, and the batch size was set to 256 sentences.
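The NMT learning-rate schedule described above can be written as a small function. This is a sketch under two assumptions: the warm-up is linear (the text states only that the rate is warmed up to the peak), and the function name `learning_rate` is chosen for illustration.

```python
def learning_rate(step, peak=5e-4, warmup=4000):
    """Warm up to `peak` over `warmup` steps (assumed linear), then decay
    proportionally to the inverse square root of the step number."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    if step <= warmup:
        return peak * step / warmup
    return peak * (warmup / step) ** 0.5


for step in (1000, 4000, 16000, 100000):
    print(step, learning_rate(step))
```

With these settings the rate reaches 5e-4 at step 4,000 and has halved by step 16,000, since the decay factor is the square root of 4,000/16,000.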
In subword regularization, we used 1-best decoding, which translates the segment sequence with the highest unigram language model score. This gives a fair comparison with our proposed method because an NMT model with our segmentation method also translates a single segmented sequence.
As shown in Table 1, our proposed model "BiSW" outperformed both baseline models, "Unigram LM" and "Subword Regularization," in both language directions. "BiSW" improved by 0.81 and 0.10 BLEU points against "Unigram LM" on Ja-En and En-Ja, respectively, and by 0.53 and 0.19 BLEU points against "Subword Regularization" on Ja-En and En-Ja, respectively. These statistically significant improvements demonstrate our bilingual subword segmentation's effectiveness.

Comparison with Oracle Bilingual Segmentation
We evaluated the segmentation performance of our character-based BiLSTM segmenter on test data through comparison to gold bilingual segmentations, which are obtained by applying our proposed method for training data, described in Section 3.1, to the test data with references (i.e., bilingual sentences). Table 2(a) shows that our character-based BiLSTM segmenter achieved high segmentation performance.
We also evaluated the performance of the oracle translation, which translates source language sentences with gold bilingual segmentations. Note that although references are used for segmenting source language sentences, they are not used in translation. The oracle translation performance provides an upper bound of our proposed method's performance. In Table 2(b), "Oracle" indicates translation using gold bilingual segmentations. As shown in the table, Oracle achieved higher performance than BiSW, but differences were small. In particular, BiSW decreased only by 0.10 and 0.20 BLEU points on Ja-En and En-Ja, respectively, perhaps because our character-based BiLSTM segmenter could achieve high segmentation performance, as shown in Table 2(a).

Necessity of Character-Based BiLSTM Segmenter
Our proposed method requires a character-based BiLSTM segmenter for segmentation of monolingual sentences (i.e., test data). To confirm the necessity of the character-based BiLSTM segmenter in testing, we evaluated translation performance when the NMT model trained from bilingual-subword-segmented training data translates the best subword sequence obtained by the unigram language model (i.e., f * ) without using the BiLSTM-based segmenter, denoted by "BiSW w/o BiLSTM." When segmenting training data of "BiSW w/o BiLSTM," v * is fixed to f * (i.e., source-side best segmentation by the unigram language model) to bridge the gap between training and testing. In particular, bilingual segmentation of "BiSW w/o BiLSTM" uses the best subword sequence obtained by the source-side unigram language model and searches only the target-side subword sequence with the length closest to the source-side sequence.
As shown in Table 3, translation performance decreases when the character-based BiLSTM segmenter is not used. In particular, "BiSW w/o BiLSTM" decreases by 0.59 and 0.29 BLEU points against BiSW on Ja-En and En-Ja, respectively. These results indicate that bilingual optimization via searching only for target language sentences is not enough, and that both a bidirectional search between source and target languages and the character-based BiLSTM segmenter are needed for NMT when the performance of the character-based BiLSTM segmenter is high, as shown in Section 5.1.

Examples of Bilingual Subword Segmentation
In this section, we discuss differences between subword units obtained by the conventional method, "Unigram LM," and those obtained by the proposed method. As shown in Table 4(a), our segmentation method decomposed a sequence into subword units that can be mapped into the other side's subword units, which could be helpful for training an NMT model, while the conventional method merged them into one subword unit. Table 4(b) shows examples of subword units in Ja-En test data (i.e., Japanese sentences in test data). As shown in Table 4(b), our segmentation method also successfully decomposed test-data sequences into subword units that could be handled easily by one-to-one translation, even though our segmentation method for test data does not refer to the other side's sentences (i.e., English translations).

Sensitivity to Hyperparameter k
Our proposed model has the hyperparameter k. In this section, we evaluate the sensitivity of our proposed method to the hyperparameter k. Figure 2 shows translation performance with varied k on development data in the ASPEC Ja-En task. In particular, we evaluated the performance of our proposed model when k ∈ [2, 10] and when k = 15, 20, 50, and 100.
[Table 6: Average of the difference between numbers of segments in bilingual sentences on the ASPEC Ja-En task. Only the values 7.10 and 5.38 survive from the table body.]
As Figure 2 illustrates, although there were some exceptions, our proposed model's translation performance tends to improve until k exceeds 50.

Likelihood-based Bilingual Subword Segmentation
As Kudo (2018) has mentioned, "the unigram language model is reformulated as an entropy encoder that minimizes the total code length for the text. According to Shannon's coding theorem, the optimal code length for a symbol s is − log p_s, where p_s is the occurrence probability of s." From this observation, we hypothesized a relationship between the number of segments and the likelihood of a subword sequence; therefore, we evaluated a bilingual subword segmentation method that selects subword sequence pairs on the basis of the likelihood obtained by the unigram language model rather than on the number of segments in the sentence. In particular, the likelihood-based method replaces len() in Equations 5-8 with − log P() calculated by the unigram language model so as to minimize the difference between the negative log-likelihood of a sentence and that of its translation.
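The hypothesized relationship can be made explicit with a short derivation from the unigram factorization of Section 2. The equal-probability case in the second line is an illustrative assumption, not a property of a trained model; the third line gives the general bounds that follow when subword probabilities differ.

```latex
\[
-\log P(x) \;=\; -\sum_{i=1}^{N} \log p(x_i)
\]
\[
p(x_i) = p \ \text{ for all } i
\;\Longrightarrow\;
-\log P(x) \;=\; N \cdot (-\log p)
\]
\[
N \cdot \min_{i} \bigl(-\log p(x_i)\bigr)
\;\le\; -\log P(x) \;\le\;
N \cdot \max_{i} \bigl(-\log p(x_i)\bigr)
\]
```

The code length therefore grows with the number of segments N, but the two quantities coincide up to a constant factor only when all subwords are equally likely.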
In Table 5, "BiSW (# of segments)" and "BiSW (likelihood)" indicate the bilingual segmentation method based on the number of segments and that based on likelihood, respectively. As shown in Table 5, the likelihood-based method outperforms the baseline model, Unigram LM, on Ja-En, but it is worse than the proposed method based on the number of segments in both language directions. A possible reason is that, although the likelihood and the number of segments are related, they do not completely match, and the strength of the relationship might depend on the unigram language model's performance.
We calculated the average of the difference between a sentence's number of segments and that of its translation on training/test data. As in Table 6, differences in our two bilingual segmentation methods are smaller than the difference in the baseline Unigram LM on both training and test data; moreover, the proposed method based on number of segments has a smaller difference than the likelihood-based method.
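A statistic of this kind can be computed as in the following sketch, which assumes that "difference" means the absolute difference in the number of subword units per sentence pair; the function name and the toy corpus are hypothetical.

```python
def mean_length_gap(pairs):
    """Average |len(source) - len(target)| over segmented sentence pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(abs(len(f) - len(e)) for f, e in pairs) / len(pairs)


# Hypothetical segmented corpus of two sentence pairs.
toy_pairs = [
    (["a", "b", "c"], ["x"]),   # gap of 2
    (["a"], ["x", "y"]),        # gap of 1
]
print(mean_length_gap(toy_pairs))
```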

Effectiveness for Segmented Language Pair
Although we focused on translation of non-segmented languages, we also examined our proposed method's effectiveness on a segmented language pair in this section. In particular, we evaluated our proposed method on WMT14 En-De and De-En translation tasks.
In these tasks, we set the vocabulary size to 37k with a joined dictionary. The source-side and target-side embedding layers of an NMT model were shared. We set the batch size to 25k tokens. After subword segmentation, we removed from the training data sentences longer than 250 subword units and sentence pairs with a source/target length ratio exceeding 1.5. The hyperparameter k of our proposed method was set to 2, tuned on development data.
Table 7 shows evaluation results on WMT14 En-De and De-En tasks. Our proposed model "BiSW" was better than the baseline model "Unigram LM" in both language directions. In particular, "BiSW" improved by 0.32 and 0.02 BLEU points against "Unigram LM" on En-De and De-En, respectively. These results demonstrate that our proposed method is also effective for a segmented language pair.
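The training-data filter described above (a maximum of 250 subword units and a length ratio of at most 1.5) can be sketched as follows. The function name is hypothetical, and treating the ratio symmetrically (longer side divided by shorter side) and dropping empty sentences are assumptions, since the text does not specify either.

```python
def keep_pair(src, tgt, max_len=250, max_ratio=1.5):
    """Return True if a segmented sentence pair passes both filters."""
    if not src or not tgt:
        return False  # assumption: empty sides are discarded
    if len(src) > max_len or len(tgt) > max_len:
        return False
    # Assumption: the ratio is symmetric (longer side over shorter side).
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio


corpus = [
    (["a"] * 3, ["b"] * 2),      # ratio 1.5, kept
    (["a"] * 4, ["b"] * 2),      # ratio 2.0, removed
    (["a"] * 251, ["b"] * 251),  # too long, removed
]
filtered = [pair for pair in corpus if keep_pair(*pair)]
print(len(filtered))
```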

Related Work
BPE (Sennrich et al., 2016) and the unigram language model (Kudo, 2018) are widely used as subword segmentation methods. BPE is a simple dictionary-based subword segmentation algorithm in which the most frequent adjacent pairs of characters or character sequences are merged repeatedly until the vocabulary reaches the given size. BPE is widely used in many NMT systems; however, since BPE is a greedy and deterministic algorithm, obtaining multiple subword candidates is not possible. The unigram language model is a likelihood-based subword segmentation algorithm. Each subword occurrence probability is estimated by the EM algorithm. The unigram language model has a more complicated algorithm than BPE, but it has the advantages that it can obtain multiple subword candidates based on likelihood and that it can be learned from raw sentences without pre-tokenization. SentencePiece (Kudo and Richardson, 2018) is the implementation of the unigram language model that we used.
Subword regularization (Kudo, 2018) is an NMT training method that uses multiple subword candidates obtained by the unigram language model and maximizes the marginal likelihood of sampled multiple subword candidates. This method requires on-the-fly subword sampling in training; therefore, the training process for NMT needs to be modified to incorporate the method. In addition, a sufficiently large number of epochs is required for this method to be effective. In contrast, our proposed method does not require changing the NMT training process and does not need a large number of epochs.
BPE-dropout (Provilkov et al., 2020) is a method that extends BPE to use subword regularization. In this method, multiple subword candidates are obtained by probabilistically dropping merged characters. Note that BPE-dropout cannot obtain k-best candidates based on a likelihood such as P(x|X).
Cherry et al. (2018) have shown that NMT that translates character sequences has achieved higher translation performance than word-based and subword-based NMT. However, they have mentioned that character-based NMT causes problems of modeling and computational time. We believe that our proposed method maintains a balance between the advantages and disadvantages of character-based NMT (i.e., translation performance vs. modeling and computational cost).
Ataman et al. (2017), Ataman and Federico (2018b), and Huck et al. (2017) have proposed linguistic-based subword segmentation algorithms. Ataman et al. (2017) and Ataman and Federico (2018b) have shown that their proposed "Linguistically Motivated Vocabulary Reduction (LMVR)," which is based on unsupervised morphology learning, outperforms BPE. Huck et al. (2017) have shown that incorporating linguistic knowledge, such as stemming and compound words, into subword segmentation improves NMT performance. Ataman and Federico (2018a) have further shown that compositional representations learned from character n-grams improve translation performance for morphologically-rich languages.

Conclusion
In this paper, we proposed a new subword segmentation method for NMT, "Bilingual Subword Segmentation," which tokenizes sentences by using subword units induced from bilingual sentences. Experiments on WAT ASPEC Ja-En and En-Ja tasks and WMT14 En-De and De-En translation tasks show that the proposed method improves Transformer NMT translation performance. Through experiments and discussions, we found that translation performance improves by tokenizing sentences so as to minimize the difference between the number of subword units of a sentence and that of its translation. In future work, we would like to confirm our proposed method's effectiveness for other language pairs.