Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English ↔ (German, Romanian, Estonian, Finnish, Hungarian).


Introduction
The segmentation of rare words into subword units (Sennrich et al., 2016; Wu et al., 2016) has become a critical component of neural machine translation (Vaswani et al., 2017) and natural language understanding (Devlin et al., 2019). Subword units enable open vocabulary text processing with a negligible pre-processing cost and help maintain a desirable balance between the vocabulary size and decoding speed. Since subword vocabularies are built in an unsupervised manner (Sennrich et al., 2016; Wu et al., 2016), they are easily applicable to any language.
Given a fixed vocabulary of subword units, rare words can be segmented into a sequence of subword units in different ways. For instance, "un+conscious" and "uncon+scious" are both suitable segmentations for the word "unconscious". This paper studies the impact of subword segmentation on neural machine translation, given a fixed subword vocabulary, and presents a new algorithm called Dynamic Programming Encoding (DPE).
We identify three families of subword segmentation algorithms in neural machine translation:
1. Greedy algorithms: Wu et al. (2016) segment words by recursively selecting the longest subword prefix, while Sennrich et al. (2016) recursively combine adjacent word fragments that co-occur most frequently, starting from characters.
2. Stochastic algorithms (Kudo, 2018; Provilkov et al., 2019): these draw multiple segmentations for source and target sequences, resorting to randomization to improve the robustness and generalization of translation models.
3. Dynamic programming algorithms, studied here, which enable exact marginalization of subword segmentations for certain sequence models.
We view the subword segmentation of output sentences in machine translation as a latent variable that should be marginalized out to obtain the probability of the output sentence given the input. On the other hand, the segmentation of source sentences can be thought of as input features and can be randomized as a form of data augmentation to improve translation robustness and generalization. Unlike previous work, we recommend using two distinct segmentation algorithms for tokenizing source and target sentences: stochastic segmentation for source sentences and dynamic programming for target sentences.
We present a new family of mixed character-subword transformers, for which simple dynamic programming algorithms exist for exact marginalization and MAP inference of subword segmentations. The time complexity of the dynamic programming algorithms is O(TV), where T is the length of the target sentence in characters and V is the size of the subword vocabulary. By comparison, even computing the conditional probabilities of subword units in an autoregressive model requires O(TV) to estimate the normalizing constants of the categorical distributions. Thus, our dynamic programming algorithms do not incur an additional asymptotic cost. We use a lightweight mixed character-subword transformer as a means to pre-process translation datasets, segmenting output sentences with DPE for MAP inference.
The performance of a standard subword transformer (Vaswani et al., 2017) trained on WMT datasets tokenized using DPE is compared against Byte Pair Encoding (BPE) (Sennrich et al., 2016) and BPE dropout (Provilkov et al., 2019). Empirical results on English ↔ (German, Romanian, Estonian, Finnish, Hungarian) suggest that stochastic subword segmentation is effective for tokenizing source sentences, whereas deterministic DPE is superior for segmenting target sentences. DPE achieves an average improvement of 0.9 BLEU over greedy BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over stochastic BPE dropout (Provilkov et al., 2019; code and corpora: https://github.com/xlhex/dpe).

Related Work
Neural networks have revolutionized machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014). Early neural machine translation (NMT) systems used words as the atomic elements of sentences, with vocabularies of tens of thousands of words, resulting in prohibitive training and inference complexity. While learning can be sped up using sampling techniques (Jean et al., 2015), word-based NMT models have a difficult time handling rare words, especially in morphologically rich languages such as Romanian, Estonian, and Finnish, where the word vocabulary must grow dramatically to capture the compositionality of morphemes.
More recently, many NMT models have been developed based on characters or a combination of characters and words (Ling et al., 2015; Luong and Manning, 2016; Vylomova et al., 2017; Lee et al., 2017; Cherry et al., 2018). Fully character-based models (Lee et al., 2017; Cherry et al., 2018) demonstrate a significant improvement over word-based models on morphologically rich languages. Nevertheless, owing to the lack of morphological information, deeper models are often required to obtain good translation quality. Moreover, the elongated sequences produced by a character representation drastically increase the inference latency.
In order to maintain a good balance between the vocabulary size and decoding speed, subword units were introduced in NMT (Sennrich et al., 2016; Wu et al., 2016). These segmentation approaches are data-driven and unsupervised. Therefore, with a negligible pre-processing overhead, subword models can be applied to any NLP task (Vaswani et al., 2017; Devlin et al., 2019). Meanwhile, since subword vocabularies are generated based on word frequencies, only rare words are split into subword units, while common words remain intact.
Previous work (Chan et al., 2016; Kudo, 2018) has explored the idea of using stochastic subword segmentation with multiple subword candidates to approximate the log marginal likelihood. Kudo (2018) observed marginal gains in translation quality at the cost of introducing additional hyperparameters and complex sampling procedures. We utilize BPE dropout (Provilkov et al., 2019), a simple stochastic segmentation algorithm, for tokenizing source sentences.
Dynamic programming has been used to marginalize out latent segmentations for speech recognition (Wang et al., 2017), showing a consistent improvement over greedy segmentation methods. In addition, dynamic programming has been successfully applied to learning sequence models by optimizing edit distance (Sabour et al., 2018) and aligning source and target sequences (Chan et al., 2020; Saharia et al., 2020). We show the effectiveness of dynamic programming for segmenting output sentences in NMT using a mixed character-subword transformer in a pre-processing step.

Latent Subword Segmentation
Let x denote a source sentence and y = (y_1, ..., y_T) denote a target sentence comprising T characters. The goal of machine translation is to learn a conditional distribution p(y | x) from a large corpus of source-target sentences. State-of-the-art neural machine translation systems make use of a dictionary of subword units to tokenize the target sentences more succinctly, as a sequence of M ≤ T subword units. Given a subword vocabulary, there are multiple ways to segment a rare word into a sequence of subwords (see Figure 1). The common practice in neural machine translation treats subword segmentation as a pre-processing step and uses greedy algorithms to segment each word across a translation corpus in a consistent way. This paper aims to find optimal subword segmentations for the task of machine translation.
Autoregressive language models create a categorical distribution over the subword vocabulary at every subword position and represent the log-probability of a subword sequence using the chain rule:

log p(y, z) = Σ_{i=1}^{|z|-1} log p(y_{z_i, z_{i+1}} | y_{z_1, z_2}, ..., y_{z_{i-1}, z_i}),   (1)

where z = (z_1, ..., z_{|z|}) denotes the segmentation boundaries, with z_1 = 0 and z_{|z|} = T, and y_{a,b} denotes the subword spanning characters y_{a+1}, ..., y_b. Note that we suppress the dependence of p on x to reduce notational clutter. Most neural machine translation approaches assume that z is a deterministic function of y and implicitly assume that log p(y, z) ≈ log p(y).
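Representing a segmentation by its boundary indices, the chain rule above is just a sum of per-subword log-probabilities. A minimal sketch in Python, where `logp` is a hypothetical stand-in for a trained model's subword log-probability given the preceding characters:

```python
def seg_logprob(y, z, logp):
    """Log-probability of one segmentation of y via the chain rule of Eq. (1).

    y    : target string (sequence of characters)
    z    : boundary indices with z[0] = 0 and z[-1] = len(y);
           the i-th subword is y[z[i]:z[i+1]]
    logp : toy stand-in for the model, logp(subword, preceding_characters)
    """
    return sum(logp(y[z[i]:z[i + 1]], y[:z[i]]) for i in range(len(z) - 1))
```

Different boundary lists z for the same y give different values of log p(y, z), which is exactly why the segmentation matters.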
We consider a subword segmentation z as a latent variable and let each value of z ∈ Z_y, the set of segmentations compatible with y, contribute its share to p(y) according to

p(y) = Σ_{z∈Z_y} p(y, z).   (2)

Note that each particular subword segmentation z ∈ Z_y provides a lower bound on the log marginal likelihood, log p(y) ≥ log p(y, z). Hence, optimizing (1) for a greedily selected segmentation can be justified as optimizing a lower bound on (2). That said, optimizing (2) directly is more desirable. Unfortunately, exact marginalization over all segmentations is computationally prohibitive in the combinatorially large space Z_y, especially because the probability of each subword depends on the segmentation of its conditioning context. In the next section, we discuss a sequence model in which the segmentation of the conditioning context does not influence the probability of the next subword, and describe an efficient dynamic programming algorithm that exactly marginalizes out all possible subword segmentations in this model.

A Mixed Character-Subword Transformer
We propose a mixed character-subword transformer architecture, which enables one to marginalize out latent subword segmentations exactly using dynamic programming (see Figure 2). Our key insight is to let the transformer process the inputs and the conditioning context based on characters, so that it remains oblivious to the specific choice of subword segmentation in the conditioning context, enabling exact marginalization. That said, the output of the transformer is based on subword units, and at every position it creates a categorical distribution over the subword vocabulary. More precisely, when generating a subword y_{z_i, z_{i+1}}, the model processes the conditioning context based solely on characters:

p(y_{z_i, z_{i+1}} | y_{z_1, z_2}, ..., y_{z_{i-1}, z_i}) = p(y_{z_i, z_{i+1}} | y_1, ..., y_{z_i}),   (3)

where the dependence of p on x is suppressed to reduce notational clutter.
Given a fixed subword vocabulary V, at every character position t within y, the mixed character-subword model induces a distribution over the next subword w ∈ V according to

p(w | y_1, ..., y_t) = exp(f(y_1, ..., y_t) · e(w)) / Σ_{w'∈V} exp(f(y_1, ..., y_t) · e(w')),   (4)

where f(·) processes the conditioning context using a Transformer, and e(·) represents the weights of the softmax layer.
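The softmax above can be sketched numerically as follows; here `f_t` stands in for the transformer's context features f(y_1, ..., y_t) and `embeddings` for the softmax weights e(·), both hypothetical toy values rather than the paper's actual model:

```python
import math

def subword_distribution(f_t, embeddings):
    """p(w | y_1..y_t): softmax over the subword vocabulary of the dot
    product between context features f(y_1..y_t) and weights e(w)."""
    scores = {w: sum(a * b for a, b in zip(f_t, e))
              for w, e in embeddings.items()}
    hi = max(scores.values())  # subtract the max for numerical stability
    Z = sum(math.exp(s - hi) for s in scores.values())
    return {w: math.exp(s - hi) / Z for w, s in scores.items()}
```

The distribution is over subwords, while the conditioning context is a plain character prefix, which is the property that makes exact marginalization tractable.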

Algorithm 1 Dynamic Programming (DP) for Exact Marginalization
Input: y is a sequence of T characters, V is a subword vocabulary, m is the maximum subword length
Output: log p(y) = log Σ_{z∈Z_y} p(y, z), the marginal probability marginalizing out different subword segmentations

1: α_0 ← 0
2: for k = 1 to T do
3:    α_k ← log Σ_{j=max(0, k−m)}^{k−1} exp(α_j + log p(y_{j,k} | y_1, ..., y_j)), where y_{j,k} denotes the subword spanning characters y_{j+1}, ..., y_k, and terms with y_{j,k} ∉ V are omitted
4: end for
5: return α_T

Figure 2: An illustration of the mixed character-subword Transformer. The input is a list of characters, whereas the output is a sequence of subwords.
As depicted in Figure 2, the mixed character-subword Transformer consumes characters as input and generates subwords as output. The figure only shows the decoder architecture, since the encoder that processes x is a standard subword Transformer. Once a subword w is emitted at time step t, the characters of w are fed into the decoder at time steps t + 1 to t + |w|, and the next subword is generated at time step t + |w|, conditioned on all of the previously generated characters.
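This emit-a-subword, feed-back-its-characters loop can be sketched as follows; `next_subword` is a hypothetical stand-in for picking a subword from the model's distribution given the character context:

```python
def mixed_decode(next_subword, max_chars):
    """Sketch of the mixed character-subword decoding loop of Figure 2:
    the model emits a subword, then that subword's characters are fed
    back into the decoder one per time step before the next emission.

    next_subword : function(list_of_context_chars) -> subword string,
                   returning "" to stop (toy stand-in for the model)
    max_chars    : safety bound on the output length in characters
    """
    chars = []
    while len(chars) < max_chars:
        w = next_subword(chars)
        if not w:
            break
        chars.extend(w)  # feed the |w| characters of w back as context
    return "".join(chars)
```

The key point is that the context passed to `next_subword` is a flat character sequence, with no record of how it was segmented.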

Optimization
The training objective for our latent segmentation translation model is Σ_{(x,y)∈D} log p_θ(y | x), where D is the training corpus consisting of parallel bilingual sentence pairs. Maximizing this objective requires marginalization over segmentations and the computation of the gradient of the log marginal likelihood.
Exact Marginalization. Under our model, the probability of a subword depends only on the character-based encoding of the conditioning context, not on its segmentation, as in (3). This means that we can compute the log marginal likelihood for a single example y exactly, using the dynamic programming procedure shown in Algorithm 1. The core of the algorithm is line 3, where the probability of the prefix string (y_1, ..., y_k) is computed by summing terms corresponding to different segmentations. Each term is the product of the probability of a subword y_{j,k} and the probability of its conditioning context (y_1, ..., y_j). The running time of the algorithm is O(mT), where T is the length of the string and m is the length of the longest subword unit in the vocabulary.
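Algorithm 1 can be sketched in a few lines of Python. Here `logp` is a toy stand-in for the mixed character-subword transformer's subword log-probability given a character prefix; the log-sum-exp is written out by hand for self-containedness:

```python
import math

def log_marginal(y, vocab, logp, m):
    """Algorithm 1: alpha[k] = logsumexp_j (alpha[j] + logp(y[j:k] | y[:j]))
    over j in [max(0, k-m), k-1] with y[j:k] in the vocabulary.

    y     : target string (sequence of characters)
    vocab : set of subword strings
    logp  : toy stand-in for the model, logp(subword, preceding_characters)
    m     : maximum subword length
    Returns log p(y), marginalizing over all segmentations.
    """
    T = len(y)
    alpha = [0.0] + [-math.inf] * T
    for k in range(1, T + 1):
        terms = []
        for j in range(max(0, k - m), k):
            w = y[j:k]
            if w in vocab and alpha[j] > -math.inf:
                terms.append(alpha[j] + logp(w, y[:j]))
        if terms:  # stable log-sum-exp
            hi = max(terms)
            alpha[k] = hi + math.log(sum(math.exp(t - hi) for t in terms))
    return alpha[T]
```

Because each position k only looks back at most m characters, the whole table is filled with O(mT) model evaluations, matching the stated running time.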
Gradient Computation. We use automatic differentiation in PyTorch to backpropagate through the dynamic program in Algorithm 1 and compute its gradient. Compared to a standard Transformer decoder, our mixed character-subword Transformer is about 8x slower, with a larger memory footprint, due to the computation involved in the DP algorithm and the longer sequence length in characters. To address these issues, we reduce the number of Transformer layers from 6 to 4 and accumulate the gradients of 16 consecutive mini-batches before each update.

Segmenting Target Sentences
Once the mixed character-subword transformer is trained, it is used to segment the target side of a bilingual corpus. We randomize the subword segmentation of source sentences using BPE dropout (Provilkov et al., 2019). Conditioned on the source sentence, we use Algorithm 2, called Dynamic Programming Encoding (DPE), to find a segmentation of the target sentence with the highest posterior probability. This algorithm is similar to the marginalization algorithm, except that we use a max operation instead of log-sum-exp. The mixed character-subword transformer is used only for tokenization; a standard subword transformer is then trained on the segmented sentences. For inference using beam search, the mixed character-subword transformer is not needed.

Algorithm 2 Dynamic Programming Encoding (DPE) for Subword Segmentation
Input: y is a sequence of T characters, V is a subword vocabulary, m is the maximum subword length
Output: the segmentation z of y with the highest posterior probability

1: β_0 ← 0
2: for k = 1 to T do
3:    β_k ← max over j ∈ [max(0, k−m), k−1] with y_{j,k} ∈ V of β_j + log p(y_{j,k} | y_1, ..., y_j)
4:    b_k ← the maximizing j
5: end for
6: return the segmentation recovered by backtracking the b_k pointers from position T
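A minimal sketch of DPE in Python, replacing Algorithm 1's log-sum-exp with a max plus backpointers; as before, `logp` is a toy stand-in for the trained mixed character-subword transformer:

```python
import math

def dpe_segment(y, vocab, logp, m):
    """Viterbi-style DPE: find the MAP segmentation of y.

    y     : target string (sequence of characters)
    vocab : set of subword strings
    logp  : toy stand-in for the model, logp(subword, preceding_characters)
    m     : maximum subword length
    Returns the highest-scoring segmentation as a list of subwords.
    """
    T = len(y)
    beta = [0.0] + [-math.inf] * T
    back = [0] * (T + 1)
    for k in range(1, T + 1):
        for j in range(max(0, k - m), k):
            w = y[j:k]
            if w in vocab:
                score = beta[j] + logp(w, y[:j])
                if score > beta[k]:
                    beta[k], back[k] = score, j
    # backtrack from position T to recover the subword boundaries
    segs, k = [], T
    while k > 0:
        segs.append(y[back[k]:k])
        k = back[k]
    return segs[::-1]
```

With a toy `logp` that prefers "un" over "uncon", `dpe_segment("unconscious", ...)` recovers the segmentation "un"+"conscious" rather than "uncon"+"scious", mirroring the example from the introduction.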

Experiments
Datasets. We use WMT09 for En-Hu, WMT14 for En-De, WMT15 for En-Fi, WMT16 for En-Ro, and WMT18 for En-Et. We use the Moses toolkit (https://github.com/moses-smt/mosesdecoder) to pre-process all corpora, and preserve the true case of the text. Unlike Lee et al. (2018), we keep diacritics for En-Ro to preserve its morphological richness. We use all of the sentence pairs where the length of either side is less than 80 tokens for training. Byte pair encoding (BPE) (Sennrich et al., 2016) is applied to all language pairs to construct a subword vocabulary and provide a baseline segmentation algorithm. The statistics of all corpora are summarized in Table 1.
Training with BPE Dropout. We apply BPE dropout (Provilkov et al., 2019) to each mini-batch. For each complete word, during the BPE merge operations, we randomly drop a particular merge with probability 0.05; this value worked best in our experiments. As a result, a word can be split into different segmentations during training, which helps improve over the BPE baseline.
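The mechanism can be sketched as follows. This is a toy re-implementation of the idea (not the authors' or Provilkov et al.'s code): apply BPE merges in priority order, but skip each candidate merge occurrence with probability p:

```python
import random

def bpe_dropout_segment(word, merges, p=0.05, rng=random):
    """Toy sketch of BPE dropout on a single word.

    word   : the word to segment
    merges : list of (left, right) piece pairs in BPE priority order
    p      : probability of dropping any candidate merge occurrence
    Returns a list of subword pieces; with p=0 this is plain greedy BPE.
    """
    pieces = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # find the highest-priority adjacent pair, randomly dropping some
        best, best_i = None, None
        for i in range(len(pieces) - 1):
            pair = (pieces[i], pieces[i + 1])
            if pair in rank and rng.random() >= p:
                if best is None or rank[pair] < rank[best]:
                    best, best_i = pair, i
        if best is None:
            return pieces
        pieces[best_i:best_i + 2] = [pieces[best_i] + pieces[best_i + 1]]
```

Setting p=0 recovers deterministic BPE, while p=1 falls back to pure characters; small p yields mostly-standard segmentations with occasional alternative splits of the same word.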
DPE Segmentation. DPE can be used for target sentences, but its use for source sentences is not justified, as source segmentations should not be marginalized out. Accordingly, we use BPE dropout for segmenting source sentences. That is, we train a mixed character-subword transformer to marginalize out the latent segmentations of a target sentence, given a randomized segmentation of the source sentence produced by BPE dropout. After the mixed character-subword transformer is trained, it is used to segment the target sentences as described in Section 4.2 for tokenization.
As summarized in Figure 3, we first train a mixed character-subword transformer with dynamic programming. Then, this model is frozen and used for DPE segmentation of the target sentences. Finally, a standard subword transformer is trained on source sentences segmented by BPE dropout and target sentences segmented by DPE. The mixed character-subword transformer is not needed for translation inference.
Transformer Architectures. We use Transformer models to train three translation models on the BPE, BPE dropout, and DPE corpora, using the Transformer-base configuration for all of the experiments.

Table 2: Comparison of segmentation algorithms (BPE (Sennrich et al., 2016), BPE dropout (Provilkov et al., 2019), and our DPE algorithm) on 10 different WMT datasets. For each language pair, all of the segmentation techniques use the same subword dictionary with 32K tokens shared between the source and target languages. ∆1 shows the improvement of BPE dropout compared to BPE, and ∆2 shows the further improvement of our proposed DPE method compared to BPE dropout.

Main Results
Table 3 (examples):
BPE source: Die G@@ le@@ is@@ anlage war so ausgestattet , dass dort elektr@@ isch betrie@@ bene Wagen eingesetzt werden konnten .
DPE target: The railway system was equipped in such a way that electrical@@ ly powered cart@@ s could be used on it .
BPE target: The railway system was equipped in such a way that elect@@ r@@ ically powered car@@ ts could be used on it .
BPE source: Normalerweise wird Kok@@ ain in kleineren Mengen und nicht durch Tunnel geschm@@ ug@@ gelt .
DPE target: Normal@@ ly c@@ oca@@ ine is sm@@ ugg@@ led in smaller quantities and not through tunnel@@ s .
BPE target: Norm@@ ally co@@ c@@ aine is sm@@ ugg@@ led in smaller quantities and not through tun@@ nels .

Table 2 summarizes the main results. First, we observe that BPE dropout outperforms BPE. This gain can be attributed to the robustness of the NMT model to segmentation errors on the source side, as our analysis in Section 5.3 will confirm. Second, we observe further gains from DPE compared to BPE dropout. The column labeled ∆2 shows the improvement of DPE over BPE dropout: DPE provides an average improvement of 0.55 BLEU over BPE dropout, while BPE dropout provides an average improvement of 0.35 BLEU over BPE. As our proposal uses BPE dropout for segmenting the source, we attribute our BLEU score improvements to a better segmentation of the target language with DPE. Finally, compared to using BPE for segmenting both the source and target, our proposed segmentation method results in large improvements in translation quality, up to 1.49 BLEU in Et→En. Table 3 shows examples of target sentences segmented using DPE and BPE, along with the corresponding source sentences. In addition, Table 4 presents the top 50 most common English words that result in a disagreement between the BPE and DPE segmentations of the Et→En corpus. For DPE, for each word, we consider all segmentations produced and report the segmentation with the highest frequency of usage in Table 4. As can be observed, DPE produces more linguistically plausible morpheme-based subwords than BPE. For instance, BPE segments "carts" into "car"+"ts", as both "car" and "ts" are common subwords listed in the BPE merge table. By contrast, DPE segments "carts" into "cart"+"s".
We attribute the linguistic characteristics of the DPE segments to the fact that DPE conditions the segmentation of a target word on the source sentence and the previous tokens of the target sentence, whereas BPE mainly makes use of the frequency of subwords, without any context. DPE generally identifies and leverages linguistic properties, e.g., plurals, antonyms, normalization, and verb tenses, while BPE tends to deliver less linguistically plausible segmentations, possibly due to its greedy nature and lack of context. We believe this phenomenon needs further investigation, i.e., the contribution of source vs. target context to DPE segmentations, and a quantitative evaluation of the linguistic nature of the word fragments produced by DPE. We leave this to future work.

Analysis
Conditional Subword Segmentation. One of our hypotheses for the effectiveness of subword segmentation with DPE is that it conditions the segmentation of the target on the source language. To verify this hypothesis, we train a mixed character-subword Transformer solely on the target language sentences of the bilingual training corpus, using a language model training objective. This is in contrast to the mixed character-subword model used for the DPE segmentation of the main results in Table 2, where the model is conditioned on the source language and trained on the sentence pairs with a conditional language model training objective. Once the mixed character-subword Transformer language model is trained, it is used to segment the target sentences of the bilingual corpus in the pre-processing step before a translation model is trained. Table 5 shows the results. It compares the unconditional language model (LM) DPE vs. the conditional DPE for segmenting the target language, where we use BPE dropout for segmenting the source language. We observe that without information from the source, LM DPE is on par with BPE, and is significantly outperformed by conditional DPE. This observation confirms our hypothesis that segmentation in NMT should be source-dependent.
We are further interested in analyzing the differences in target language segmentation depending on the source language.

Another aspect of the DPE segmentation method is its dependency on the segmentation of the source. As mentioned, we segment the target sentence on the fly using our mixed character-subword model, given a randomized segmentation of the source produced by BPE dropout. This means that during the training of the NMT model, where we use BPE dropout for the source sentence, the corresponding target sentence may receive a different DPE segmentation depending on the randomized segmentation of the source sentence. We are interested in the effectiveness of the target segmentation if we instead commit to a fixed DPE segmentation conditioned on the BPE segmentation of the input. Table 6 shows the results. We observe a marginal drop when using the fixed DPE, which indicates that the encoder benefits from a stochastic segmentation, while the decoder prefers a deterministic segmentation corresponding to the segmentation of the source.

DPE vs BPE.
We are interested in comparing the effectiveness of DPE versus BPE for the target, given BPE dropout as the segmentation method for the source. Table 7 shows the results. As observed, target segmentation with DPE consistently outperforms BPE, yielding up to 0.9 BLEU score improvement. We further note that using BPE dropout on the target performs similarly to BPE, and is consistently outperformed by DPE.
We further analyze the segmentations produced by DPE vs. BPE. Figure 5 shows the percentage of target words that are segmented differently by BPE and DPE, for different word frequency bands, in the En→Et translation task. We observe that for Estonian words that occur up to 5 times in the training set, the disagreement rate between DPE and BPE is 64%. The disagreement rate decreases for words in higher frequency bands. This may imply that the relatively large BLEU score difference between BPE and DPE is mainly due to their different segmentation of low-frequency words.
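The disagreement-rate measurement behind Figure 5 can be sketched as follows; the function names and data layout here are hypothetical, chosen only to illustrate the computation:

```python
def disagreement_by_band(freqs, seg_a, seg_b, bands):
    """Fraction of word types whose two segmentations differ, per
    training-set frequency band (sketch of the Figure 5 analysis).

    freqs        : dict word -> frequency in the training set
    seg_a, seg_b : dict word -> segmentation (tuple of subwords)
    bands        : list of (lo, hi) inclusive frequency ranges
    Returns dict (lo, hi) -> disagreement rate in [0, 1].
    """
    out = {}
    for lo, hi in bands:
        words = [w for w, c in freqs.items() if lo <= c <= hi]
        if not words:
            out[(lo, hi)] = 0.0
            continue
        diff = sum(seg_a[w] != seg_b[w] for w in words)
        out[(lo, hi)] = diff / len(words)
    return out
```

Comparing word types rather than tokens keeps the high-frequency bands from dominating the statistic.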
We further plot the distribution of BLEU scores by the length of the target sentences. As shown in Figure 6, DPE demonstrates much larger gains on longer sentences, compared with BPE.

Conclusion
This paper introduces Dynamic Programming Encoding in order to incorporate information from the source language into the subword segmentation of the target language. Our approach utilizes dynamic programming for marginalizing out the latent segmentations during training, and for inferring the highest-probability segmentation when tokenizing. Our comprehensive experiments show impressive improvements compared to state-of-the-art segmentation methods in NMT, i.e., BPE and its stochastic variant, BPE dropout.