Lexically Constrained Neural Machine Translation with Levenshtein Transformer

This paper proposes a simple and effective algorithm for incorporating lexical constraints into neural machine translation. Previous work either required re-training existing models with the lexical constraints or incorporated them during beam search decoding with significantly higher computational overhead. Leveraging the flexibility and speed of the recently proposed Levenshtein Transformer model (Gu et al., 2019), our method injects terminology constraints at inference time without any impact on decoding speed. Our method requires no modification to the training procedure and can be easily applied at runtime with custom dictionaries. Experiments on English-German WMT datasets show that our approach improves upon an unconstrained baseline and previous approaches.


Introduction
Neural machine translation (NMT) systems can generate higher-quality translations than phrase-based MT systems, but they come at the cost of losing control over how translations are generated. Without an explicit link between the source and target vocabulary, enforcing specific terminological translations in domain-specific settings becomes painfully difficult for NMT systems. Consider a Chinese-English NMT system trained for the e-commerce domain: if there is no prior knowledge of the brand name "红米" in the training data, the system would translate the term literally as "red (红) rice (米)" instead of "Redmi". In such scenarios, machine translation users often maintain in-domain dictionaries to ensure that specific information is translated accurately and consistently.
A line of previous work that tried to address this problem required re-training the NMT models with lexical constraints, either through a placeholder mechanism (Crego et al., 2016) or via code-mixed training (Song et al., 2019; Dinu et al., 2019). However, these methods do not reliably guarantee the presence of the constraints at test time. Another line of work focused on constrained beam search decoding (Hokamp and Liu, 2017; Post and Vilar, 2018; Hu et al., 2019). Although the latter approach offers tighter control over the target constraint terms, it significantly slows down decoding.
Different from these existing lines of work, we enforce lexical constraints using a non-autoregressive approach. To do this, we use the Levenshtein Transformer (LevT) (Gu et al., 2019), an edit-based generation model that iteratively performs deletion and insertion operations during inference. LevT achieves substantially higher inference speed than beam search without affecting quality. We add a constraint insertion step to LevT decoding to seamlessly decode the target language sequence while adhering to specific lexical constraints, achieving the same speed as standard LevT decoding.

Related Work
Previous approaches integrated lexical constraints in NMT via either constrained training or constrained decoding. Crego et al. (2016) replaced entities with placeholders that remained unchanged during translation and were placed back in a post-processing step. Song et al. (2019) trained a Transformer (Vaswani et al., 2017) model by augmenting the data to include the constraint target phrases in the source sentence. Dinu et al. (2019) proposed a similar idea and additionally used factored training. Other approaches enforce lexical constraints during inference with various improvements to constraint-aware beam search, such as grid beam search (Hokamp and Liu, 2017), dynamic beam allocation (Post and Vilar, 2018), and its optimized vectorized version (Hu et al., 2019). Hasler et al. (2018) built finite-state acceptors to integrate constraints into a multi-stack decoder. These lexically constrained decoding approaches rely on autoregressive inference that generates one target token at a time, which makes it difficult to parallelize the decoder and monotonically increases decoding time. While mostly effective at forcing the inclusion of pre-specified terms in the output, these approaches further slow down the beam search process: Post and Vilar (2018) reported a 3× slowdown compared to standard beam search.
Non-autoregressive neural machine translation (NAT) (Gu et al., 2018) attempts to move away from conventional autoregressive decoding. This direction enables parallelization during sequence generation, resulting in lower inference latency. Recent NAT approaches treat inference as an iterative refinement process, first proposed by Lee et al. (2018). Following this direction, it is intuitive to perform decoding using "edit" operations, such as insertion (Stern et al., 2019) or both insertion and deletion (LevT; Gu et al., 2019). The LevT model has been shown to outperform existing refinement-based models such as Ghazvininejad et al. (2019), and it performs comparably to autoregressive Transformer models. Our method integrates lexical constraints into NAT decoding, utilizing the flexibility, speed, and performance of LevT.

Levenshtein Transformer
The Levenshtein Transformer (LevT) (Gu et al., 2019) has an encoder-decoder framework based on the Transformer architecture (Vaswani et al., 2017) with multi-headed self-attention and feed-forward networks. Unlike token generation in a typical Transformer model, the LevT decoder models a Markov Decision Process (MDP) that iteratively refines the generated tokens by alternating between insertion and deletion operations. After encoding the source input through a Transformer encoder block, the LevT decoder follows the MDP formulation for each sequence at the k-th iteration, y^k = (y_1, y_2, ..., y_n), where y_1 and y_n are the start (<s>) and end (</s>) symbols. The decoder then generates y^{k+1} by performing deletion and insertion operations via three classifiers that run sequentially:

1. Deletion Classifier, which predicts for each token position whether the token should be "kept" or "deleted",
2. Placeholder Classifier, which predicts the number of tokens to be inserted between every two consecutive tokens and then inserts the corresponding number of placeholder [PLH] tokens,
3. Token Classifier, which predicts an actual target token for each [PLH] token.
Each prediction is conditioned on the source text and the current target text. The same Transformer decoder block is shared among the three classifiers. Decoding stops when the current target text no longer changes, or when a maximum number of refinement iterations has been reached. The LevT model is trained using sequence-level knowledge distillation (Kim and Rush, 2016) from a Transformer teacher whose beam search output is used as the ground truth during training. We refer the reader to Gu et al. (2019) for a detailed description of the LevT model and training routine.
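As a rough illustration of this refinement loop, the sketch below emulates the three-step decoding cycle with hypothetical rule-based stand-ins for the trained classifiers; in the real model, all three decisions are predicted by shared Transformer decoder layers conditioned on the source.

```python
def delete_tokens(seq, keep_fn):
    # Deletion step: sentence boundaries are always kept; other tokens
    # survive only if the (stand-in) deletion policy keeps them.
    return [t for t in seq if t in ("<s>", "</s>") or keep_fn(t)]

def insert_placeholders(seq, count_fn):
    # Placeholder step: between each pair of consecutive tokens, insert
    # the predicted number of [PLH] placeholder tokens.
    out = [seq[0]]
    for i in range(1, len(seq)):
        out.extend(["[PLH]"] * count_fn(seq[i - 1], seq[i]))
        out.append(seq[i])
    return out

def fill_placeholders(seq, token_fn):
    # Token step: replace every [PLH] with a predicted target token.
    return [token_fn(i) if t == "[PLH]" else t for i, t in enumerate(seq)]

def levt_decode(y0, keep_fn, count_fn, token_fn, max_iter=10):
    # Iterate deletion -> placeholder -> token until the sequence stops
    # changing or the refinement budget is exhausted.
    y = y0
    for _ in range(max_iter):
        y_next = fill_placeholders(
            insert_placeholders(delete_tokens(y, keep_fn), count_fn),
            token_fn)
        if y_next == y:
            break
        y = y_next
    return y
```

With trivial stand-in policies, starting from y0 = [<s>, </s>] the loop converges in two iterations once the sequence stops changing.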

Incorporating Lexical Constraints
For sequence generation, the LevT decoder typically starts the first iteration of the decoding process with only the sentence boundary tokens, y^0 = <s> </s>. To incorporate lexical constraints, we populate the y^0 sequence with the target constraints before the first deletion operation, as shown in Figure 1. The initial target sequence passes through the deletion, placeholder, and token classifiers sequentially, and the modified sequence is refined over several iterations. The decoding steps are explained in detail below.
Constraint Insertion More formally, given a list of m target constraints C_1, C_2, ..., C_m, where each constraint C_i is possibly a multi-token phrase C_i = w^i_1, w^i_2, ..., w^i_{|C_i|}, we insert the constraints into the decoding sequence before the deletion operation to form y^0 = <s> C_1 C_2 ... C_m </s>.

Deletion Operation Next, y^0 passes through the deletion classifier to decide which w^i_j tokens to remove. If the deletion operation is allowed on the constraint tokens, the presence of each constraint in the final output is not guaranteed, especially when the supplied constraints are out of context for the decoder. To mitigate this problem, we optionally disallow deletion of the constraint tokens by introducing a constraint mask that indicates the positions of constraint tokens in the sequence. We forcefully set the deletion classifier's prediction to "keep" for all positions in this mask. The positions in the mask are re-computed accordingly after each deletion and insertion operation.
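A minimal sketch of these two steps, with illustrative names; in the actual model the deletion decisions come from the trained classifier, and here they are passed in as a boolean list:

```python
def insert_constraints(constraints):
    # Build the initial target sequence y0 = <s> C1 ... Cm </s>,
    # recording which positions hold constraint tokens.
    seq, mask = ["<s>"], [False]
    for phrase in constraints:          # each constraint may be multi-token
        for tok in phrase:
            seq.append(tok)
            mask.append(True)           # constraint position: protected
    seq.append("</s>")
    mask.append(False)
    return seq, mask

def apply_deletion(seq, mask, delete_pred):
    # Override the deletion prediction to "keep" on every constraint
    # position, then apply the (possibly overridden) decisions.
    kept = [(t, m) for t, m, d in zip(seq, mask, delete_pred) if m or not d]
    return [t for t, _ in kept], [m for _, m in kept]
```

Even if the classifier votes to delete every constraint token, the mask guarantees they survive the deletion pass.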
Insertion Operation Finally, y^0 passes through the placeholder classifier, which predicts the number of tokens to be inserted and generates the corresponding number of [PLH] tokens, and then through the token classifier, which assigns an actual target token to every [PLH] token. Each constraint may contain multiple tokens, and [PLH] tokens may be inserted between tokens belonging to the same constraint. To prevent this from happening and keep each constraint intact, we optionally prohibit inserting [PLH] within a multi-token constraint by constraining the number of such placeholders to 0.
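This restriction can be sketched as zeroing out the placeholder classifier's predicted counts between consecutive tokens of the same constraint; the constraint_id array below is a hypothetical bookkeeping structure marking which constraint, if any, each position belongs to.

```python
def clamp_placeholders(counts, constraint_id):
    # counts[i]: predicted number of [PLH] tokens to insert between
    # seq[i] and seq[i+1]. constraint_id[i]: index of the constraint
    # that seq[i] belongs to, or None for ordinary tokens.
    # Force the count to 0 inside any multi-token constraint.
    return [0 if constraint_id[i] is not None
            and constraint_id[i] == constraint_id[i + 1] else c
            for i, c in enumerate(counts)]
```

For a sequence <s> Pilot@@ projekt </s> with both middle tokens from constraint 0, only the slot between the two constraint tokens is clamped; insertions before and after the constraint remain allowed.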
In Figure 1, our constraint insertion is executed in the first pass, and subsequent iterations start from deletion (indicated by the loop in the figure). We note that this step happens only at inference; during training, the original LevT training routine is carried out without constraint insertion.

Experiments
We extend the FAIRSEQ (Ott et al., 2019) implementation of LevT. The LevT model is trained using the knowledge distillation routine on the Transformer base outputs released by Gu et al. (2019). We leave further experimental details to the Appendix.

Data and evaluation settings
We evaluate our approach on the WMT'14 English-German (En-De) news translation task (Bojar et al., 2014), with En-De bilingual dictionary entries extracted from Wiktionary following Dinu et al. (2019), i.e., by matching the source and target phrases of the dictionary entries in the source and target sentences, respectively.
We also evaluate our approach on two En-De test sets released by Dinu et al. (2019) to compare against previous work on applying lexical constraints in NMT (Post and Vilar, 2018; Dinu et al., 2019). The two test sets are subsets of the WMT'17 En-De test set (Bojar et al., 2017), extracted using Wiktionary and the Interactive Terminology for Europe (IATE) terminology database, respectively. Both the WMT'14 and WMT'17 En-De datasets are tokenized using the Moses tokenization scripts and segmented into sub-word units using byte-pair encoding (Sennrich et al., 2016).

Results
We evaluate the systems using BLEU scores (Papineni et al., 2002) and the term usage rate (Term%), defined as the number of constraints generated in the output divided by the total number of given constraints.
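For concreteness, Term% can be computed roughly as follows. This is a sketch assuming tokenized outputs and phrase-level exact matching; the paper's exact matching procedure may differ in details such as sub-word handling.

```python
def term_usage_rate(outputs, constraints):
    # outputs: list of translations, each a list of tokens.
    # constraints: parallel list; constraints[i] is a list of constraint
    # phrases (each a token list) supplied for sentence i.
    def contains(tokens, phrase):
        # Exact contiguous-subsequence match of the phrase in the output.
        n = len(phrase)
        return any(tokens[i:i + n] == phrase
                   for i in range(len(tokens) - n + 1))

    total = sum(len(c) for c in constraints)
    found = sum(contains(toks, phrase)
                for toks, cons in zip(outputs, constraints)
                for phrase in cons)
    return 100.0 * found / total if total else 100.0
```

For example, an output containing one of two supplied constraint phrases scores a term usage rate of 50%.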
Table 1 shows the results of (i) the baseline LevT model, (ii) adding the constraint insertion operation (+ Constr. Ins.), (iii) additionally disallowing deletion of the constraint tokens (+ No Del.), and (iv) further prohibiting insertion within multi-token constraints (+ No Ins.).
We report results on both the filtered test set of sentence pairs that contain at least one target constraint ("Constr.", 454 sentences) and the full test set ("Full", 3,003 sentences). The constraint insertion operation increases the term usage rate from about 80% to over 94%, and further disallowing deletion of the constraints achieves above 99% term usage. Prohibiting insertion between each constraint's tokens guarantees 100% term usage. For sentences with lexical constraints, we observe a statistically significant improvement of 0.6 BLEU (p < 0.05) based on bootstrap resampling (Koehn, 2004). On the full test set, BLEU improves by 0.1; this small margin is because only 1% of the reference tokens are constraint tokens. Unlike previous work that sacrificed decoding speed to enforce lexical constraints (e.g., Hasler et al., 2018; Post and Vilar, 2018), there is no significant difference in the number of sentences decoded per second between the unconstrained and the lexically constrained LevT models.
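The significance test above can be sketched as paired bootstrap resampling in the style of Koehn (2004). For simplicity, this version sums a decomposable per-sentence score; BLEU proper is a corpus-level metric and must be recomputed on each resampled set rather than summed.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # scores_a / scores_b: per-sentence metric values for two systems on
    # the same test set. Returns the fraction of resamples in which
    # system A does NOT outperform system B (an approximate p-value for
    # the claim "A is better than B").
    rng = random.Random(seed)
    n, losses = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if sum(scores_a[i] for i in idx) <= sum(scores_b[i] for i in idx):
            losses += 1
    return losses / n_samples
```

A small returned value (e.g. below 0.05) indicates that system A's advantage is unlikely to be an artifact of the particular test set sample.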
Table 3 presents a comparison to two previous approaches: constrained decoding with dynamic beam allocation (Post and Vilar, 2018) and data augmentation that replaces the source terms with target constraints during training (Dinu et al., 2019). We refer to them as POST18 and DINU19, respectively, in Table 3. We evaluate each approach on the WMT'17 En-De test set with constraint terms from the Wiktionary and IATE dictionaries. Note that our baseline LevT model, with Transformer blocks of 6 layers, is stronger than that of Dinu et al. (2019), who used a 2-layer configuration. Despite having a stronger baseline, we obtain larger absolute BLEU score improvements (0.96 and 1.16 BLEU on Wiktionary and IATE, respectively) and achieve 100% term usage. We report additional experiments on the WMT'16 Romanian-English news translation task (Bojar et al., 2016) in the Appendix.

Analysis
To analyze whether our approach inserts the constraints at the correct positions, we compare it to a baseline that randomly inserts the constraint terms into the output of our baseline LevT model. Note that we only insert those constraints that are not already present in the output. Although this results in 100% term usage, the BLEU score drops from 29.9 to 29.3 on the "Constr." WMT'14 test set, whereas our approach improves the BLEU score. The LevT model with our proposed constraint insertion thus appears to inherently place the constraints at the correct positions in the target sentence.
Although prohibiting constraint deletion improves term usage in the final translation and achieves higher BLEU scores, it limits the possibility of reordering when there is more than one constraint during inference. For the English-German test sets we evaluated on, 97-99% of the target constraints appear in the same order as the corresponding source terms. This issue may become more apparent in language pairs with greater syntactic divergence between the source and target languages. In practice, most entries in terminology databases (Wiktionary, IATE, etc.) are nominal, so the reordering of lexical constraints boils down to whether the source and target languages share the same argument-predicate order. We will explore strategies to reorder constraints dynamically in future work.

Conclusion
We proposed a non-autoregressive decoding approach to integrate lexical constraints into NMT. Our constraint insertion step is simple, and we have empirically validated its effectiveness. The approach demonstrates control over constraint terms in the target translation while decoding as fast as the baseline Levenshtein Transformer model, which achieves significantly higher decoding speed than traditional beam search. Beyond the terminological lexical constraints discussed in this work, future work can potentially modify the insertion or selection operations to handle target translations of multiple forms; this could help disambiguate the morphological variants of the lexical constraints.

A Datasets
We train on 3,961,179 distilled sentence pairs released by Gu et al. (2019) and evaluate on the WMT'14 En-De test set (3,003 sentences). The dictionary used in this work is created by sampling 10% of the En-De translation entries from Wiktionary, resulting in 10,522 entries. After applying this dictionary to generate constraints for the test set, we obtain 454 sentences that contain at least one constraint. The average number of constraints per sentence is 1.15, and the number of unique source constraints is 220. We use an English frequency list to filter out the 500 most frequent words. We use the WMT'17 En-De test sets released by Dinu et al. (2019), which were created from Wiktionary and IATE term entries exactly matching the source and target. They contain 727 and 414 sentences, respectively.

C Additional Experiments
We train a LevT model on 599,907 training sentence pairs from the WMT'16 Romanian-English (Ro-En) news translation task (Bojar et al., 2016) using the knowledge distillation routine based on Transformer base outputs, and evaluate on 1,999 test sentences. Similar to En-De, we create a dictionary

Figure 1 :
Figure 1: Levenshtein Transformer decoding with lexical constraints for English-German MT. The source sentence is "Nevada has completed a pilot project." and the target constraints are [Nevada, Pilot@@ projekt]. Encoder and attention components are not shown.
The model size is d_model = 512 and the feed-forward layer size is d_ff = 2048; the source and target embeddings share the same vocabulary.

Table 2 :
Example translations from LevT with constraint insertion, enforcing the translation of charge→berechnen. When deletion is allowed (+ Constr. Ins.), the imposed constraint (berechnen) gets deleted during decoding. But when deletion is disallowed (+ No Del.) and unwanted insertion between constraint tokens is prohibited (+ No Ins.), the presence of our desired term in the final translation is guaranteed. We show more examples in the Appendix.

Table 3 :
Comparison to previous work. The baseline Transformer and POST18 results are from Dinu et al. (2019).
Table 4 shows the hyperparameter settings for our LevT model. We learn a joint BPE vocabulary with 32,000 operations; the resulting vocabulary size is 39,843.