HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation

Bilingual lexicons are valuable resources used by professional human translators. While these resources can be easily incorporated in statistical machine translation, it is unclear how to best do so in the neural framework. In this work, we present the HABLex dataset, designed to test methods for bilingual lexicon integration into neural machine translation. Our data consists of human-generated alignments of words and phrases in machine translation test sets in three language pairs (Russian-English, Chinese-English, and Korean-English), resulting in clean bilingual lexicons which are well matched to the reference. We also present two simple baselines (constrained decoding and continued training) and an improvement to continued training to address overfitting.


Introduction
Neural machine translation (NMT) is the current state-of-the-art. In contrast with statistical machine translation (SMT; Koehn et al., 2007), where there are several established methods of incorporating external knowledge, recent work is still examining how best to incorporate bilingual lexicons into NMT systems. Bilingual lexicon integration is desirable in a number of scenarios: highly technical vocabulary (which might be rare, or require translations of a domain-specific sense), lower-resource settings (where bilingual lexicons might be a significant portion of the available parallel data), translation settings where a client requires particular terms to be used (e.g. brand names), or for improving rare word translation.
At present, there is no standard dataset for benchmarking bilingual lexicon integration, making it difficult to compare methods.

Example lexicon entry: амортизаторы↔shock absorbers
Target: shock absorbers are mounted on a front wall of the housing.

We create and release Human Annotated Bilingual Lexicons (HABLex). Our bilingual lexicons are (1) generated by bilingual experts to ensure high quality, (2) derived from the development and test references so the best-case-scenario impact on translation performance can be directly measured, (3) covering 3 language pairs, and (4) focused on challenging words.
We perform exploratory work on our development set, showing two representative baselines to compare incorporating the lexicon at training time (continued training) vs. at decoding time (constrained decoding). We examine the tradeoffs in terms of BLEU, recall, and speed.
We also present a novel application of Elastic Weight Consolidation (EWC; Kirkpatrick et al., 2017; Thompson et al., 2019) which significantly improves performance by preventing overfitting during continued training on the bilingual lexicon.

Related Work
We first review prior work on incorporation of bilingual lexicons into NMT and then discuss the datasets used and explain how our new dataset addresses some shortcomings.
Recent work on the incorporation of bilingual lexicons into NMT systems can be loosely clustered into two categories: incorporation at training time or during decoding. These two general approaches have different performance characteristics: incorporation at training time may slow down training, but tends not to alter inference speed, while incorporation at inference time tends to significantly slow down decoding (see subsection 4.3), but without slowing down training. Zhang and Zong (2016) and Fadaee et al. (2017) both propose using bilingual lexicons to create synthetic bitext to augment training data for NMT systems. Arthur et al. (2016) use translation probabilities from a lexicon (like SMT phrase tables) in conjunction with NMT probabilities. Kothur et al. (2018) perform fine-grained continued training adaptation on very small, document-specific bilingual lexicons of novel words.
A popular inference-time approach is constrained decoding (Anderson et al., 2016; Hokamp and Liu, 2017; Hasler et al., 2018; Post and Vilar, 2018), which modifies beam search to require that user-specified words or phrases be present in the output hypotheses. Constrained decoding can be used to ensure that target entries from a bilingual lexicon are present in the MT output whenever their corresponding source entries are present in the input. Constrained decoding with multiple target options (e.g. when a source word can be translated into one of several target words) is addressed in  and Hasler et al. (2018).

Datasets Used in Prior Studies
Prior work has used either human-generated general purpose bilingual lexicons not tailored to a test set (Arthur et al., 2016; Zhang and Zong, 2016) or automatic alignments (Fadaee et al., 2017; Hokamp and Liu, 2017; Hasler et al., 2018; Kothur et al., 2018) which likely contain errors (especially on rare words). There exist manually word-aligned parallel corpora for Japanese-English Wikipedia data (Neubig, 2011) and for Chinese-English mixed-domain data (Liu and Sun, 2015), from which it would be possible to extract lexicons.
Most prior work experimented in a single language pair, making it difficult to predict if a method will generalize. Our work addresses these issues by providing a multi-language testbed for comparisons, focusing on a consistent and lexically-challenging domain.

HABLex Dataset
Our motivation is to allow straightforward evaluation of lexicon incorporation methods. By building a reference-derived bilingual lexicon, we ensure that if a method successfully learns correct translations from the lexicon while maintaining overall system performance, BLEU will increase.
We annotate the existing parallel text with alignments for certain vocabulary items, allowing us to extract bilingual lexicons that are specific to the context in which the translation should appear. This produces a cleaner and better-tailored bilingual lexicon than would be found in most real-world scenarios. However, it could also be used as a standardized starting point to produce a noisier lexicon that more closely mimics real-world lexicons (e.g. by adding irrelevant entries or relevant morphological variants, lemmatizing entries, or subsampling).
At a high level, our lexicons are created by a two-step process: (1) identifying interesting words on the source side of the test and development sets, and (2) human annotators correcting or validating automatic alignments of the identified words.

Patent Domain & Languages
We chose the patent domain because it contains interesting technical terminology and because of the availability of the high-quality, multilingual World Intellectual Property Organization (WIPO) COPPA-V2 corpus (Junczys-Dowmunt et al., 2016).
We build bilingual lexicons for three language pairs: Russian→English (Ru→En), Korean→English (Ko→En), and Chinese→English (Zh→En). We also release the development and test splits from which the annotations were produced.

Interesting Word Selection
We select interesting words (from the source side of the data) as follows: words that appear fewer than 5 times in WIPO data, words that appear fewer than 5 times in general domain training data (see section 4), and words that appear fewer than 5 times in both. 5 These represent three types of words that may be challenging to translate: words that have only recently been learned, words that may have been forgotten during continued training, and generally rare words.
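The selection rule above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes whitespace tokenization, and the function names and group labels (`rare_in_wipo`, `rare_in_general`, `rare_in_both`) are hypothetical. The text does not say whether the three categories overlap; the sketch makes them disjoint for clarity.

```python
from collections import Counter


def count_tokens(sentences):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def select_interesting_words(eval_sentences, wipo_counts, general_counts, threshold=5):
    """Group source-side words of the dev/test data by the corpus in which they are rare.

    A word is "rare" in a corpus if it occurs fewer than `threshold` times there.
    Assumption: the groups are made disjoint (a word rare in both corpora is
    placed only in "rare_in_both").
    """
    groups = {"rare_in_wipo": set(), "rare_in_general": set(), "rare_in_both": set()}
    for sentence in eval_sentences:
        for word in sentence.split():
            rare_wipo = wipo_counts[word] < threshold        # Counter returns 0 if unseen
            rare_general = general_counts[word] < threshold
            if rare_wipo and rare_general:
                groups["rare_in_both"].add(word)
            elif rare_wipo:
                groups["rare_in_wipo"].add(word)
            elif rare_general:
                groups["rare_in_general"].add(word)
    return groups
```

Words that are frequent in both corpora fall into none of the groups and are therefore not sent to annotators.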

Human Annotations
Our bilingual annotators used the LDC Word Aligner tool (Grimes, 2010) to annotate words of interest in context. Table 1 shows example lexicon entries resulting from this annotation process. In order to save annotator time and effort, we displayed highlighted automatic alignments 6 of each of the words of interest in their parallel sentence context (similar to Grimes et al. (2012)). The annotator confirms or corrects the automatic alignment, adding or removing source or target side words as needed to complete a valid alignment. Since annotation is cumbersome with very long sentences, we omit sentences of length 100 or more tokens. Filtering out numerical entries and phrases longer than 3 words 7 results in bilingual lexicons with sizes shown in Table 2. The data contains a small number of discontiguous alignments; these are so infrequent as to have a negligible effect on BLEU score results.
5 We focus on 5-count (and rarer) words because we expect them to be particularly challenging for MT, but given time and resources there is nothing to prevent the application of the annotation protocol to other terms.
6 For strong initial alignments, GIZA++ (Och and Ney, 2003) is trained on train, development, and test for all data described in section 4 as well as a TED talk corpus (Cettolo et al., 2012).
7 While we want the dictionary to match the reference, we did not want to train on large phrases from the reference.
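The post-annotation filtering step might look like the sketch below. It is only an illustration under stated assumptions: the text does not define "numerical" or say which side of an entry the 3-word limit applies to, so the sketch treats a phrase as numerical when every token contains a digit and no letter, and applies both filters to either side. The helper names are hypothetical.

```python
def is_numeric(phrase):
    """Assumed definition: every token has at least one digit and no letters."""
    tokens = phrase.split()
    return bool(tokens) and all(
        any(ch.isdigit() for ch in tok) and not any(ch.isalpha() for ch in tok)
        for tok in tokens
    )


def filter_entries(entries, max_words=3):
    """Drop numerical entries and entries with a phrase longer than `max_words` words.

    `entries` is a list of (source_phrase, target_phrase) pairs.
    Assumption: the filters are applied to both the source and the target side.
    """
    kept = []
    for source, target in entries:
        if is_numeric(source) or is_numeric(target):
            continue
        if len(source.split()) > max_words or len(target.split()) > max_words:
            continue
        kept.append((source, target))
    return kept
```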

Baseline Experiments and Results
To build strong baseline systems, we first train models on general domain data. As general domain data, we use the OpenSubtitles18 (Lison et al., 2018) corpora for both Ru→En and Ko→En. For Ru→En and Zh→En we also use the parallel portion of the WMT17 news translation task (Bojar et al., 2017). We then fine-tune these general-domain models on WIPO training data (Luong and Manning, 2015), using the dev set for validation. These domain-adapted models are then used as the initial systems for our lexicon incorporation experiments.
We build the systems in Sockeye (Hieber et al., 2017), using a two-layer LSTM network with hidden unit size 512. We use an initial learning rate of 3e-4 both for training the general domain model and adapting to WIPO. We apply the Moses tokenizer (Koehn et al., 2007), lowercasing, and byte-pair encoding (BPE; Sennrich et al., 2016) with a vocabulary size of 30k. BPE is trained on the general domain corpus only, then applied to all data.

Evaluation Metrics
We evaluate lexicon incorporation approaches using two main metrics: BLEU and recall. For each annotated instance of a source-side lexical entry, we can check whether the system output contains the correct aligned target-side translation. Recall is computed as the percentage of the time that the system produces the correct output, averaged over all annotations. Note that this does not guarantee that the words are placed in a sensible location in the sentence, only that they appear.
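A minimal sketch of the recall computation described above. It assumes whitespace-tokenized system output and contiguous matching of the target phrase; the text does not specify how matching is done (in particular for the few discontiguous alignments), and the function names and the `(sentence_index, target_phrase)` annotation format are hypothetical.

```python
def contains_phrase(tokens, phrase):
    """Return True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def lexicon_recall(annotations, hypotheses):
    """Percentage of annotated instances whose target phrase appears in the output.

    `annotations` is a list of (sentence_index, target_phrase) pairs, one per
    annotated source-side instance; `hypotheses` is the list of system outputs.
    Position in the sentence is not checked, only presence.
    """
    if not annotations:
        return 0.0
    hits = sum(
        contains_phrase(hypotheses[idx].split(), target.split())
        for idx, target in annotations
    )
    return 100.0 * hits / len(annotations)
```

Because only presence is checked, a system can score high recall while placing the term in the wrong position, which is why recall is reported alongside BLEU.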
We also consider training and decoding speed. All of these factors help determine which approaches are best given a particular use case. For example, if a user requires exact fidelity to a lexicon (e.g. branding), they may care the most about recall (while still ensuring that overall translation quality is acceptable).

Continued Training on HABLex
We perform continued training (CT; Luong and Manning, 2015), typically used for domain adaptation, on the bilingual lexicon data. With bilingual lexicons approximately two orders of magnitude larger than those used successfully in Kothur et al. (2018), we find that performance drops dramatically with standard CT. To address this problem, we apply EWC (Kirkpatrick et al., 2017), a method for training a neural network to learn a new task without catastrophically forgetting how to perform a previously learned task. EWC has recently been shown to significantly reduce general domain performance loss during domain adaptation in NMT (Thompson et al., 2019); we apply it to retain the ability to translate full sentences while training on a bilingual lexicon. We experimented with initial learning rates (0.001, 0.00316, 0.01, 0.0316, 0.1), all using SGD for 300 epochs, and EWC weight decay values (1e-1, 1e-2, 1e-3, 1e-4, 1e-5). We chose a learning rate of 0.1 because it allowed recall to (at least approximately) converge, and a weight decay value of 1e-4 because it performed reasonably well on the dev sets for all 3 languages. All results reported are for the final checkpoint.
Table 3: BLEU and recall % (rec.) of baseline (BL), continued training (CT) with and without EWC, and oracle and random constrained decoding (CD) on development set. 8
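For orientation, the EWC objective can be sketched in a few lines. This follows the general form in Kirkpatrick et al. (2017), where the new-task loss is augmented with a quadratic penalty weighted by a diagonal Fisher information estimate; it is not the Sockeye implementation used in the experiments. Treating the "weight decay value" above as the penalty strength is an assumption, as are the function and parameter names and the use of flat Python lists for parameters.

```python
def ewc_penalty(params, anchor_params, fisher, strength):
    """Quadratic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `anchor_params` are the parameters of the model before continued training
    (here, the domain-adapted model) and `fisher` holds the diagonal Fisher
    information estimate for each parameter.
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def ewc_loss(task_loss, params, anchor_params, fisher, strength=1e-4):
    """Loss on the new data (here, the lexicon) plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)
```

Parameters with a high Fisher value are pulled strongly back toward their pre-training values, while parameters with a low Fisher value remain free to fit the lexicon entries.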

Constrained Decoding
We employ dynamic beam allocation (DBA; Post and Vilar, 2018) for constrained decoding (CD). At each time step of decoding, DBA groups hypotheses into banks based on the number of translated constraints, and a fixed-size beam is dynamically allocated across the banks. While DBA has a time complexity constant in the number of constraints, in practice we find it to be approximately an order of magnitude slower than regular decoding, primarily because Sockeye does not currently support batched DBA.
One limitation of DBA is that it only works on constraints that have only one translation. In this work, we report oracle choice (use the right sense for the specific test sentence) as an upper bound on performance and random choice (choose a random possible translation) as another baseline. Our dictionary may contain more than one possible translation for a given source word. For example, we observe the following three English translations for the Russian word арматурного: rebar, reinforced, and reinforcement. These translations are appropriate in different contexts and are not interchangeable, so the oracle always selects the translation that is appropriate for the given sentence (according to its reference translation). The random approach selects one of the translations uniformly at random; if there is only one translation, random is identical to oracle.
Table 3 summarizes key exploratory results for the baseline, CT, and CD approaches on the development set. Table 4 shows baseline and random CD benchmarks on the test set, which we otherwise reserve and release for future evaluation. We confine the remainder of our discussion to the experiments we performed on the development set.
8 The recall of the oracle method would be 100%, if there were no out-of-vocabulary subwords in the bilingual lexicon.
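The difference between oracle and random constraint selection can be illustrated with the sketch below. It only shows how one target constraint per matching source entry might be chosen before being handed to a constrained decoder; it does not implement DBA itself. The lexicon format (source phrase mapped to a list of target phrases), contiguous whitespace-token matching, and all function names are assumptions.

```python
import random


def contains_phrase(tokens, phrase):
    """Return True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def choose_constraints(source, reference, lexicon, mode="oracle", rng=None):
    """Pick one target constraint for each lexicon entry found in the source sentence.

    oracle: pick the translation that appears in the reference (upper bound).
    random: pick one of the listed translations uniformly at random.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch reproducible
    src_tokens, ref_tokens = source.split(), reference.split()
    constraints = []
    for src_entry, translations in lexicon.items():
        if not contains_phrase(src_tokens, src_entry.split()):
            continue
        if mode == "oracle":
            matches = [t for t in translations if contains_phrase(ref_tokens, t.split())]
            if matches:
                constraints.append(matches[0])
        elif mode == "random":
            constraints.append(rng.choice(translations))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return constraints
```

When an entry lists a single translation, both modes return the same constraint, which matches the observation that random and oracle coincide in that case.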

Results & Discussion
EWC noticeably improves BLEU performance as compared to standard CT, while also increasing recall. (See Figure 1 for example performance as the model is trained on the bilingual lexicon and Table 3 for results on all three language pairs.) CD outperforms CT in terms of both BLEU score and recall, at the expense of decoding speed. Random CD nears oracle CD performance because most source-side entries have only one translation. We would not necessarily expect such high random CD performance if the bilingual lexicon averaged more senses per source entry.
The use case for bilingual lexicon incorporation should also influence user decisions about what approaches to use. If exact translations (e.g. brand names or highly technical translations) are of the utmost importance, a CD approach might be preferred even if it is slower. In the case that a general lexicon has been provided but there is more flexibility in terminology, it may be better to perform CT, or perhaps even a combination of the two.

Conclusions
Bilingual lexicons are important resources in translation, but it is not clear how to best incorporate them in NMT. To address this challenge, we present the HABLex dataset, a multi-language, reference-derived development and test set that facilitates the evaluation of lexicon incorporation. We compare two baselines, based on incorporation at training time or at decode time, in terms of BLEU, recall, and speed. We also present a novel application of EWC to continued training which addresses lexicon overfitting.