Integrating Domain Terminology into Neural Machine Translation

This paper extends existing work on terminology integration into Neural Machine Translation, a common industrial practice to dynamically adapt translation to a specific domain. Our method, based on placeholders complemented with morphosyntactic annotation, efficiently taps into the ability of the neural network to deal with symbolic knowledge, surpassing the surface generalization shown by alternative techniques. We compare our approach to state-of-the-art systems and benchmark them through a well-defined evaluation framework, focusing on the actual application of terminology and not just on overall performance. Results indicate the suitability of our method in the use case where terminology is applied to a system trained on generic data only.


Introduction
High out-of-the-box quality for Neural Machine Translation (Bojar et al., 2016) has boosted the adoption of automatic translation by the industry and invigorated research and development on domain adaptation and on the integration of the technology in human translation workflows. For instance, combination with translation memories (Bulté and Tezcan, 2019; Xu et al., 2020), terminology handling (Hasler et al., 2018; Dinu et al., 2019), interactive translation (Peris and Casacuberta, 2019), post-editing modelling (Chatterjee et al., 2019) and dynamic adaptation (Farajian et al., 2017) are all different techniques to make machine translation part of real-life localization workflows.
In this work, we focus on integrating terminology as a quick way to dynamically specialize a translation to a specific domain. Terminology is a key high-quality asset maintained by language specialists as part of a translation project: it is a way to guarantee language consistency, certify translation accuracy and define constraints on human translation. Terminologists put a great deal of effort into describing terms, including their morphology, their syntax, the semantic context in which these terms apply, etc. From a human perspective, even though the presentation and usage of dictionaries have evolved from ontologies (as found in paper dictionaries) to corpus-based presentation, looking up terms in a dictionary remains the ultimate point of reference for validating the correct term for a specific domain in a specific context.
Terminology resources, with all their sophistication, have been the core building blocks of rule-based engines and a continuous challenge to acquire in volume (Senellart et al., 2003). At the other extreme, they have been reduced to corpora or aligned "phrases" (Schwenk et al., 2008) in Statistical Machine Translation approaches, missing most of their intrinsic linguistic properties. In contrast, Neural Machine Translation operates on word and sentence representations in a continuous space: on the one hand, it has access to deep linguistic knowledge (Conneau et al., 2018) and demonstrates a strong ability to generalize; on the other hand, its results are more difficult to interpret (Koehn and Knowles, 2017), and the translation process is consequently far harder to control. Therefore, as for several other linguistic annotations, the challenge is how terminological information can be "passed" to the model.
In this work, we extend existing work on terminology adaptation, show similarity with translation memory, and propose a new approach and new benchmark through a well-defined evaluation framework focusing on actual application of terminology and not just on the overall performance.

2 Related Work
In recent years there has been significant work proposing methods to integrate such external specialized terminologies into NMT models. Mainstream techniques to tackle this challenge can be divided into three broad approaches, each showing a different level of performance on the main terminology injection issues, namely inference overhead and generalization power. We illustrate their particularities on a common scenario using two English-Spanish terminology entries, [precedent ; precedente] (noun) and [to extend ; ampliar] (verb), applied to the following sentence pair: These precedents can be extended. → Se pueden ampliar esos precedentes.
Placeholders incorporate non-terminal tokens into NMT systems, which requires modifying the pre- and post-processing of the data and training the system on data that contains the same placeholders as occur in the test sets (Crego et al., 2016). Following our example, source and target terms appearing in the sentence pair are replaced by placeholders: These <term#1> can be <term#2> . → Se pueden <term#2> esos <term#1> .
The resulting sentence pair is then fed to the translation network, which learns to produce the target sentence with the corresponding placeholders. A similar workflow is applied at inference. First, pre-processing replaces matched source terms and their morphological variants by placeholders (precedents → <term#1> and extended → <term#2>). Then, post-processing is applied to the NMT output, where translation terms replace the placeholders (<term#1> → precedentes and <term#2> → ampliar). Note that the network loses any possibility to model the tokens in the terminology, since it only has access to placeholders. The method also lacks flexibility, as the model will always replace the placeholder with the same phrase irrespective of grammatical context. On the other hand, the pre- and post-processing add no computational overhead at inference time. The approach derives from Luong et al. (2015), where words translated as out-of-vocabulary by the NMT network are post-processed using a dictionary.
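The placeholder workflow above can be sketched in a few lines of Python. This is a much-simplified illustration with exact-match lookup only (the actual pipeline also matches morphological variants); the helper names are ours, not the paper's:

```python
def apply_placeholders(tokens, term_base):
    """Pre-processing: replace source terms with indexed <term#i> placeholders
    and remember which target term each placeholder stands for."""
    mapping = {}
    out = []
    for tok in tokens:
        if tok in term_base:
            ph = f"<term#{len(mapping) + 1}>"
            mapping[ph] = term_base[tok]  # target term, restored later
            out.append(ph)
        else:
            out.append(tok)
    return out, mapping

def restore_placeholders(tokens, mapping):
    """Post-processing: placeholders in the NMT output become target terms."""
    return [mapping.get(tok, tok) for tok in tokens]

term_base = {"precedents": "precedentes", "extended": "ampliar"}
src = "These precedents can be extended .".split()
annotated, mapping = apply_placeholders(src, term_base)
# annotated: ['These', '<term#1>', 'can', 'be', '<term#2>', '.']
hyp = "Se pueden <term#2> esos <term#1> .".split()
print(" ".join(restore_placeholders(hyp, mapping)))
# → Se pueden ampliar esos precedentes .
```

Note how the post-processing step is purely lexical: whatever inflection is stored in the terminology base is copied verbatim, which is exactly the flexibility problem discussed above.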
Learning to apply constraints tackles the same problem by learning a copy behaviour over terminology at training time (Song et al., 2019; Dinu et al., 2019). The NMT model is trained to incorporate terminology translations when they are provided as additional input in the source sentence. Terminology translations are inserted as inline annotations, expecting the model to learn that the additional words must be copied into the target hypothesis. The authors insert terminology translations in the source sentence either by appending the target term to its source version, or by directly replacing the original term with the target one; both alternatives obtain similar translation accuracy. An additional input stream is also used to signal the switch between source text and target terminology to be copied. This additional factor takes three values: 0 for source words, 1 for source terms, and 2 for target terms: These 0 precedents 1 precedentes 2 can 0 be 0 extended 1 ampliar 2 . 0 → Se pueden ampliar esos precedentes .
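Building the factored input for the append variant can be sketched as follows; the function name is illustrative, and as above we use an exact-match terminology lookup:

```python
def annotate_with_factors(tokens, term_base):
    """Produce the word stream and its parallel factor stream:
    0 = source word, 1 = source term, 2 = appended target term."""
    words, factors = [], []
    for tok in tokens:
        if tok in term_base:
            words += [tok, term_base[tok]]  # source term followed by its translation
            factors += [1, 2]
        else:
            words.append(tok)
            factors.append(0)
    return words, factors

src = "These precedents can be extended .".split()
words, factors = annotate_with_factors(
    src, {"precedents": "precedentes", "extended": "ampliar"})
print(list(zip(words, factors)))
# → [('These', 0), ('precedents', 1), ('precedentes', 2), ('can', 0),
#    ('be', 0), ('extended', 1), ('ampliar', 2), ('.', 0)]
```

The two streams are kept aligned token by token, so the factor embeddings can be concatenated with the word embeddings at the model's input layer.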
This approach uses a generic NMT architecture which learns to use an external terminology provided at run-time, thus showing no inference overhead. However, similarly to the preceding approach, it lacks generalization power, as it simply "copies" the term found in the terminology base into the translation, irrespective of the target hypothesis context. Dinu et al. (2019) argue that in some cases the approach exhibits the ability to inflect translation terms.
A related line of work applies the same inline-annotation idea to translation memory fuzzy matching (Bulté and Tezcan, 2019; Xu et al., 2020), appending the translations of retrieved fuzzy matches to the source sentence. Results show that the model acquires the ability to reuse the appended translations when producing its own hypotheses. The authors report impressive translation accuracy improvements when sufficiently large fuzzy matches exist in the translation memory.
Constrained decoding enforces translation terms as decoding constraints applied at inference time. Among others, Hokamp and Liu (2017) introduced grid beam search (GBS), an algorithm which employs a separate beam for each lexical constraint (translation term), ensuring that each given constraint appears in the translation hypothesis. The algorithm explores all possible constraints at each time-step, making sure not to generate a constraint that has already been generated at a previous time-step, and guarantees that all constraints are present in the final output. Other works (Hasler et al., 2018; Post and Vilar, 2018; Susanto et al., 2020) attempt to reduce the computational cost caused by using multiple beams at inference, a well-known weakness of this approach. Similar to the previous approach, constrained decoding does not consider target context when inserting translation terms: it fixes the target form and then produces a target context that fits this constraint. However, in a more realistic scenario, a source term may have multiple translation-term inflections among which the MT engine should select the best one on the fly, depending on the source and target context.
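The bookkeeping at the heart of constrained decoding can be illustrated with a toy helper: at each time-step the decoder needs to know which lexical constraints are still unmet in the current hypothesis prefix. This is a drastically simplified view of grid beam search (no beams, no scoring), with an illustrative function name:

```python
def unmet_constraints(prefix, constraints):
    """Return the constraints (token sequences) not yet contained in prefix."""
    def contains(seq, sub):
        m = len(sub)
        return any(seq[i:i + m] == sub for i in range(len(seq) - m + 1))
    return [c for c in constraints if not contains(prefix, c)]

constraints = [["ampliar"], ["esos", "precedentes"]]
prefix = ["Se", "pueden", "ampliar"]
print(unmet_constraints(prefix, constraints))
# → [['esos', 'precedentes']]
```

In GBS proper, hypotheses are grouped into banks by the number of constraint tokens already generated, so that decoding can only terminate once this list is empty, which is why every constraint appears verbatim in the final output.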
Previously, Chatterjee et al. (2017) proposed a guide mechanism to enhance an NMT network with the ability to prioritize translation options presented in the form of XML annotations of source words. The mechanism is applied at every inference time-step, where the beam search is influenced with external suggestions coming from the attention model. Similarly, Zhang et al. (2018) exploit a search engine to retrieve sentence pairs whose source sides are similar with the input sentence, from which they collect translation pieces. Then, the NMT model is modified to give an additional bonus to output sentences that contain the collected translation pieces.
Our contribution In this article, we compare several methods for domain terminology integration, seen as dynamic adaptation of a model trained on generic data to a specialized domain through terminology only. While results are expected to be lower than those obtained through fine-tuning (training more iterations with a specialized parallel corpus), specializing with terminology only is a very frequent use case in industry, given that maintaining terminology lists makes sense for experts as a way to factorize the knowledge of frequently translated terms. We do not evaluate constrained decoding, since the comparison in Dinu et al. (2019) underlined that it outperformed in-line terminology in neither BLEU nor term usage rate, and its substantially increased decoding time does not suit production environments.

Terminology Injection
This work builds on the placeholder method presented above. We extend the approach to cover a wider variety of cases and to control morphology, allowing generalization power. To represent terminology we use several placeholders indicating part-of-speech (POS) and morphological information, on both the source and target sides. For each source-target term pair, we encode all possible inflections of the source and target words, labelled with inflection type. Not only does this analysis enable us to lexically match any inflected form of the source term, it can also produce any inflected form of the translation term, ensuring full flexibility in the inflection choice made by the neural network. The model can then learn to translate a sequence of dedicated placeholders in the source by a corresponding sequence of placeholders in the target, providing the post-processing with enough information to choose the right form among the multiple ones available for each translation term, and thus ensuring the correct grammatical inflection at inference. Consider the previous example with extended placeholder annotations: These <noun or adj#1> <plural masculine> can be <verb#2> <past participle> .
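The annotation step above can be sketched as a small transformation from a sentence and a terminology base to the placeholder-annotated source. The entry structure and helper name are illustrative (the paper's actual tag inventory is richer); the tags themselves reproduce the example:

```python
# terminology base: source surface form -> (POS tag, morphology of the target form)
TERM_BASE = {
    "precedents": ("noun or adj", "plural masculine"),
    "extended":   ("verb", "past participle"),
}

def mrk_annotate(tokens, base):
    """Replace each matched source term by an indexed POS placeholder
    followed by a morphology placeholder."""
    out, idx = [], 0
    for tok in tokens:
        if tok in base:
            idx += 1
            pos, morph = base[tok]
            out += [f"<{pos}#{idx}>", f"<{morph}>"]
        else:
            out.append(tok)
    return out

src = "These precedents can be extended .".split()
print(" ".join(mrk_annotate(src, TERM_BASE)))
# → These <noun or adj#1> <plural masculine> can be <verb#2> <past participle> .
```

The network then only has to map placeholder sequences to placeholder sequences; the post-processing picks the concrete inflected form indicated by the target-side morphology placeholder.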
A challenging case concerns homographs like the word precedents above. Source-side annotations indicate that the homograph can occur as a noun or an adjective, inflected in plural form. We also find it useful to convey in the source some information about the target word, namely that it is masculine, for the model to better integrate it in the translation (article, agreement, ...). The second term extended is unambiguously a verb in past participle. Target-side annotations indicate that, in the context of this example, the homograph precedents translates into Spanish as a noun in plural, while the second term extended translates as a verb in the infinitive. Annotations vary according to the language pair; for example, to control inflections in English-Spanish, we annotate a set of properties of each POS category, both in English (source) and in Spanish (target). Note that our approach does not require performing any linguistic annotation at inference: all annotations are already compiled in the terminology base acquired from specialized data. Continuing with the example, the word close triggers the use of the terminology placeholders close → <NNP A V#1> <s m> <+LEFTADJ> <W>, indicating that close is considered in our specialized terminology either as a noun, a verb or an adjective (b). The NMT network then produces the target hypothesis, solving the ambiguity in translation (c), and post-processing converts the remaining placeholders into word forms by means of a set of rules (d).
A potential disadvantage of this approach is that the actual instances of the injected terminology are completely hidden from the neural network, which only handles placeholders, whereas this information can be valuable except for rare words or OOVs. We thus propose a second alternative where the source term is left in the source sentence, surrounded by placeholders: These <NNP A#1> precedents <plural masculine> can be <V#2> extended <past participle> .

Corpora
Detailed statistics of the corpora used in this work are provided in Appendix B. We mainly use data from generic domains for both training and inference: parallel paragraphs crawled from the web (COMM); proceedings of the European Parliament (EPPS); legislative texts of the European Union (JRC); news commentaries (NEWS). We use data from specialized domains for inference only: documentation from the European Central Bank (ECB); documents from the European Medicines Agency (EMEA); localisation files (KDE4). All data is preprocessed using onmt-tokenize-text.

Terminology Bases
Terminology bases are automatically extracted from the training sections of each corpus used in this work (see Appendix B). Parallel data is first word-aligned with fast_align before extracting phrase pairs. Pairs are kept as terminology entries when they follow a set of pre-defined POS patterns (see details in Appendix B) and only when they appear in the test set.
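The POS-pattern filter can be sketched as a simple lookup over the source-side POS sequence of each extracted phrase pair. The pattern list and tag names below are illustrative, not the exact inventory used in the paper (which is given in Appendix B):

```python
# illustrative pre-defined POS patterns for keeping a phrase pair as a term
ALLOWED_PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("VERB",), ("NOUN", "NOUN")}

def keep_as_term(src_pos_tags):
    """Keep a phrase pair when its source POS sequence matches a pattern."""
    return tuple(src_pos_tags) in ALLOWED_PATTERNS

candidates = [
    (("precedent",), ("NOUN",)),
    (("central", "bank"), ("ADJ", "NOUN")),
    (("can", "be"), ("AUX", "AUX")),   # filtered out: not a valid term pattern
]
kept = [words for words, tags in candidates if keep_as_term(tags)]
print(kept)
# → [('precedent',), ('central', 'bank')]
```

Filtering on POS patterns discards alignment noise (function-word pairs, partial phrases) while keeping the noun, adjective and verb phrases that actually behave like terminology.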

Neural Machine Translation
Our NMT models follow the state-of-the-art Transformer architecture described in Vaswani et al. (2017), implemented in the OpenNMT-tf toolkit. Before training, we learn a 32K joint byte-pair encoding (Sennrich et al., 2016), which is not applied to the introduced placeholders. Note that all models are trained with a joint source and target vocabulary and shared word embeddings, to allow the injection of target words in the source stream. This is only required by one configuration, but it enables a fair comparison and does not harm the other models. Additional details of our translation networks are given in Appendix A.
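Keeping placeholders out of BPE simply means segmenting ordinary tokens while passing placeholder tokens through unchanged. A minimal sketch, where `bpe_segment` is a naive stand-in for a real BPE model (it just cuts tokens into 3-character pieces with the usual "@@" joiner):

```python
def bpe_segment(token):
    """Stand-in for BPE: split into 3-character pieces joined with '@@'."""
    pieces = [token[i:i + 3] for i in range(0, len(token), 3)]
    return [p + "@@" for p in pieces[:-1]] + [pieces[-1]]

def segment_protecting_placeholders(tokens):
    """Apply subword segmentation to ordinary tokens only; placeholder
    tokens (written <...>) must stay atomic so the network can copy them."""
    out = []
    for tok in tokens:
        if tok.startswith("<") and tok.endswith(">"):
            out.append(tok)           # placeholder: never split
        else:
            out.extend(bpe_segment(tok))
    return out

print(segment_protecting_placeholders(["These", "<term#1>", "precedents"]))
# → ['The@@', 'se', '<term#1>', 'pre@@', 'ced@@', 'ent@@', 's']
```

If placeholders were segmented like ordinary tokens, their pieces would lose the atomic symbol identity the model relies on to map source placeholders to target placeholders.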

Experiments
We evaluate the following configurations:
• tok: the baseline model, trained on plain tokenized data without any terminology annotation.
• app: the target inflected term is appended to the source term. We use an additional parallel stream (factor) to indicate whether each word is a term to inject and whether it belongs to the source or the target; word embeddings are built by concatenating both factor embeddings (Dinu et al., 2019): These 0 precedents 1 precedentes 2 can 0 be 0 extended 1 ampliar 2 . 0 → Se pueden ampliar esos precedentes .
• mrk: the source term is replaced by placeholders carrying its POS and morphological information, as presented in the Terminology Injection section: These <noun or adj#1> <plural masculine> can be <verb#2> <past participle> .
• mrk+: same as mrk, but the source term is kept in the sentence, surrounded by its placeholders: These <NNP A#1> precedents <plural masculine> can be <V#2> extended <past participle> .
It is worth mentioning that all models are trained on the same data and injected with the same terminology, at train and test time respectively. For every test set, we evaluate each model under four different annotation conditions: • NONE: no injection of terms, to control the performance of models trained with annotations when no annotation is injected at inference.
• MANY: injection of a large quantity of terms, to evaluate the ability of each configuration to handle multiple terms in a single sentence.
• ALREADY: injection of a reasonable quantity of terms that are already well translated by the baseline.
• IMPROVE: injection of a reasonable quantity of terms that are not already well translated by the baseline.
We evaluate each terminology injection configuration under equal and separable conditions, to better understand how each part of a customer terminology, usually a mix of already well-translated terms and terms benefiting from specialized translation, can contribute to translation improvement. To be able to evaluate terminology injection and its influence on BLEU score for existing corpora, we place ourselves in a setting where injected terms are necessarily present in the reference. We acknowledge that this does not fully reproduce a real scenario, where there is usually no guarantee about the coverage of customer specialized terminology in the content to translate. However, compared to the situation evaluated in Dinu et al. (2019), this experimental setting is: • closer to applied use cases, as it evaluates generic models on technical test sets and terminologies; • more controlled in term matching, as it uses morphological analysis instead of approximate match, which is necessary to match forms such as sigue 'follows' from the verb seguir 'to follow'; and • more complete, as our terminologies cover not only fully inflected nouns, adjectives and verbs, but also noun phrases, verb phrases and homographs, recognizing the role of all these categories in specializing translation.
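Separating candidate terms into the ALREADY and IMPROVE conditions amounts to checking, for each term pair present in the reference, whether the baseline output already contains the target term. A sketch under that reading (the helper name and data layout are ours):

```python
def split_conditions(term_pairs, baseline_hyp, reference):
    """term_pairs: list of (src_term, tgt_term) tuples;
    baseline_hyp and reference are token lists.
    Only terms present in the reference are evaluated."""
    already, improve = [], []
    for src_term, tgt_term in term_pairs:
        if tgt_term not in reference:
            continue  # the setting requires injected terms to be in the reference
        (already if tgt_term in baseline_hyp else improve).append((src_term, tgt_term))
    return already, improve

ref = "Mantener la jeringa en el embalaje exterior".split()
hyp = "Mantenga la aguja en el embalaje exterior".split()   # baseline output
pairs = [("syringe", "jeringa"), ("outer", "exterior")]
print(split_conditions(pairs, hyp, ref))
# → ([('outer', 'exterior')], [('syringe', 'jeringa')])
```

Here exterior is already produced by the baseline (ALREADY), while jeringa is not (IMPROVE), mirroring the mix found in real customer terminologies.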

Results
Results in terms of BLEU score (Papineni et al., 2002), computed by multi-bleu.perl, are reported in Table 1. The NONE condition checks that, when no term is injected and models are trained for the same number of iterations, all three models trained with annotations (app, mrk, mrk+) reach a performance only slightly lower than the baseline (tok). In the case of mrk and mrk+, we hypothesize that they actually see less lexically rich data during training, since the placeholders are not lexicalized. In the MANY condition, when we inject a high number of terms, the app score makes a significant jump in specialized domains only, while the scores of the models based on morphological marking (mrk, mrk+) suffer a substantial decrease in both generic and specialized domains, more pronounced for mrk and in specialized domains. When we inject a "reasonable" quantity of terms, results highly depend on the nature of the injected terms. In the ALREADY condition, when terms are already well produced by the baseline, terminology injection causes a small drop for all models compared to the baseline, a drop that becomes more important for mrk and in specialized domains. These results indicate that models using morphological marking suffer from not having access to lexical instances, in particular when too many terms are injected, reflecting the limits of these models. However, in the IMPROVE condition, when injected terms were not already produced by the baseline, terminology injection induces a considerable gain, in particular in specialized domains: for app (+0.38 in generic domains, +1.25 in specialized domains on average), and of larger magnitude for mrk (+0.50 generic, +2.15 specialized) and mrk+ (+0.71 generic, +2.08 specialized).

In-depth Evaluation
In parallel to the translation quality scores measured by BLEU, we now examine the correct term use rate, as well as the distribution of the different types of errors concerning term integration in the hypothesis, illustrated in Table 2. We identify the following types of error: • Case: the term is integrated with a different casing than in the reference (gráfico vs. Gráfico).
• Inflection: the term is integrated with a different inflection than in the reference (including number, gender and verb form errors). Note that the sentence stays perfectly grammatical, as the model integrates the chosen term with a different but correct inflection (liquidadas vs. liquidado).
• Homography: the integrated term is not the one in the reference, but corresponds to a homograph in the source. This error does not necessarily make the sentence nonsensical; in the example, the gerund driving is translated by the infinitive verb conducir instead of the noun conducción.
• Absence: the term does not appear in the translation, not even with a difference of case or inflection or as a form corresponding to a homograph, which means that the model has chosen to ignore the annotation when building its translation. Table 3 summarizes statistics of the error types observed in the test sets. In the MANY and ALREADY conditions, for both generic and specialized domains, app presents the highest term use rate, higher than the baseline and the mrk models. However, a closer look indicates that the errors made by the mrk models are mostly due to case (around 6%) and inflection (10%), errors that do not necessarily make the translation worse for a human evaluator (see examples [1] and [2] in Table 4), while most errors of app come from its non-injection of the desired term (5%). We also verify that in the MANY condition, introducing too many terms does not help the models to generate consistent translations, as it blurs the sequence of words in the sentence [3]. Coming to the IMPROVE condition, case and inflection errors persist at a similar rate for the mrk models, but the rate of absent terms for app grows to a noticeable level (34%): in all these cases, app prefers to ignore the terminology information it has been given in favour of its own translation [4], sometimes identical to the baseline tok. Critically, this freedom leads app to fail to inject an irregular verb form, making the sentence ungrammatical [5], and complete drug names, making the translation much less safe to use [6, 7]. With respect to the mrk and mrk+ models, they have comparably high term injection rates, but mrk offers slightly higher BLEU in specialized domains and more control over the injected term: in particular for multiple-word terms, we have observed that mrk+ could erroneously repeat part of a compound [7], whereas mrk, which is blind to whether the injected terms are single or multiple words, can integrate both seamlessly.
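The error typology above can be operationalized as a cascade of checks on the hypothesis, from the mildest deviation (casing) to complete absence. A sketch under our own reading of the categories, with illustrative inputs (the real evaluation draws inflections and homographs from the terminology base):

```python
def classify_term(expected, hypothesis_tokens, inflections, homographs):
    """Classify how a single expected target term surfaced in the hypothesis.
    inflections: other valid inflected forms of the expected term;
    homographs: translations of source homographs of the term."""
    lowered = [t.lower() for t in hypothesis_tokens]
    if expected in hypothesis_tokens:
        return "correct"
    if expected.lower() in lowered:
        return "case"        # same form, different casing
    if any(f in hypothesis_tokens for f in inflections):
        return "inflection"  # a different but valid inflected form
    if any(h in hypothesis_tokens for h in homographs):
        return "homography"  # translation of a source homograph
    return "absence"         # the annotation was ignored

hyp = "Prohibido conducir a la izquierda".split()
print(classify_term("conducción", hyp,
                    inflections=["conducciones"],
                    homographs=["conducir"]))
# → homography
```

Aggregating these labels over all injected terms of a test set yields the per-category rates reported in Table 3.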

Conclusion
Our major finding is that, in a context where the terminology introduces specialized terms that were not already well translated by the baseline, the app model (appending terminology as inline annotations in the source text) fails to inject terms in 34% of cases and therefore does not guarantee the presence of expected terms in translations. This can be highly critical in a real setting, when the user wants terminology to enforce the use of certified brands, product names and acronyms, but also business concepts such as noun phrases and verbs. With the constraints that one needs to curate highly detailed linguistic resources and that the quantity of injected terms needs to be limited, the mrk models (representing expected terms by their morphological analysis) offer a stronger guarantee of term injection, with an absence rate of only 2%: when the exact term cannot be injected, the model usually injects a case or inflection variation that fits the translation. Additionally, the model can handle intricate patterns that are part of a vast majority of languages, such as irregular forms, complex noun or verb phrases, as well as multi-part and contextual entries. In contrast with the app model, which simply learns a copy behaviour from source to target agnostic to the context, the mrk models leverage the inner language knowledge of the neural network to perform morphological and syntactic analysis of the source and more seamlessly generate the target.

Examples from Table 4:

mrk  Aumento en los niveles de colesterol y de la férula , hiponatremia
mrk+ Aumentar en los niveles de colesterol y de los sistemas de retención , hiponatremia

[4] app fails to inject the correct term, contrary to the mrk models (but jeringa is in vocabulary)
src  Keep the syringe in the outer carton in order to protect from light .
ref  Mantener la jeringa en el embalaje exterior para protegerla de la luz .
tok  Mantenga la aguja en el cartón exterior para protegerse de la luz .
app  Mantenga la munición en el cartón exterior para proteger de la luz .
mrk  Mantenga la jeringa en el cartón exterior para proteger de la luz .
mrk+ Mantenga la jeringa en el cartón exterior para proteger de la luz .

[6] app fails to inject a drug name (TYSABRI is OOV)
src  Use of TYSABRI has been associated with an increased risk of PML .
ref  El uso de TYSABRI se ha asociado a un incremento del riesgo de LMP .
tok  El uso de TYSABIRON ha sido asociado con un mayor riesgo de PML .
app  El uso de la TYSALine se ha asociado con un mayor riesgo de PML .
mrk  El uso de TYSABRI se ha asociado con un mayor riesgo de PML .
mrk+ El uso de TYSABRI se ha asociado con un mayor riesgo de PML .

[7] app fails to inject a multi-word drug name, mrk+ repeats part of it
src  -tell you when you may need to use a higher or lower dose of Insuman Infusat ,

A NMT Configuration
The next table gives details of the network configuration used in our experiments:

N = 6; d = 512; d_ff = 2,048; h = 8; V = 32,000; batch = 2,048; optimization = lazy Adam; updates = 300,000; beam = 5

where N is the number of layers; d is the size of both word embeddings and hidden layers; d_ff is the size of the inner feed-forward layer; h is the number of heads; V is the size of the joint vocabulary; batch is the training batch size (in number of tokens); and beam indicates the inference beam size. For Adam (Kingma and Ba, 2015) optimization we set warm-up steps to 4,000 and update the learning rate every 8 iterations. For training and inference we use a single NVIDIA P100 GPU.

B Corpora Statistics
Tables 5 and 6 respectively report statistics of the different corpora used in this work and the number of injected terms according to POS patterns. Corpora are randomly split, keeping 500 sentences for validation, 2,000 (or 8,000) for testing, and the rest for training.