Multi-Domain Neural Machine Translation with Word-Level Domain Context Discrimination

The study of multi-domain Neural Machine Translation (NMT) is of great practical value and mainly focuses on using mixed-domain parallel sentences to construct a unified model that can switch between different domains during translation. Intuitively, the words in a sentence are related to its domain to varying degrees, so they exert disparate impacts on multi-domain NMT modeling. Based on this intuition, in this paper, we dedicate ourselves to distinguishing and exploiting word-level domain contexts for multi-domain NMT. To this end, we jointly model NMT with monolingual attention-based domain classification tasks and improve NMT as follows: 1) Based on the sentence representations produced by a domain classifier and an adversarial domain classifier, we generate two gating vectors and use them to construct domain-specific and domain-shared annotations, which later guide translation predictions via different attention models; 2) We utilize the attention weights derived from the target-side domain classifier to adjust the weights of target words in the training objective, enabling domain-related words to have greater impacts during model training. Experimental results on Chinese-English and English-French multi-domain translation tasks demonstrate the effectiveness of the proposed model. The source code of this paper is available on GitHub: https://github.com/DeepLearnXMU/WDCNMT.


Introduction
In recent years, neural machine translation (NMT) has achieved great advances (Nal and Phil, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). However, two difficulties are encountered in practical applications of NMT. On the one hand, training an NMT model for a specific domain requires a large quantity of parallel sentences in that domain, which are often not readily available. Hence, the much more common practice is to construct NMT models using mixed-domain parallel sentences. In this way, the domain-shared translation knowledge can be fully exploited. On the other hand, the sentences to be translated often belong to multiple domains, thus requiring an NMT model that generalizes across domains. Since the textual styles, sentence structures and terminologies of different domains are often remarkably distinctive, whether such domain-specific translation knowledge is effectively preserved has a direct effect on the performance of the NMT model. Therefore, how to simultaneously exploit the exclusive and shared translation knowledge of mixed-domain parallel sentences for multi-domain NMT remains a challenging task.

To tackle this problem, researchers have recently carried out many constructive and in-depth studies (Kobus et al., 2016; Zhang et al., 2016; Pryzant et al., 2017; Farajian et al., 2017). However, most of these studies focus on utilizing domain contexts as a whole in NMT, while ignoring the discrimination of domain contexts at a finer-grained level. In each sentence, some words are closely associated with its domain, while others are domain-independent. Intuitively, these two kinds of words play different roles in multi-domain NMT; nevertheless, current models do not distinguish them. Take the sentence shown in Figure 1 for example. The Chinese words meaning "congress", "bills", "inclusion", and "agenda" are frequently used in the Laws domain and imply the Laws style of the sentence, while the other words in this sentence are common to all domains and mainly convey the semantic meaning of the sentence. Thus, it is reasonable to distinguish and encode these two types of words separately to capture domain-specific and domain-shared contexts, allowing the exclusive and shared knowledge to be exploited without interference from each other. Meanwhile, the English words "priority", "government", "bill" and "agenda" are also closely related to the Laws domain. To preserve domain-related text styles and idioms in generated translations, it is also reasonable for our model to pay more attention to these domain-related words than to the others during model training. On this account, we believe that it is important to distinguish and exploit word-level domain contexts for multi-domain NMT.
In this paper, we propose a multi-domain NMT model with word-level domain context discrimination. Specifically, we first jointly model NMT with monolingual attention-based domain classification tasks. In the source-side domain classification and adversarial domain classification tasks, we perform two individual attention operations on the source-side annotations to generate the domain-specific and domain-shared vector representations of the source sentence, respectively. Meanwhile, an attention operation is also applied to the target-side hidden states to implement target-side domain classification. Then, we improve NMT in the following two ways: (1) According to the sentence representations produced by the source-side domain classifier and adversarial domain classifier, we generate two gating vectors for each source annotation. With these two gating vectors, the encoded information of each source annotation is automatically selected to construct domain-specific and domain-shared annotations, both of which are used to guide translation predictions via two attention mechanisms; (2) Based on the attention weights of the target words from the target-side domain classifier, we employ a word-level cost weighting strategy to refine our model training. In this way, domain-specific target words are assigned greater weights than others in the objective function of our model.
Our work demonstrates the benefits of separately modeling domain-specific and domain-shared contexts, which echoes the successful applications of multi-task learning based on the shared-private architecture in many tasks, such as discourse relation recognition, word segmentation, text classification (Liu et al., 2017a), and image classification. Overall, the main contributions of our work are summarized as follows:

• We propose to construct domain-specific and domain-shared source annotations from the initial annotations, whose effects are separately captured for translation predictions.
• We propose to adjust the weights of target words in the training objective of NMT according to their relevance to different domains.
• We conduct experiments on large-scale multi-domain Chinese-English and English-French datasets. Experimental results demonstrate the effectiveness of our model.

Our Model

Figure 2 illustrates the architecture of our model, which includes a neural encoder equipped with a domain classifier and an adversarial domain classifier, and a neural decoder with two attention models and a target-side domain classifier.

Neural Encoder
[Figure 2: The architecture illustration of our model. Note that the two source-side domain classifiers are used to produce domain-specific and domain-shared annotations, respectively, and the target-side domain classifier is only used during model training.]

As shown in the lower part of Figure 2, our encoder leverages the sentence representations produced by these two classifiers to construct domain-specific and domain-shared annotations from the initial ones, preventing the exclusive and shared translation knowledge from interfering with each other. In our encoder, the input sentence x = x_1, x_2, ..., x_N is first mapped to word vectors and then fed into a bidirectional GRU (Cho et al., 2014), which processes it in the left-to-right and right-to-left directions. The two resulting hidden state sequences are then concatenated to form the initial annotations h_1, ..., h_N, which are subsequently fed into
two attention-like aggregators to generate the semantic representations of sentence x, denoted by the vectors E_r(x) and E_s(x), respectively. Based on these two vectors, we employ the same neural network to model two classifiers with different context modeling objectives. One is a domain classifier that aims to distinguish different domains in order to generate domain-specific source-side contexts. It is trained using the objective function

J^s_dc(x; θ^s_dc) = log p(d|x; θ^s_dc),

where d is the domain tag of x and θ^s_dc is its parameter set. The other is an adversarial domain classifier capturing source-side domain-shared contexts. To this end, we train it using the following adversarial loss functions:

J^{s1}_adc(x; θ^{s1}_adc) = log p(d|x; θ^{s1}_adc),
J^{s2}_adc(x; θ^{s2}_adc) = H(p(·|x; θ^{s2}_adc)),

where H(p(·)) = −Σ_{k=1}^{K} p_k(·) log p_k(·) is the entropy of a distribution p(·) over K domain labels, and θ^{s1}_adc and θ^{s2}_adc denote the parameters of the softmax layer and of the generation layer of E_s(x) in this classifier, respectively. In this way, E_r(x) and E_s(x) are expected to encode the domain-specific and domain-shared semantic representations of x, respectively. It should be noted that our utilization of domain classifiers is similar to the adversarial training used in (Pryzant et al., 2017), which injects domain-shared contexts into annotations. By contrast, we introduce a domain classifier and an adversarial domain classifier simultaneously to distinguish different kinds of contexts for NMT more explicitly.
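The role of the entropy term in the adversarial branch can be seen numerically: entropy is maximized by the uniform distribution over domains, so rewarding high entropy pushes the shared representation toward features that carry no domain identity. A minimal sketch (the four-domain labels and probability values are illustrative only):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k * log(p_k) over K domain labels."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

# Maximizing H(p(.|x)) w.r.t. the generator of E_s(x) rewards domain
# posteriors that are close to uniform, i.e. domain-agnostic features.
uniform = [0.25, 0.25, 0.25, 0.25]  # e.g. News / Laws / Spoken / Thesis
peaked = [0.97, 0.01, 0.01, 0.01]   # confidently domain-specific
```

For K domains the maximum attainable value is log K, reached exactly at the uniform posterior.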
Here we describe only the modeling procedure of the domain classifier; the same procedure also applies to the adversarial domain classifier. Specifically, E_r(x) is defined as follows:

E_r(x) = Σ_i α_i h_i,  α_i = softmax(v_a^T tanh(W_a h_i)),

where v_a and W_a are the relevant attention parameters. Then, we feed E_r(x) into a fully connected layer with a ReLU function (Ballesteros et al., 2015), and pass its output through a softmax layer to implement domain classification:

p(d|x; θ^s_dc) = softmax(W^s_dc ReLU(E_r(x)) + b^s_dc),

where W^s_dc and b^s_dc are the softmax parameters.

Domain-Specific and Domain-Shared Annotations. Since domain-specific and domain-shared contexts have different effects on NMT, they should be distinguished and separately captured by the NMT model. Specifically, we first leverage the sentence representations E_r(x) and E_s(x) to generate two gating vectors, g^r_i and g^s_i, for each annotation h_i in the following way:

g^r_i = σ(W^1_gr E_r(x) + W^2_gr h_i + b_gr),
g^s_i = σ(W^1_gs E_s(x) + W^2_gs h_i + b_gs),

where W^*_gr, W^*_gs, b_gr and b_gs denote the relevant matrices and biases, respectively. With these two gating vectors, we construct the domain-specific and domain-shared annotations h^r_i and h^s_i from h_i:

h^r_i = g^r_i ⊙ h_i,  h^s_i = g^s_i ⊙ h_i.
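The aggregate-then-gate pipeline above can be sketched end to end on toy numbers. This is not the paper's implementation: the weight matrices are simplified to elementwise ("diagonal") vectors, and all values are made up for illustration; only the shape of the computation — attention pooling, sigmoid gating, elementwise masking — is the point.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate(annotations, v_a, w_a):
    # score_i = v_a . tanh(w_a * h_i): diagonal stand-in for W_a
    scores = [sum(v * math.tanh(w * hd) for v, w, hd in zip(v_a, w_a, h))
              for h in annotations]
    alphas = softmax(scores)
    dim = len(annotations[0])
    rep = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return rep, alphas

def gates(rep, h_i, w_e, w_h, b):
    # g_i = sigmoid(W_e E(x) + W_h h_i + b), elementwise sketch
    return [sigmoid(we * e + wh * hd + bb)
            for we, e, wh, hd, bb in zip(w_e, rep, w_h, h_i, b)]

# toy run: 3 annotations of dimension 2
h = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
E_r, alphas = aggregate(h, v_a=[1.0, 1.0], w_a=[1.0, 1.0])
g_r = gates(E_r, h[0], w_e=[0.5, 0.5], w_h=[0.5, 0.5], b=[0.0, 0.0])
h_r0 = [g * hd for g, hd in zip(g_r, h[0])]  # domain-specific annotation for h_1
```

Because every gate value lies strictly in (0, 1), the gated annotation h^r_i is a softly down-weighted copy of h_i, never an amplified one; the gate decides how much of each dimension survives into the domain-specific stream.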

Neural Decoder
The upper half of Figure 2 illustrates the architecture of our decoder. In particular, with the attention weights of target words from the target-side domain classifier, we employ a word-level cost weighting strategy to refine model training. Formally, our decoder applies a nonlinear function g(*) to define the conditional probability of the translation y = y_1, y_2, ..., y_M:

p(y_j | x, y_<j) = g(y_{j−1}, s_j, c^r_j, c^s_j),

where the vector s_j denotes the GRU hidden state. It is updated as

s_j = GRU(s_{j−1}, y_{j−1}, c^r_j, c^s_j).

Here the vectors c^r_j and c^s_j represent the domain-specific and domain-shared contexts, respectively.
Domain-Specific and Domain-Shared Context Vectors. When generating y_j, we define c^r_j as a weighted sum of the domain-specific annotations {h^r_i}:

c^r_j = Σ_i α^r_{j,i} h^r_i,  α^r_{j,i} = softmax(e^r_{j,i}),  (11)

where e^r_{j,i} = a(s_{j−1}, h^r_i), and a(*) is a feed-forward neural network. Meanwhile, we produce c^s_j from the domain-shared annotations {h^s_i} as in Eq. (11). By introducing c^r_j and c^s_j into s_j, our decoder is able to distinguish and simultaneously exploit the two types of contexts for translation predictions.
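The two context vectors are produced by the same attention recipe run over two different annotation streams. The sketch below uses a dot product as a stand-in for the feed-forward alignment network a(*), and all annotation values are invented toy numbers:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def context(prev_state, annotations, score):
    """c_j = sum_i alpha_{j,i} * h_i with alpha_j = softmax_i e_{j,i},
    where e_{j,i} = score(s_{j-1}, h_i)."""
    e = [score(prev_state, h) for h in annotations]
    alpha = softmax(e)
    dim = len(annotations[0])
    return [sum(a * h[d] for a, h in zip(alpha, annotations)) for d in range(dim)]

# dot product as a toy stand-in for the feed-forward network a(*)
dot = lambda s, h: sum(si * hi for si, hi in zip(s, h))

s_prev = [0.1, -0.3]                  # previous decoder state s_{j-1}
h_spec = [[0.5, 0.2], [0.1, 0.9]]     # domain-specific annotations {h^r_i}
h_shared = [[0.4, 0.4], [0.2, 0.1]]   # domain-shared annotations {h^s_i}
c_r = context(s_prev, h_spec, dot)    # domain-specific context c^r_j
c_s = context(s_prev, h_shared, dot)  # domain-shared context c^s_j
```

Since the attention weights sum to one, each context vector is a convex combination of its annotations and therefore stays inside their componentwise range.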
Domain Classifier. We equip our decoder with a domain classifier parameterized by θ^t_dc, which maximizes the training objective J^t_dc(y; θ^t_dc) = log p(d|y; θ^t_dc). To do this, we also apply an attention operation to produce the domain-aware semantic representation E_r(y) of y:

E_r(y) = Σ_j β_j s_j,  β_j = softmax(v_b^T tanh(W_b s_j)),  (12)

where v_b and W_b are the related parameters. Likewise, we stack a domain classifier on top of E_r(y).
Note that this classifier is only used during model training to infer the attention weights of target words. These weights measure the words' semantic relevance to different domains and can be utilized to adjust their cost weights in the NMT training objective.

NMT Training Objective with Word-Level Cost Weighting. Formally, we define the objective function of NMT as follows:

J_nmt(x, y; θ_nmt) = Σ_j (1 + β_j) log p(y_j | x, y_<j; θ_nmt),  (13)

where β_j is the attention weight of y_j obtained by Eq. (12), and θ_nmt denotes the parameter set of NMT. By this scaling strategy, domain-specific words are emphasized with a bonus, while domain-shared words are updated as usual.
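Expressed as a loss to be minimized, the word-level weighting simply multiplies each word's negative log-likelihood by (1 + β_j). A minimal sketch with invented probabilities and weights:

```python
import math

def weighted_nll(log_probs, betas):
    """Word-level cost weighting: sum_j (1 + beta_j) * (-log p(y_j | x, y_<j)).
    betas are the target-side classifier's attention weights per target word."""
    return sum((1.0 + b) * (-lp) for lp, b in zip(log_probs, betas))

# toy 3-word target sentence; the middle word plays a "bill"-like domain role
log_p = [math.log(0.5), math.log(0.25), math.log(0.8)]
plain = weighted_nll(log_p, [0.0, 0.0, 0.0])       # ordinary NLL
emphasized = weighted_nll(log_p, [0.0, 0.6, 0.0])  # domain word boosted
```

With all β_j = 0 the objective reduces exactly to the standard NLL; a positive β_j adds β_j times that word's loss on top, so errors on domain-specific words cost proportionally more.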
Please note that scaling costs with a multiplicative scalar essentially changes the magnitude of a parameter update without changing its direction (Chen et al., 2017a). Besides, although our scaling strategy is similar to the cost weighting proposed by Chen et al. (2017a), our approach differs in two aspects: First, we employ word-level rather than sentence-level cost weighting to refine NMT training; Second, our approach is less time-consuming for multi-domain NMT.
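The direction-preserving property follows directly from the linearity of the gradient: for the per-word loss ℓ_j(θ) and a fixed scalar weight,

```latex
\nabla_{\theta}\big[(1+\beta_j)\,\ell_j(\theta)\big]
  = (1+\beta_j)\,\nabla_{\theta}\,\ell_j(\theta),
\qquad \beta_j \ge 0,
```

so each word's gradient is rescaled by the positive factor (1 + β_j) but keeps its direction.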

Overall Training Objective
Given a mixed-domain training corpus D = {(x, y, d)}, we train the proposed model according to the following objective function:

J(D; θ) = Σ_{(x,y,d)∈D} [ J_nmt(x, y; θ_nmt) + J^s_dc(x; θ^s_dc) + J^t_dc(y; θ^t_dc) + λ · J^{s*}_adc(x; θ^{s1}_adc, θ^{s2}_adc) ],  (14)

where J_nmt(*), J^s_dc(*), J^t_dc(*) and J^{s*}_adc(*) are the objective functions of NMT, the source-side domain classifier, the target-side domain classifier, and the source-side adversarial domain classifier, respectively, θ = {θ_nmt, θ^s_dc, θ^t_dc, θ^{s1}_adc, θ^{s2}_adc}, and λ is the hyper-parameter for adversarial learning.
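Since the printed equation did not survive extraction, the following one-example sketch only assumes the additive combination the surrounding text describes, with λ scaling the adversarial part; the exact grouping of terms in the original paper may differ:

```python
def overall_objective(j_nmt, j_s_dc, j_t_dc, j_adc, lam=0.1):
    """Single-example sketch of the joint training objective:
    J = J_nmt + J^s_dc + J^t_dc + lam * J^{s*}_adc,
    where lam weights the adversarial term (lam = 0.1 worked best in the paper).
    In training, this is summed over the mixed-domain corpus D."""
    return j_nmt + j_s_dc + j_t_dc + lam * j_adc
```

Each term is a per-example log-likelihood (or entropy) score, so the combined objective is maximized jointly over all parameter sets.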
Particularly, to ensure the encoding accuracy of domain-shared contexts, we follow previous work in adopting an alternating two-phase training strategy, where we alternately optimize J(D; θ) with θ^{s1}_adc and {θ − θ^{s1}_adc} respectively fixed at a time.

Experiment
To investigate the effectiveness of our model, we conducted multi-domain translation experiments on Chinese-English and English-French datasets.

Setup
Datasets. For Chinese-English translation, our data comes from UM-Corpus (Tian et al., 2014) and LDC 1 . To ensure data quality, we chose only the parallel sentences with domain labels Laws, Spoken, and Thesis from UM-Corpus, and the LDC bilingual sentences related to the News domain, as our dataset. We used randomly selected sentences from UM-Corpus and LDC as the development set, and combined the test set of UM-Corpus with randomly selected sentences from LDC to construct our test set. For English-French translation, we conducted experiments on datasets from the OPUS corpus 2 , containing sentence pairs from the Medical, News, and Parliamentary domains. We also divided these datasets into training, development and test sets. Table 1 provides the statistics of the corpora used in our experiments.
We performed word segmentation on Chinese sentences using the Stanford Segmenter 3 , and tokenized English and French sentences using the MOSES script 4 . Then, we employed Byte Pair Encoding (Sennrich et al., 2016) to convert all words into subwords. Translation quality was evaluated by case-sensitive BLEU (Papineni et al., 2002).
Contrast Models. Since our model is essentially a standard attentional NMT model enhanced with word-level domain contexts, we refer to it as +WDC. We compared it with the following models: • OpenNMT 5 . A popular open-source NMT system widely used in the NMT community, trained on the mixed-domain training set.
• DL4NMT-finetune (Luong and Manning, 2015). A reimplemented attentional NMT model that is first trained on the out-of-domain training corpus and then fine-tuned on the in-domain dataset.
• +Domain Control (+DC) (Kobus et al., 2016). It directly introduces the embedding of the source-side domain tag to enrich the annotations of the encoder.
• +Multitask Learning (+ML1) (Dong et al., 2015). It adopts a multi-task learning framework that shares encoder representation and separates the decoder modeling of different domains.
• +Target Token Mixing (+TTM) (Pryzant et al., 2017). This model is similar to +DC, with the only difference that it enriches source annotations by adding the target-side domain tag rather than the source-side one.
Note that our model uses two annotation sequences, so we also compared it with the aforementioned models with two times the hidden state size (2×hd). To further examine the effectiveness of the proposed components of our model, we also provide the performance of the following variants: • +WDC(S). It only exploits the source-side word-level domain contexts for multi-domain NMT.
• +WDC(T). It only employs word-level cost weighting on the target side to refine model training.
Implementation Details. Following the common practice, we only used the training sentences within 50 words to efficiently train NMT models. Thus, 85.40% and 88.96% of the Chinese-English and English-French parallel sentences were covered in our experiments. In addition, we set the vocabulary size for Chinese-English and English-French as 32,000 and 32,000, respectively. In doing so, our vocabularies covered 99.97% Chinese words and 99.99% English words of the Chinese-English corpus, and almost 100% English words and 99.99% French words of the English-French corpus, respectively.
We applied Adam (Kingma and Ba, 2015) to train the models and determined the best model parameters based on performance on the development set. The hyper-parameters were set as follows: β_1 and β_2 of Adam as 0.9 and 0.999, word embedding dimension as 500, hidden layer size as 1000, learning rate as 5×10^−4, batch size as 80, gradient norm as 1.0, dropout rate as 0.1, and beam size as 10. Other settings followed (Bahdanau et al., 2015).

[Table 2: Overall evaluation of the Chinese-English translation task. 2×hd = two times of hidden state size.]

Results on Chinese-English Translation
We first determined the optimal hyper-parameter λ (see Eq. (14)) on the development set. To do this, we gradually varied λ from 0.1 to 1.0 with an increment of 0.1 in each step. Since our model achieved the best performance when λ=0.1, we set λ=0.1 for all subsequent experiments. Table 2 shows the overall experimental results. Using almost the same hyper-parameters, our reimplemented DL4NMT outperforms OpenNMT in all domains, demonstrating that our baseline is competitive. Moreover, on all test sets of different domains, our model significantly outperforms the other contrast models no matter which hyper-parameters they use. Furthermore, we arrive at the following conclusions: First, our model surpasses DL4NMT-single, DL4NMT-mix and DL4NMT-finetune, all of which are commonly used in domain adaptation for NMT. Please note that DL4NMT-finetune requires multiple adapted NMT models to be constructed, while ours is a unified model that works well in all domains.
Second, compared with +DC, +ML2 and +ADM, which all exploit source-side domain contexts for multi-domain NMT, our +WDC(S) still exhibits better performance. This is because these models focus on only one aspect of domain contexts, while our model considers both domain-specific and domain-shared contexts on the source side.
Third, +WDC(T) also outperforms DL4NMT, revealing that it is reasonable and effective to emphasize domain-specific words in model training.
Last, +WDC achieves the best performance compared with both +WDC(S) and +WDC(T). Therefore, we believe that word-level domain contexts on both sides are complementary to each other, and exploiting them simultaneously is beneficial to multi-domain NMT.

Experimental Analysis
Furthermore, we conducted several visualization experiments to empirically analyze the individual effectiveness of the added model components.

Visualizations of Gating Vectors
We first visualized the gating vectors g^r_i and g^s_i to quantify their effects on extracting domain-specific and domain-shared contexts from the initial source-side annotations. Since both g^r_i and g^s_i are high-dimensional vectors that are difficult to visualize directly, we followed previous work (Zhou et al., 2017) in visualizing their individual contributions to the final output, which can be approximated by their first derivatives. Figure 3 shows the first-derivative heat maps for two example sentences from the Laws and Thesis domains, respectively. We can observe that, without any loss of semantic meaning from the source sentences, most of the domain-specific words are strengthened by g^r_i, while most of the domain-shared words, especially function words, are focused on by g^s_i. This result is consistent with our expectations for the two gating vectors.

[Figure 4: The visualization of the sentence representations and their corresponding average annotations, where the triangle-shaped (purple), circle-shaped (red), square-shaped (green) and pentagon-shaped (blue) points denote News, Laws, Spoken and Thesis sentences, respectively.]

Visualizations of Sentence Representations and Annotations
Furthermore, we applied hypertools (Heusser et al., 2018) to visualize the sentence representations E_r(x) and E_s(x), as well as the domain-specific and domain-shared annotation sequences. We represent each annotation sequence by its average vector in the figure.
As shown in Figure 4 (a) and (b), the sentence representation vectors and the average annotation vectors of different domains are clearly distributed in different regions. By contrast, their distributions are much more concentrated in Figure 4 (c) and (d). Thus, we conclude that our model is able to distinctively learn domain-specific and domain-shared contexts. Moreover, from Figure 4 (b), we observe that the sentence representation vectors of the Laws domain do not completely coincide with those of the other domains, which may be caused by the more formal and consistent sentence styles in the Laws domain.

Illustrations of Domain-Specific Target Words
Lastly, for each domain, we present the top ten target words with the highest weights learned by our target-side domain classifier. To do this, we calculated the average attention weight of each word in the training corpus as its corresponding domain weight. As clearly shown in Table 3, most of the listed target words are closely related to their domains. This result validates the aforementioned hypothesis that some words are domain-dependent while others are domain-independent, and that our target-side domain classifier is capable of distinguishing them via different attention weights.
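The ranking procedure above is a simple aggregation. A self-contained sketch, using invented Laws-domain token sequences and attention weights purely for illustration:

```python
from collections import defaultdict

def avg_attention_by_word(corpus):
    """corpus: list of (tokens, weights) pairs, where weights are the
    target-side classifier's attention values for each token occurrence.
    Returns each word's average attention weight over all its occurrences."""
    total, count = defaultdict(float), defaultdict(int)
    for tokens, weights in corpus:
        for tok, w in zip(tokens, weights):
            total[tok] += w
            count[tok] += 1
    return {t: total[t] / count[t] for t in total}

def top_words(avg, k=10):
    """Top-k words by average attention weight, i.e. the table's per-domain list."""
    return sorted(avg, key=avg.get, reverse=True)[:k]

# toy Laws-domain fragments with hypothetical attention weights
corpus = [
    (["the", "bill", "is", "on", "the", "agenda"],
     [0.05, 0.40, 0.05, 0.05, 0.05, 0.40]),
    (["the", "government", "passed", "the", "bill"],
     [0.05, 0.35, 0.15, 0.05, 0.40]),
]
avg = avg_attention_by_word(corpus)
```

Content words that consistently draw high attention ("bill", "agenda") float to the top of the list, while function words like "the" stay near zero, mirroring the pattern in the table.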

Results on English-French Translation
Likewise, we determined the optimal λ=0.1 on the development set. Table 4 gives the results of English-French multi-domain translation. Similar to the previous experimental results in Section 3.2, our model continues to achieve the best performance among all contrast models under both hidden state size settings, which demonstrates again that our model is effective and generalizes across language pairs in multi-domain NMT.

Related Work
In this work, we study multi-domain machine translation, a sub-field of domain adaptation for machine translation, which has attracted great attention since the era of SMT (Clark et al., 2012; Huck et al., 2015; Sennrich et al., 2013). As for NMT, the dominant strategies for domain adaptation generally fall into two categories. The first category is to transfer out-of-domain knowledge to in-domain translation. The conventional method is fine-tuning, which first trains the model on an out-of-domain dataset and then fine-tunes it on an in-domain dataset (Luong and Manning, 2015; Zoph et al., 2016; Servan et al., 2016). Freitag and Al-Onaizan (2016) went further by ensembling the fine-tuned model with the original one. Chu et al. (2017) fine-tuned the model using a mix of in-domain and out-of-domain training corpora. From the perspective of data selection, Chen et al. (2017a) scaled the top-level costs of the NMT system according to each training sentence's similarity to the development set. Wang et al. (2017a) explored a data selection strategy based on sentence embeddings for NMT domain adaptation. Moreover, Wang et al. (2017b) further proposed several sentence and domain weighting methods with a dynamic weight learning strategy. However, these approaches usually only perform well on the target domain, while being highly time-consuming when transferring translation knowledge to all the constituent domains.
The second category is to directly use a mixed-