How to Avoid Unwanted Pregnancies: Domain Adaptation using Neural Network Models

We present novel models for domain adaptation based on the neural network joint model (NNJM). Our models minimize the cross entropy by regularizing the loss function with respect to the in-domain model. Domain adaptation is carried out by assigning higher weights to out-domain sequences that are similar to the in-domain data. Our alternative model takes a more restrictive approach by additionally penalizing sequences similar to the out-domain data. Our models achieve better perplexities than the baseline NNJM models and give improvements of up to 0.5 and 0.6 BLEU points on the Arabic-to-English and English-to-German language pairs, on a standard task of translating TED talks.


Introduction
The rapid influx of digital data has galvanized the use of empirical methods in many fields, including Machine Translation (MT). The increasing availability of bilingual corpora has made it possible to automatically learn translation rules that previously required years of linguistic analysis. While additional data is often beneficial for a general-purpose Statistical Machine Translation (SMT) system, a problem arises when translating new domains such as lectures (Cettolo et al., 2014), patents (Fujii et al., 2010) or medical text (Bojar et al., 2014), where bilingual text either does not exist or is available only in small quantities. Every domain has its own vocabulary and stylistic preferences, which cannot be fully encompassed by a system trained on the general domain.
Machine translation systems trained on a simple concatenation of small in-domain and large out-domain data often perform below par, because the out-domain data is distant from, or overwhelmingly larger than, the in-domain data. Additional data increases lexical ambiguity by introducing new senses for the existing in-domain vocabulary. For example, an Arabic-to-English SMT system trained by simply concatenating in- and out-domain data translates the Arabic phrase " " as "about the problem of unwanted pregnancy". This translation is incorrect in the context of the in-domain data, where it should be "about the problem of choice overload". The sense of the Arabic phrase picked up from the out-domain data completely changes the meaning of the sentence. In this paper, we tackle this problem by proposing domain adaptation models that make use of all the data while preserving the in-domain preferences.
A significant amount of research has been carried out recently in domain adaptation. The complexity of the SMT pipeline, from corpus preparation to word alignment to the training of a wide range of models, opens a wide horizon for domain-specific adaptation. This is typically done using either data selection (Matsoukas et al., 2009) or model adaptation (Foster and Kuhn, 2007). In this paper, we advance the research in model adaptation using the neural network framework.
In recent years, there has been growing interest in deep neural networks (NNs) and word embeddings, with applications to numerous NLP problems. A notably successful attempt on the SMT frontier was recently made by Devlin et al. (2014). They proposed a neural network joint model (NNJM), which augments target n-grams with windows of aligned source words and learns a NN model over the vector representations of such joint streams. The model is then integrated into the decoder and used as an additional language model feature.
Our aim in this paper is to advance the state-of-the-art in SMT by extending the NNJM for domain adaptation, to leverage the huge amounts of out-domain data coming from heterogeneous sources. We hypothesize that the distributed vector representation of the NNJM helps bridge the lexical differences between the in-domain and the out-domain data, and that adaptation is necessary to keep the model from deviating from the in-domain data, which otherwise happens because of the large out-domain data.
To this end, we propose two novel extensions of NNJM for domain adaptation. Our first model minimizes the cross entropy by regularizing the loss function with respect to the in-domain model. The regularizer gives higher weight to the training instances that are similar to the in-domain data. Our second model takes a more conservative approach by additionally penalizing data instances similar to the out-domain data.
We evaluate our models on the standard task of translating the Arabic-English and English-German language pairs. Our adapted models achieve better perplexities (Chen and Goodman, 1999) than models trained on the in-domain and in+out-domain data. The improvements are also reflected in BLEU scores (Papineni et al., 2002) when we compare these models within the SMT pipeline: we obtain gains of up to 0.5 and 0.6 BLEU points on the Arabic-English and English-German pairs over a competitive baseline system. The remainder of this paper is organized as follows: Section 2 gives an account of related work. Section 3 revisits the NNJM model and Section 4 discusses our models. Section 5 presents the experimental setup and the results. Section 6 concludes.

Related Work
Previous work on domain adaptation in MT can be broken down broadly into two main categories: data selection and model adaptation.

Data Selection
Data selection has been shown to be an effective way to discard poor-quality or irrelevant training instances, which, when included in an MT system, hurt its performance. The idea is to score the out-domain data using a model trained on the in-domain data and to apply a cut-off based on the resulting scores. The MT system can then be trained on the subset of the out-domain data that is closest to the in-domain. Selection-based methods can also help reduce computational cost when training is expensive or memory is constrained. Data selection was first done for language modeling, using information retrieval techniques (Hildebrand et al., 2005) and perplexity measures (Moore and Lewis, 2010). Axelrod et al. (2011) extended the work of Moore and Lewis (2010) to translation model adaptation by using both source- and target-side language models. Duh et al. (2013) did the same using a recurrent neural language model instead of an n-gram-based language model. Translation model features were used for data selection by Liu et al. (2014) and Hoang and Sima'an (2014). Durrani et al. (2015a) performed data selection using the operation sequence model (OSM) and NNJM models.
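The perplexity-based selection of Moore and Lewis (2010) can be sketched in a few lines. The following is a toy illustration, not their implementation: it substitutes add-alpha-smoothed unigram models for their real n-gram LMs, and all function names are ours. Out-domain sentences are ranked by the length-normalized cross-entropy difference H_in − H_out, so the most in-domain-like sentences come first.

```python
import math
from collections import Counter

def unigram_lm(corpus, alpha=1.0):
    """Build an add-alpha smoothed unigram LM from a list of sentences.

    Returns a function mapping a sentence to its log probability."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    def logprob(sentence):
        return sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
                   for w in sentence.split())
    return logprob

def moore_lewis_rank(in_domain, out_domain):
    """Rank out-domain sentences by per-word cross-entropy difference.

    Lower H_in - H_out means more in-domain-like; a cut-off on this
    ranking yields the selected subset."""
    lp_in = unigram_lm(in_domain)
    lp_out = unigram_lm(out_domain)
    def score(sent):
        n = max(len(sent.split()), 1)
        return (-lp_in(sent) + lp_out(sent)) / n
    return sorted(out_domain, key=score)
```

In practice the cut-off threshold applied to this ranking must be tuned, which is exactly the cost that motivates the model-adaptation alternative discussed next.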

Model Adaptation
The downside of data selection is that finding an optimal cut-off threshold is a time-consuming process. An alternative to completely filtering out less useful data is to minimize its effect by down-weighting it. This is more robust than selection since it takes advantage of the complete out-domain data, with intelligent weighting towards the in-domain. Matsoukas et al. (2009) proposed a classification-based sentence-weighting method for adaptation. Foster et al. (2010) extended this by weighting phrases rather than sentence pairs. Other researchers have carried out weighting by merging phrase tables through linear interpolation (Finch and Sumita, 2008; Nakov and Ng, 2009) or log-linear combination (Foster and Kuhn, 2009; Bisazza et al., 2011; Sennrich, 2012), and through phrase-training-based adaptation (Mansour and Ney, 2013). Durrani et al. (2015a) applied EM-based mixture modeling to the OSM and NNJM models to perform model weighting. Chen et al. (2013b) used a vector space model for adaptation at the phrase level, where every phrase pair is represented as a vector whose entries reflect its relatedness to each domain. Mixture model adaptation has also been applied to the reordering model.
In this paper, we perform model adaptation using a neural network framework. In contrast to previous work, we perform it at the (bilingual) n-gram level, where n is sufficiently large to capture long-range cross-lingual dependencies. The generalized vector representation of the neural network model reduces the data sparsity issue of traditional Markov-based models by learning better word classes. Furthermore, our specially designed loss functions for adaptation help the model avoid deviating from the in-domain data without losing the ability to generalize.

Neural Network Joint Model
In recent years, a great deal of effort has been dedicated to neural networks (NNs) and word embeddings, with applications to SMT and other areas of NLP (Bengio et al., 2003; Auli et al., 2013; Kalchbrenner and Blunsom, 2013; Gao et al., 2014; Schwenk, 2012; Collobert et al., 2011; Mikolov et al., 2013a; Socher et al., 2013; Hinton et al., 2012). Recently, Devlin et al. (2014) proposed a neural network joint model (NNJM) and integrated it into the decoder as an additional feature, showing impressive improvements on Arabic-to-English and Chinese-to-English MT tasks. Let us briefly revisit the NNJM.
Given a source sentence S and its corresponding target sentence T, the NNJM computes the conditional probability P(T|S) as follows:

P(T|S) ≈ ∏_{i=1}^{|T|} P(t_i | t_{i-1}, ..., t_{i-p+1}, s_i)    (1)

where s_i is a q-word source window for the target word t_i, based on the one-to-one (non-NULL) alignment of T to S. As exemplified in Figure 1, this is essentially a (p+q)-gram neural network LM (NNLM), originally proposed by Bengio et al. (2003).

Each input word, i.e., a source or target word in the context, is represented by a D-dimensional vector in the shared look-up layer L ∈ R^{|V_i| × D}, where V_i is the input vocabulary. 1 The look-up layer then creates a context vector x_n representing the context words of the (p+q)-gram sequence by concatenating their respective vectors in L. The concatenated vector is passed through non-linear hidden layers to learn a high-level representation, which is in turn fed to the output layer. The output layer has a softmax activation over the output vocabulary V_o of target words. Formally, the probability of the k-th output word given the context x_n can be written as:

P(y_n = k | x_n, θ) = exp(w_k^T φ(x_n)) / Σ_{k'=1}^{|V_o|} exp(w_{k'}^T φ(x_n))    (2)

1 Note that L is a model parameter to be learned.
where φ(x_n) defines the transformation of x_n through the hidden layers, and w_k are the weights from the last hidden layer to the output layer. For notational simplicity, we will henceforth use (x_n, y_n) to represent a training sequence.

By setting p and q sufficiently large, the NNJM can capture long-range cross-lingual dependencies between words, while still overcoming the data sparseness issue by virtue of its distributed representations (i.e., word vectors). A major bottleneck, however, is the computational cost involved in training the model and applying it during MT decoding. Devlin et al. (2014) proposed two tricks to speed up decoding. The first is to pre-compute the hidden layer values and fetch them directly as needed during decoding. The second is to train a self-normalized NNJM to avoid computing the softmax normalization factor (i.e., the denominator in Equation 2) at decoding time. Self-normalization, however, does not reduce the cost of training the model. In the following, we describe a method to address this issue.
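The forward pass just described (look-up, concatenation, one non-linear hidden layer, softmax) can be sketched with numpy. This is an illustrative toy, not the paper's implementation: the sizes are tiny, the weights are random rather than trained, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper's real setup is far larger (e.g. D=150, H=750).
V_i, V_o = 50, 30     # input / output vocabulary sizes
D, H = 8, 16          # word-vector and hidden-layer dimensions
n_ctx = 13            # context words of a 14-gram joint stream (target + source)

L = rng.normal(0.0, 0.1, (V_i, D))          # shared look-up layer (learned)
W1 = rng.normal(0.0, 0.1, (n_ctx * D, H))   # hidden-layer weights
b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, V_o))         # output-layer weight vectors w_k
b2 = np.zeros(V_o)

def nnjm_prob(context_ids):
    """Return the softmax distribution P(y | x_n) over the target vocabulary."""
    x = L[context_ids].reshape(-1)     # concatenate looked-up context vectors
    h = np.tanh(x @ W1 + b1)           # phi(x_n): one non-linear hidden layer
    logits = h @ W2 + b2               # w_k . phi(x_n) for every output word k
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

probs = nnjm_prob(rng.integers(0, V_i, n_ctx))
```

The softmax denominator is a sum over all |V_o| output words, which is precisely the per-instance cost that the NCE training described next avoids.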

Training by Noise Contrastive Estimation
The standard way to train NNLMs is to maximize the log likelihood of the training data:

J(θ) = Σ_n Σ_{k=1}^{|V_o|} y_nk log P(y_n = k | x_n, θ)    (3)

where y_nk = I(y_n = k) is an indicator variable (i.e., y_nk = 1 when y_n = k, otherwise 0). Optimization is performed using first-order online methods, such as stochastic gradient ascent (SGA) with the standard backpropagation algorithm. Unfortunately, training NNLMs this way is impractically slow because, for each training instance (x_n, y_n), the softmax output layer (see Equation 2) needs to compute a summation over all words in the output vocabulary. 2

Noise contrastive estimation, or NCE (Gutmann and Hyvärinen, 2010), provides an efficient and stable way to avoid this repetitive computation, as recently applied to NNLMs (Vaswani et al., 2013; Mnih and Teh, 2012). We can re-write Equation 2 as follows:

P(y_n = k | x_n, θ) = σ(y_n = k | x_n, θ) / Z(x_n)    (4)

where σ(.) is the un-normalized score and Z(.) is the normalization factor. In NCE, we treat Z(.) as an additional model parameter along with the regular parameters, i.e., weights and look-up vectors. However, it has been shown that fixing Z(.) to 1 instead of learning it in training does not affect model performance (Mnih and Teh, 2012).

For each training instance (x_n, y_n), we add M noise samples (x_n, y_n^m) by sampling y_n^m from a known noise distribution ψ (e.g., unigram, uniform) M times (i.e., m = 1 ... M):

ψ_nk = P(y_n^m = k | x_n, ψ)    (5)

see Figure 1. The NCE loss is then defined to discriminate a true instance from a noisy one. Let C ∈ {0, 1} denote the class of an instance, with C = 1 indicating true and C = 0 indicating noise. NCE maximizes the following conditional log likelihood:

J(θ) = Σ_n [ log (P(y_n, C=1 | x_n, θ, π) / Q) + Σ_{m=1}^{M} log (P(y_n^m, C=0 | x_n, ψ, π) / Q) ]    (6)

where Q = P(y_n, C = 1 | x_n, θ, π) + P(y_n^m, C = 0 | x_n, ψ, π) is a normalization constant. After removing the constant terms, Equation 6 can be further simplified as:

J(θ) = Σ_n Σ_{k=1}^{|V_o|} [ y_nk log (σ_nk / (σ_nk + M ψ_nk)) + Σ_{m=1}^{M} y_nk^m log (M ψ_nk / (σ_nk + M ψ_nk)) ]    (7)

where ψ_nk = P(y_n^m = k | x_n, ψ) is the noise distribution, σ_nk = σ(y_n = k | x_n, θ) is the un-normalized score at the output layer (Equation 4), and y_nk and y_nk^m are indicator variables as defined before.
NCE reduces the number of computations needed at the output layer from |V o | to M + 1, where M is a small number in comparison with |V o |. In all our experiments we use NCE loss with M = 100 samples as suggested by Mnih and Teh (2012).
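For one training instance, the NCE objective reduces to a handful of logarithms. The sketch below is our own illustration of the per-instance computation with Z(x_n) fixed to 1 (Mnih and Teh, 2012); the function name and argument names are ours.

```python
import math

def nce_log_likelihood(sigma_true, psi_true, sigma_noise, psi_noise, M):
    """Per-instance NCE objective with the partition function Z fixed to 1.

    sigma_true / sigma_noise: un-normalized model scores sigma(y | x_n)
    of the true word and of the M noise samples;
    psi_true / psi_noise: their probabilities under the noise distribution."""
    # log P(C=1 | y_n, x_n): the true word should look "true"
    ll = math.log(sigma_true / (sigma_true + M * psi_true))
    # log P(C=0 | y_n^m, x_n): each noise sample should look "noise"
    for s, p in zip(sigma_noise, psi_noise):
        ll += math.log(M * p / (s + M * p))
    return ll
```

Only the true word and the M samples are touched, which is how the per-instance cost drops from |V_o| to M + 1 score evaluations.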

Neural Domain Adaptation Models
The ability of NNJMs to generalize and learn complex semantic relationships (Mikolov et al., 2013b), together with their compelling empirical results, gives strong motivation to use the NNJM for domain adaptation in machine translation. However, the vanilla NNJM described above is limited in its ability to learn effectively from large and diverse out-domain data in the best interest of the in-domain data. To address this, we propose two neural domain adaptation models (NDAM) extending the NNJM. Our models add a regularizer to its loss function, defined with respect to either the in-domain model alone or both the in- and out-domain models. In both cases, we first present the regularized loss function for the normalized output layer with the standard softmax, followed by the corresponding un-normalized one using noise contrastive estimation.

NDAM v1
To improve the generalization of word embeddings, NNLMs are generally trained on very large datasets (Mikolov et al., 2013a; Vaswani et al., 2013). We therefore aim to train our neural domain adaptation models (NDAM) on the in- plus out-domain data, while restraining them from drifting away from the in-domain. In our first model, NDAM_v1, we achieve this by biasing the model towards the in-domain using a regularizer (or prior) based on the in-domain model. Let θ_i be an NNJM already trained on the in-domain data. We train an adapted model θ_a on the whole data, but regularize it with respect to θ_i. We redefine the normalized loss function of Equation 3 as follows:

J(θ_a) = Σ_n Σ_{k=1}^{|V_o|} [ (1-λ) y_nk + λ p_nk(θ_i) ] log ŷ_nk(θ_a)    (8)

where ŷ_nk(θ_a) is the softmax output and p_nk(θ_i) is the probability of the training instance according to the in-domain model θ_i. Notice that the loss function minimizes the cross entropy of the current model θ_a with respect to both the gold labels y_n and the in-domain model θ_i. The mixing parameter λ ∈ [0, 1] determines the relative strength of the two components. 3 Similarly, we can re-define the NCE loss of Equation 7 as:

J(θ_a) = Σ_n Σ_{k=1}^{|V_o|} [ ((1-λ) y_nk + λ p_nk(θ_i)) log (σ_nk / (σ_nk + M ψ_nk)) + Σ_{m=1}^{M} y_nk^m log (M ψ_nk / (σ_nk + M ψ_nk)) ]    (9)

We use SGA with backpropagation to train this model. The derivatives of J(θ_a) in Equation 8 with respect to the final-layer weight vectors w_j turn out to be:

∂J(θ_a)/∂w_j = Σ_n [ (1-λ) y_nj + λ p_nj(θ_i) − ŷ_nj(θ_a) ] φ(x_n)    (10)

3 We used a balanced value λ = 0.5 for our experiments.
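In the softmax form, the regularized loss amounts to cross entropy against a soft target that mixes the gold one-hot label with the in-domain model's distribution. The sketch below is our reading of that description, not the authors' code; the exact placement of λ versus 1−λ is our assumption (with the balanced λ = 0.5 used in the paper, the direction is immaterial).

```python
import numpy as np

def ndam_v1_loss(y_hat, gold_idx, p_in, lam=0.5):
    """Cross entropy of the adapted model theta_a against a soft target.

    y_hat    : adapted model's softmax distribution over the output vocab
    gold_idx : index of the gold target word y_n
    p_in     : in-domain model's distribution p_nk(theta_i)
    lam      : mixing weight (assumed form; lam=0 gives plain cross entropy)."""
    y = np.zeros_like(y_hat)
    y[gold_idx] = 1.0                      # gold indicator y_nk
    target = (1.0 - lam) * y + lam * p_in  # regularized soft target
    return -np.sum(target * np.log(y_hat + 1e-12))
```

N-gram sequences that the in-domain model likes thus contribute soft targets close to the model's own preferences, effectively up-weighting in-domain-like instances during training.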

NDAM v2
The regularizer in NDAM_v1 is based on an in-domain model θ_i, which puts higher weights on the training instances (i.e., n-gram sequences) that are similar to in-domain ones. This may work well when the out-domain data is similar to the in-domain data. In cases where the out-domain data is different, we might want to build a more conservative model that also penalizes training instances for being similar to out-domain ones. Let θ_i and θ_o be two NNJMs already trained on the in- and out-domain data, respectively, where θ_o is trained using the same vocabulary as θ_i. We define the new normalized loss function as follows:

J(θ_a) = Σ_n Σ_{k=1}^{|V_o|} [ (1-λ) y_nk + λ (p_nk(θ_i) − p_nk(θ_o)) ] log ŷ_nk(θ_a)    (11)

where y_nk, ŷ_nk(θ_a), p_nk(θ_i) and p_nk(θ_o) are defined as before. This loss function minimizes the cross entropy of the current model θ_a with respect to the gold labels y_n and the difference between the in-domain model θ_i and the out-domain model θ_o. Intuitively, the regularizer assigns higher weights to training instances that are not only similar to the in-domain but also dissimilar to the out-domain. The parameter λ ∈ [0, 1] determines the strength of the regularization. The corresponding NCE loss can be defined as follows:

J(θ_a) = Σ_n Σ_{k=1}^{|V_o|} [ ((1-λ) y_nk + λ (p_nk(θ_i) − p_nk(θ_o))) log (σ_nk / (σ_nk + M ψ_nk)) + Σ_{m=1}^{M} y_nk^m log (M ψ_nk / (σ_nk + M ψ_nk)) ]    (12)

The derivatives of the cost function in Equation 11 with respect to the final-layer weight vectors w_j are:

∂J(θ_a)/∂w_j = Σ_n [ (1-λ) y_nj + λ (p_nj(θ_i) − p_nj(θ_o)) − (1-λ) ŷ_nj(θ_a) ] φ(x_n)    (13)

In a way, the regularizers in our loss functions are inspired by the data selection method of Axelrod et al. (2011), who use the cross entropy between in- and out-domain LMs to score out-domain sentences. However, our approach differs from theirs in several aspects. First and most importantly, we take the scoring inside model training and use it to bias the training towards the in-domain model. Both the scoring and the training are performed at the bilingual n-gram level rather than at the sentence level. Integrating the scoring inside the model allows us to learn a robust model by training/tuning the relevant parameters, while still using the complete data.
Secondly, our models are based on NNs, while theirs rely on traditional Markov-based generative models.
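The more conservative variant can be sketched analogously, replacing the in-domain prior by the difference between the in- and out-domain models. Again this is our reading of the description, not the authors' code; the difference form p_in − p_out and the mixing are our assumptions.

```python
import numpy as np

def ndam_v2_loss(y_hat, gold_idx, p_in, p_out, lam=0.5):
    """Cross entropy against a target that rewards similarity to the
    in-domain model and penalizes similarity to the out-domain model.

    p_in / p_out : distributions of the in-/out-domain NNJMs (same vocab);
    lam          : regularization strength (assumed form)."""
    y = np.zeros_like(y_hat)
    y[gold_idx] = 1.0
    # Assumed regularizer: the in-/out-domain difference shifts the target
    target = (1.0 - lam) * y + lam * (p_in - p_out)
    return -np.sum(target * np.log(y_hat + 1e-12))
```

When the two domain models agree (p_in = p_out) the regularizer cancels, so only instances where the in-domain model genuinely disagrees with the out-domain model receive extra weight.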

Technical Details
In this section, we describe some implementation details of NDAM that we found to be crucial: using gradient clipping to handle the vanishing/exploding gradient problem in SGA training with backpropagation, selecting an appropriate noise distribution in NCE, and specially handling out-domain words that are unknown to the in-domain data.

Gradient Clipping
Two common issues when training deep NNs on large datasets are the vanishing and exploding gradient problems (Pascanu et al., 2013). The error gradients propagated by backpropagation may sometimes become very small or very large, which can lead to undesired (NaN) values in the weight matrices, causing training to fail. We experienced this problem quite often when training our NDAM models. One simple solution is to truncate the gradients, a technique known as gradient clipping (Mikolov, 2012). In our experiments, we limit the gradients to the range [−5, +5].
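The clipping step itself is a one-liner; a minimal sketch (the function name is ours):

```python
import numpy as np

def clip_gradients(grad, bound=5.0):
    """Truncate each gradient component to [-bound, +bound] before the
    SGA update, preventing exploding gradients from producing NaNs."""
    return np.clip(grad, -bound, bound)
```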

Noise Distribution in NCE
Training with NCE relies on sampling from a noise distribution (i.e., ψ in Equation 5), and the performance of the NDAM models varies considerably with the choice of this distribution. We explored uniform and unigram noise distributions in this work. Under the uniform distribution, every word in the output vocabulary has the same probability of being sampled as noise. The unigram noise distribution is a multinomial distribution over words, constructed by counting their occurrences in the output position (i.e., the n-th word of the n-gram sequence). In our experiments, the unigram distribution delivered much lower perplexity and better MT results than the uniform one. Mnih and Teh (2012) reported similar findings on perplexity.
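Constructing the unigram noise distribution is just normalized counting over the output words, followed by multinomial sampling. A small sketch under our own naming:

```python
import numpy as np
from collections import Counter

def unigram_noise(target_words, seed=0):
    """Build the unigram noise distribution psi over observed output words
    and return a sampler that draws M noise words per training instance."""
    rng = np.random.default_rng(seed)
    counts = Counter(target_words)
    vocab = sorted(counts)
    psi = np.array([counts[w] for w in vocab], dtype=float)
    psi /= psi.sum()                        # normalize counts to a multinomial
    def sample(M):
        return list(rng.choice(vocab, size=M, p=psi))
    return vocab, psi, sample
```

Frequent output words are thus sampled as noise more often, which makes the noise samples harder to discriminate than uniform ones and, empirically, yields the better perplexities noted above.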

Handling of Unknown Words
In order to reduce training time and to learn better word representations, NNLMs are often trained on only the most frequent vocabulary words, with low-frequency words collapsed into a class of unknown words, unk. This results in a large number of n-gram sequences containing at least one unk word and thereby makes unk a highly probable word in the model. 4 Our NDAM models rely on scoring out-domain sequences (of word IDs) using models trained on the in-domain vocabulary. To score an out-domain sequence with such a model, we must generate the sequence using the same vocabulary on which the model was trained. In doing so, out-domain words that are unknown to the in-domain data map to the same unk class. As a result, out-domain sequences containing unks receive high probability even though they are distant from the in-domain data.
A solution to this problem is to have an in-domain model that can differentiate between its own unk class, which results from the reduced in-domain vocabulary, and actual unknown words that come from the out-domain data. We introduce a new class, unk_o, to represent the latter. We train the in-domain model by adding a few dummy sequences containing unk_o on both the source and target sides. This enables the model to learn unk and unk_o separately, with unk_o being a less probable class according to the model. Later, the n-gram sequences of the out-domain data contain both unk and unk_o classes, depending on whether a word is unknown only to the pruned in-domain vocabulary (i.e., unk) or also to the full in-domain vocabulary (i.e., unk_o).
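The two-way mapping can be sketched as a simple vocabulary lookup. This is an illustration of the scheme just described, with our own function and token names:

```python
def map_out_domain_tokens(tokens, pruned_vocab, full_vocab):
    """Map an out-domain token sequence into in-domain word classes.

    <unk>   : seen in the in-domain data but pruned from its training vocab
    <unk_o> : never observed in the in-domain data at all (the class the
              in-domain model learns to keep low-probability)."""
    mapped = []
    for w in tokens:
        if w in pruned_vocab:
            mapped.append(w)            # in the pruned training vocabulary
        elif w in full_vocab:
            mapped.append("<unk>")      # in-domain word, pruned for training
        else:
            mapped.append("<unk_o>")    # genuinely out-domain word
    return mapped
```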

Evaluation
In this section, we describe the experimental setup (i.e., the data and the settings for the NN models and the MT pipeline) and the results. First, we evaluate our models intrinsically by comparing their perplexities on a held-out in-domain test set against the baseline NNJM model. Then we carry out an extrinsic evaluation by using the NNJM and NDAM models as features in machine translation and comparing BLEU scores. Initial development experiments were done on the Arabic-to-English language pair.
We carried out further experiments on the English-to-German pair to validate our models.

Data
We experimented with the data made publicly available for the translation task of the International Workshop on Spoken Language Translation (IWSLT) (Cettolo et al., 2014). We used TED talks as our in-domain corpus. For Arabic-to-English, we used the QCRI Educational Domain (QED) corpus, a bilingual collection of educational lectures 5 (Abdelali et al., 2014), the News corpus, and the multiUN (UN) corpus (Eisele and Chen, 2010) as our out-domain corpora. For English-to-German, we used the News, Europarl (EP), and Common Crawl (CC) corpora made available for the 9th Workshop on Statistical Machine Translation. 6 Table 1 shows the sizes of the data used.
Training NN models is expensive. We therefore randomly selected subsets of about 300K sentences from the bigger domains (UN, CC and EP) to train the NN models. 7 The systems were tuned on the concatenation of the dev and test2010 sets and evaluated on the test2011-2013 sets. The tuning set was also used to measure the perplexities of the different models.

System Settings
NNJM & NDAM: The NNJM models were trained using the NPLM 8 toolkit (Vaswani et al., 2013) with the following settings. We used a target context of 5 words and an aligned source window of 9 words, forming a joint stream of 14-grams for training. We restricted the source and target side vocabularies to the 20K and 40K most frequent words, respectively. The word vector size D and the hidden layer size were set to 150 and 750, respectively. Only one hidden layer was used, to allow faster decoding. Training was done by standard stochastic gradient ascent with NCE, using 100 noise samples and a mini-batch size of 1000. All models were trained for 25 epochs. We used identical settings to train the NDAM models, except for the special handling of unk tokens.

5 Guzmán et al. (2013) showed that the QED corpus is similar to IWSLT and adding it improves translation quality.
6 http://www.statmt.org/wmt14/translation-task.html
7 Concatenating all the data results in a corpus of approximately 4.5 million sentences, which requires roughly 18 days of wall-clock time (18 hours/epoch on a Linux Ubuntu 12.04.5 LTS machine with a 16-core Intel Xeon E5-2650 at 2.00GHz and 64GB RAM) to train NNJM models on our machines. We ran one baseline experiment with all the data and did not find it better than a system trained on a randomly selected subset of the data. In the interest of time, we therefore reduced the NN training data to a subset (800K and 1M sentences for AR-EN and EN-DE, respectively).
8 http://nlg.isi.edu/software/nplm/
Machine Translation System: We trained a Moses system (Koehn et al., 2007) with the following settings: a maximum sentence length of 80, fast_align for word alignments (Dyer et al., 2013), an interpolated Kneser-Ney smoothed 5-gram language model built with KenLM (Heafield, 2011), a lexicalized reordering model (Galley and Manning, 2008), a 5-gram operation sequence model (Durrani et al., 2015b), and otherwise default parameters. We also used an NNJM trained with the settings described above as an additional feature in our baseline system. In the adapted systems, we replaced the NNJM model with the NDAM models. We used ATB segmentation with the Stanford ATB segmenter (Green and DeNero, 2012) for Arabic-to-English, and the default tokenizer provided with the Moses toolkit for English-to-German. Arabic OOVs were translated using the unsupervised transliteration module in Moses. We used k-best batch MIRA (Cherry and Foster, 2012) for tuning.

Intrinsic Evaluation
In this section, we compare the NNJM model and our NDAM models in terms of their perplexities on the in-domain held-out dataset (i.e., dev+test2010). We chose the Arabic-English language pair for the development experiments and trained domain-wise models to measure the relatedness of each domain to the in-domain data. We later replicated selected experiments on the English-German language pair. The first part of the results shows that there is useful information available in each domain which can be utilized to improve the baseline. It also shows the robustness of neural network models: unlike the n-gram model, the NN-based model generalizes better as more data is added, without skewing completely towards the dominant part of the data. Concatenating the in-domain data with the NEWS data gave better perplexities than the other domains, and the best results were obtained by concatenating all the data together (see row ALL). The third and fourth columns show the results of our models (NDAM_v*). Both give better perplexities than NNJM_cat in all cases; however, it is unclear which of the two is better. Similar observations were made for the English-to-German pair, for which we only ran experiments on the concatenation of all domains.

Extrinsic Evaluation
Arabic-to-English: For most language pairs, the conventional wisdom is to train the system on all available data. However, previously reported MT results on Arabic-to-English (Mansour and Ney, 2013) show that this is not optimal and that the results are often worse than using only the in-domain data. The reason is that the UN domain is distant from, and overwhelmingly larger than, the in-domain IWSLT data. Our domain-wise experiments confirmed this finding.
We considered three baseline systems, among them B_in and B_cat. Table 4 shows the results of the MT systems S_v1 and S_v2, which use our adapted models NDAM_v1 and NDAM_v2. We compare them to the baseline system B_cat, which uses the non-adapted NNJM_cat as a feature. S_v1 achieved an improvement of up to +0.4 and S_v2 of up to +0.5 BLEU points. However, S_v2 performs slightly worse than S_v1 on individual domains. We speculate that this is due to the nature of NDAM_v2, which gives high weight to out-domain sequences that are liked by the in-domain model and disliked by the out-domain model. In the case of individual domains, NDAM_v2 might over-penalize the out-domain data, since the out-domain model is built on that particular domain alone and always prefers it more than the in-domain model does. In the case of ALL, the out-domain model is more diverse and has a different level of affinity for each domain.
We analyzed the output of the baseline system (B_cat) and spotted several cases of lexical ambiguity caused by the out-domain data. For example, one Arabic phrase can be translated as either choice overload or unwanted pregnancy; the latter translation is incorrect in the context of the in-domain data. The bias created by the out-domain data caused the baseline to choose the contextually incorrect translation unwanted pregnancy, whereas the adapted systems S_v* translated it correctly. In another example (How about fitness?), a word was translated as proprietary by the baseline, a translation frequently observed in the out-domain data; S_v* translated it correctly as fitness, as preferred by the in-domain data.
English-to-German: Concatenating all the training data has been shown to give the best results for English-to-German (Birch et al., 2014). Therefore, we did not run domain-wise experiments, except for training a system on the in-domain IWSLT data for the sake of completeness. We also tried a B_cat,in variation, i.e., training the MT system on the entire data while using only the in-domain data to train the baseline NNJM. The baseline system B_cat gave better results and was used as our reference for comparison. Table 5 shows the results of our systems, S_v1 and S_v2, compared to the baselines B_in and B_cat. Unlike in Arabic-to-English, the baseline system B_in is much worse than B_cat. Our adapted MT systems S_v1 and S_v2 both outperformed the best baseline system (B_cat) with improvements of up to 0.6 points. S_v2 performed slightly better than S_v1 on one occasion and slightly worse on the others.
Comparison with Data Selection: We also compared our results with MML-based data selection (Axelrod et al., 2011); the results are shown in Table 6. The MML-based baseline systems (B_mml) used the top 20% of the selected data to train both the MT system and the NNJM. On Arabic-English, both MML-based selection and our model (S_v1) gave similar gains over the baseline system (B_cat). Further results showed the two approaches to be complementary: we obtained an average gain of +0.3 BLEU points by training an NDAM_v1 model on the selected data (see S_v1+mml). On English-German, however, MML-based selection caused a drop in performance (see Table 6). Training an adapted NDAM_v1 model on the selected data gave improvements over MML on two test sets but could not restore the baseline performance, probably because useful data had already been filtered out by the selection process.

Conclusion
We presented two novel models for domain adaptation based on the NNJM. Adaptation is performed by regularizing the loss function towards the in-domain model and away from unrelated out-domain data. Our models achieve better perplexities than the non-adapted baseline NNJM models. When integrated into a machine translation system, they yield gains of up to 0.5 and 0.6 BLEU points for Arabic-to-English and English-to-German, respectively, over strong baselines.