Prior Knowledge Integration for Neural Machine Translation using Posterior Regularization

Although neural machine translation has made significant progress recently, how to integrate multiple overlapping, arbitrary prior knowledge sources remains a challenge. In this work, we propose to use posterior regularization to provide a general framework for integrating prior knowledge into neural machine translation. We represent prior knowledge sources as features in a log-linear model, which guides the learning process of the neural translation model. Experiments on Chinese-English translation show that our approach leads to significant improvements.


Introduction
The past several years have witnessed the rapid development of neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015), which aims to model the translation process using neural networks in an end-to-end manner. With the capability of capturing long-distance dependencies due to the gating (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) and attention (Bahdanau et al., 2015) mechanisms, NMT has shown remarkable superiority over conventional statistical machine translation (SMT) across a variety of natural languages (Junczys-Dowmunt et al., 2016).
Despite the apparent success, NMT still suffers from one significant drawback: it is difficult to integrate prior knowledge into neural networks. On one hand, neural networks use continuous real-valued vectors to represent all language structures involved in the translation process. While these vector representations prove to be capable of capturing translation regularities implicitly (Sutskever et al., 2014), it is hard to interpret each hidden state in neural networks from a linguistic perspective. On the other hand, prior knowledge in machine translation is usually represented in discrete symbolic forms such as dictionaries and rules (Nirenburg, 1989) that explicitly encode translation regularities. It is difficult to transform prior knowledge represented in discrete forms to continuous representations required by neural networks.
* Corresponding author: Yang Liu.
Therefore, a number of authors have endeavored to integrate prior knowledge into NMT in recent years, either by modifying model architectures (Cohn et al., 2016; Tang et al., 2016; Feng et al., 2016) or by modifying training objectives (Cohn et al., 2016; Feng et al., 2016). For example, to address the over-translation and under-translation problems widely observed in NMT, one line of work directly extends standard NMT to model the coverage constraint that each source phrase should be translated into exactly one target phrase (Koehn et al., 2003). Alternatively, Cohn et al. (2016) and Feng et al. (2016) propose to control the fertilities of source words by appending additional additive terms to training objectives.
Although these approaches have demonstrated clear benefits of incorporating prior knowledge into NMT, how to combine multiple overlapping, arbitrary prior knowledge sources still remains a major challenge. It is difficult to achieve this end by directly modifying model architectures because neural networks usually impose strong independence assumptions between hidden states. As a result, extending a neural model requires that the interdependence of information sources be modeled explicitly (Tang et al., 2016), which makes such extensions difficult. While this drawback can be partly alleviated by appending additional additive terms to training objectives (Cohn et al., 2016; Feng et al., 2016), these terms are restricted to a limited number of simple constraints.
In this work, we propose a general framework for integrating multiple overlapping, arbitrary prior knowledge sources into NMT using posterior regularization (Ganchev et al., 2010). Our framework is capable of incorporating indirect supervision via posterior distributions of neural translation models. To represent prior knowledge sources as arbitrary real-valued features, we define the posterior distribution as a log-linear model instead of a constrained posterior set (Ganchev et al., 2010). This treatment not only leads to a simpler and more efficient training algorithm but also achieves better translation performance. Experiments show that our approach is able to incorporate a variety of features and achieves significant improvements over posterior regularization using constrained posterior sets on NIST Chinese-English datasets.

Neural Machine Translation
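Given a source sentence $\mathbf{x} = x_1, \dots, x_i, \dots, x_I$ and a target sentence $\mathbf{y} = y_1, \dots, y_j, \dots, y_J$, a neural translation model (Sutskever et al., 2014; Bahdanau et al., 2015) is usually factorized as a product of word-level translation probabilities:
$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{j=1}^{J} P(y_j \mid \mathbf{x}, \mathbf{y}_{<j}; \theta),$$
where $\theta$ is a set of model parameters and $\mathbf{y}_{<j} = y_1, \dots, y_{j-1}$ denotes a partial translation. The word-level translation probability is defined using a softmax function:
$$P(y_j \mid \mathbf{x}, \mathbf{y}_{<j}; \theta) \propto \exp\big(f(\mathbf{v}_{y_j}, \mathbf{v}_{\mathbf{x}}, \mathbf{v}_{\mathbf{y}_{<j}}, \theta)\big),$$
where $f(\cdot)$ is a non-linear function, $\mathbf{v}_{y_j}$ is a vector representation of the $j$-th target word $y_j$, $\mathbf{v}_{\mathbf{x}}$ is a vector representation of the source sentence $\mathbf{x}$ that encodes the context on the source side, and $\mathbf{v}_{\mathbf{y}_{<j}}$ is a vector representation of the partial translation $\mathbf{y}_{<j}$ that encodes the context on the target side.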
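The factorization can be illustrated with a small sketch. It assumes the scores of the non-linear function f are already available as plain numbers; the names `softmax`, `sentence_log_prob`, `toy_scores`, and `toy_target` are hypothetical and not part of the original system.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_log_prob(step_scores, target_ids):
    """Sum word-level log-probabilities, i.e. the log of the product over positions.

    step_scores[j][k] stands in for f(v_k, v_x, v_{y<j}, theta), the score of
    vocabulary word k at target position j (toy numbers here, an assumption).
    """
    log_prob = 0.0
    for scores, word_id in zip(step_scores, target_ids):
        log_prob += math.log(softmax(scores)[word_id])
    return log_prob

# Toy example: a vocabulary of three words and a two-word target sentence.
toy_scores = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_target = [0, 1]
print(sentence_log_prob(toy_scores, toy_target))
```

Working in log space turns the product over target positions into a sum, which is also the quantity that the log-likelihood objective accumulates over the training set.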
Given a training set $\{\langle \mathbf{x}^{(n)}, \mathbf{y}^{(n)} \rangle\}_{n=1}^{N}$, the standard training objective is to maximize the log-likelihood of the training set:
$$\hat{\theta} = \mathop{\mathrm{argmax}}_{\theta} \{ L(\theta) \},$$
where
$$L(\theta) = \sum_{n=1}^{N} \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \theta).$$
Although the introduction of vector representations into machine translation has resulted in substantial improvements in terms of translation quality (Junczys-Dowmunt et al., 2016), it is difficult to incorporate prior knowledge represented in discrete symbolic forms into NMT. For example, given a Chinese-English dictionary containing ground-truth translational equivalents such as ⟨baigong, the White House⟩, it is non-trivial to leverage the dictionary to guide the learning process of NMT. To address this problem, Tang et al. (2016) propose a new architecture called phraseNet on top of RNNsearch (Bahdanau et al., 2015) that equips standard NMT with an external memory storing phrase tables.
Another important prior knowledge source is the coverage constraint (Koehn et al., 2003): each source phrase should be translated into exactly one target phrase. To encode this linguistic intuition into NMT, previous work extends standard NMT with a coverage vector that keeps track of the attention history.
While these approaches are capable of incorporating individual prior knowledge sources separately, how to combine multiple overlapping, arbitrary knowledge sources still remains a major challenge. This can hardly be addressed by modifying model architectures because of the lack of interpretability in NMT and the inability of neural networks to model arbitrary knowledge sources. Although modifying training objectives to include additional knowledge sources as additive terms can partially alleviate this problem, these terms have been restricted to a limited number of simple constraints (Cohn et al., 2016; Feng et al., 2016) and are incapable of combining arbitrary knowledge sources.
Therefore, it is important to develop a new framework for integrating arbitrary prior knowledge sources into NMT.

Posterior Regularization
Ganchev et al. (2010) propose posterior regularization for incorporating indirect supervision via constraints on posterior distributions of structured latent-variable models. The basic idea is to penalize the log-likelihood of a neural translation model with the KL divergence between a desired distribution that incorporates prior knowledge and the model posteriors. The posterior regularized likelihood is defined as
$$F(\theta, q) = \lambda_1 L(\theta) - \lambda_2 \sum_{n=1}^{N} \min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\mathbf{y}) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big),$$
where $L(\theta)$ is the log-likelihood of the training set, $\lambda_1$ and $\lambda_2$ are hyper-parameters to balance the preference between likelihood and posterior regularization, and $\mathcal{Q}$ is a set of constrained posteriors:
$$\mathcal{Q} = \big\{ q(\mathbf{y}) : \mathbb{E}_{q}[\phi(\mathbf{x}, \mathbf{y})] \le \mathbf{b} \big\},$$
where $\phi(\mathbf{x}, \mathbf{y})$ is a constraint feature and $\mathbf{b}$ is the bound on constraint feature expectations. Ganchev et al. (2010) use constraint features to encode structural bias and define the set of valid distributions with respect to the expectations of constraint features to facilitate inference.
As maximizing $F(\theta, q)$ involves minimizing the KL divergence, Ganchev et al. (2010) present a minorization-maximization algorithm akin to EM at sentence level:
$$\text{E}: \quad q^{(t+1)} = \mathop{\mathrm{argmin}}_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\mathbf{y}) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta^{(t)})\big)$$
$$\text{M}: \quad \theta^{(t+1)} = \mathop{\mathrm{argmax}}_{\theta} \mathbb{E}_{q^{(t+1)}}\big[\log P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big]$$
However, directly applying posterior regularization to neural machine translation faces a major difficulty: it is hard to specify the hyper-parameter $\mathbf{b}$ to effectively bound the expectation of features, which are usually real-valued in translation (Och and Ney, 2002; Koehn et al., 2003; Chiang, 2005). For example, the coverage penalty constraint proves to be an essential feature for controlling the length of a translation in NMT. As the value of the coverage penalty varies significantly over different sentences, it is difficult to set an appropriate bound for all sentences in the training data. In addition, the minorization-maximization algorithm involves an additional step to find $q^{(t+1)}$ as compared with standard NMT, which increases training time significantly.

Posterior Regularization for Neural Machine Translation

Modeling
In this work, we propose to adapt posterior regularization (Ganchev et al., 2010) to neural machine translation. The major difference is that we represent the desired distribution as a log-linear model (Och and Ney, 2002) rather than a constrained posterior set as described in Ganchev et al. (2010):
$$J(\theta, \gamma) = \lambda_1 L(\theta) - \lambda_2 \sum_{n=1}^{N} \mathrm{KL}\big(Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big), \tag{7}$$
where $L(\theta)$ is the log-likelihood of the training set and the desired distribution that encodes prior knowledge is defined as
$$Q(\mathbf{y} \mid \mathbf{x}; \gamma) = \frac{\exp\big(\gamma \cdot \phi(\mathbf{x}, \mathbf{y})\big)}{\sum_{\mathbf{y}'} \exp\big(\gamma \cdot \phi(\mathbf{x}, \mathbf{y}')\big)}.$$
As compared to previous work on integrating prior knowledge into NMT (Cohn et al., 2016; Tang et al., 2016), our approach provides a general framework for combining arbitrary knowledge sources. This is because log-linear models offer sufficient flexibility to represent arbitrary prior knowledge sources as features. We tackle the representation discrepancy problem by using the KL divergence to associate the Q distribution, which encodes discrete representations of prior knowledge, with neural models that use continuous representations learned from data. Another advantage of our approach is its transparency to model architectures. In principle, our approach can be applied to any neural model for natural language processing.
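A minimal sketch of the modeling idea for a single source sentence follows. It assumes a small explicit list of candidate translations stands in for the full space, and that the objective subtracts a weighted KL term from a weighted log-likelihood, as the prose describes. All function names, feature vectors, and numeric values are hypothetical.

```python
import math

def log_linear_q(features, gamma):
    """Desired distribution: Q(y|x) proportional to exp(gamma . phi(x, y))."""
    scores = [sum(g * f for g, f in zip(gamma, phi)) for phi in features]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(q, p):
    """KL(q || p) for two distributions over the same candidate list."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def regularized_objective(log_likelihood, q, p, lambda1, lambda2):
    """Weighted log-likelihood minus the weighted KL penalty (one sentence)."""
    return lambda1 * log_likelihood - lambda2 * kl_divergence(q, p)

# Three candidate translations, each with two toy feature values.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
gamma = [1.0, 0.5]              # feature weights (toy values)
q = log_linear_q(features, gamma)
p = [0.2, 0.3, 0.5]             # model probabilities of the same candidates (toy values)
print(q, regularized_objective(-3.0, q, p, lambda1=1.0, lambda2=0.5))
```

Because the penalty vanishes only when the model distribution matches the feature-based distribution, candidates favored by the features pull probability mass toward themselves during training.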
Our approach also differs from the original version of posterior regularization (Ganchev et al., 2010) in the definition of desired distribution. We resort to log-linear models (Och and Ney, 2002) to incorporate features that have proven effective in SMT. Another benefit of using log-linear models is the differentiability of our training objective (see Eq. (7)). It is easy to leverage standard stochastic gradient descent algorithms to optimize model parameters (Section 3.3).

Feature Design
In this section, we introduce how to design features to encode prior knowledge in the desired distribution.
Note that not all features in SMT can be adopted in our framework. This is because features in SMT are defined on latent structures such as phrase pairs and synchronous CFG rules, which are not accessible to the decoding process of NMT. Fortunately, we can still leverage internal information in neural models that is linguistically meaningful, such as the attention matrix a (Bahdanau et al., 2015).
We will introduce a number of features used in our experiments as follows.

Bilingual Dictionary
It is natural to leverage a bilingual dictionary D to improve neural machine translation. Arthur et al. (2016) propose to incorporate discrete translation lexicons into NMT by using the attention vector to select lexical probabilities on which to be focused.
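In our work, for each entry $\langle x, y \rangle \in D$ in the dictionary, a bilingual dictionary (BD) feature is defined at the sentence level:
$$\phi_{\langle x, y \rangle}(\mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } x \in \mathbf{x} \wedge y \in \mathbf{y} \\ 0 & \text{otherwise} \end{cases}$$
Note that the number of bilingual dictionary features depends on the vocabulary of the neural translation model. Entries containing out-of-vocabulary words have to be discarded.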
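A sentence-level dictionary feature can be sketched as follows, assuming each feature is a 0/1 indicator that fires when the source word of an entry occurs in the source sentence and its translation occurs in the candidate translation. The function name and the toy dictionary entries are hypothetical.

```python
def bd_features(source_words, target_words, dictionary):
    """Return one indicator feature per dictionary entry (src_word, tgt_word)."""
    src, tgt = set(source_words), set(target_words)
    return {
        entry: 1.0 if entry[0] in src and entry[1] in tgt else 0.0
        for entry in dictionary
    }

# Toy dictionary and sentence pair (illustrative only).
dictionary = [("yuangong", "staff"), ("fenzhan", "fighting"), ("hangban", "flight")]
source = "shanghai jichang jituan yuangong yinglai le zuihou yige hangban".split()
target = "the team of shanghai airport has ushered in the last flight".split()
print(bd_features(source, target, dictionary))
```

In this toy pair only the ("hangban", "flight") entry fires: "yuangong" is in the source but "staff" is missing from the translation, so a model rewarded by this feature is encouraged to produce the dictionary translation.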

Phrase Table
Phrases, which are sequences of consecutive words, are capable of memorizing local context to deal with word ordering within phrases and translation of short idioms, word insertions or deletions (Koehn et al., 2003; Chiang, 2005). As a result, phrase tables that specify phrase-level correspondences between the source and target languages also prove to be an effective knowledge source in NMT (Tang et al., 2016). Similar to the bilingual dictionary features, we define a phrase table (PT) feature for each entry $\langle \tilde{x}, \tilde{y} \rangle$ in a phrase table $P$:
$$\phi_{\langle \tilde{x}, \tilde{y} \rangle}(\mathbf{x}, \mathbf{y}) = \begin{cases} 1 & \text{if } \tilde{x} \in \mathbf{x} \wedge \tilde{y} \in \mathbf{y} \\ 0 & \text{otherwise} \end{cases}$$
The number of phrase table features also depends on the vocabulary of the neural translation model.
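The phrase table feature differs from the dictionary feature in that a phrase must appear as a run of consecutive words. The sketch below assumes an indicator feature per phrase pair. The helper names and the phrase pair are hypothetical, chosen only to illustrate contiguous matching.

```python
def contains_phrase(words, phrase):
    """True if `phrase` occurs in `words` as a contiguous subsequence."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

def pt_feature(source_words, target_words, src_phrase, tgt_phrase):
    """Indicator feature for one phrase pair (src_phrase, tgt_phrase)."""
    matched = (contains_phrase(source_words, src_phrase)
               and contains_phrase(target_words, tgt_phrase))
    return 1.0 if matched else 0.0

source = "suiran tonghuopengzhang weilai ji ge yue reng jiang weizhi".split()
good = "although inflation will remain in the next few months".split()
bad = "although inflation has been maintained for months few".split()
entry = (["ji", "ge", "yue"], ["few", "months"])  # hypothetical phrase pair
print(pt_feature(source, good, *entry), pt_feature(source, bad, *entry))
```

The second candidate contains both target words but in the wrong order, so the feature does not fire, which is what lets phrase features reward locally coherent word order.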

Coverage Penalty
To overcome the over-translation and under-translation problems widely observed in NMT, a number of authors have proposed to model the fertility (Brown et al., 1993) and coverage constraint (Koehn et al., 2003) to improve the adequacy of translation (Cohn et al., 2016; Feng et al., 2016; Mi et al., 2016). Following previous work, we define a coverage penalty (CP) feature to penalize source words with a lower sum of attention weights:
$$\phi_{\mathrm{CP}}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{|\mathbf{x}|} \log\Big(\min\Big(\sum_{j=1}^{|\mathbf{y}|} a_{i,j},\, 1.0\Big)\Big), \tag{11}$$
where $a_{i,j}$ is the attention probability of the $j$-th target word on the $i$-th source word. Note that the value of the coverage penalty feature varies significantly over sentences of different lengths.
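One way to compute such a penalty is sketched below. It assumes the feature sums, over source words, the log of the total attention each word receives, capped at 1.0. The small floor `eps` that avoids taking the log of zero is an addition for this sketch, not something stated in the text.

```python
import math

def coverage_penalty(attention, eps=1e-10):
    """attention[i][j]: attention probability of target word j on source word i.

    Source words whose total attention is below 1.0 contribute a negative term;
    fully covered words contribute zero.
    """
    total = 0.0
    for row in attention:
        covered = min(sum(row), 1.0)
        total += math.log(max(covered, eps))  # eps floor is an assumption
    return total

# Three source words, two target words; each column sums to 1.
attention = [
    [0.5, 0.5],    # well covered: total attention 1.0
    [0.25, 0.25],  # under-covered: total attention 0.5
    [0.25, 0.25],  # under-covered: total attention 0.5
]
print(coverage_penalty(attention))
```

The value grows more negative with every under-covered source word, so it depends strongly on sentence length, which is why a single bound on its expectation is hard to choose.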

Length Ratio
Controlling the length of translations is very important in NMT as neural models tend to generate short translations for long sentences, which deteriorates the translation performance of NMT for long sentences as compared with SMT.
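Therefore, we define the length ratio (LR) feature to encourage the length of a translation to fall in a reasonable range:
$$\phi_{\mathrm{LR}}(\mathbf{x}, \mathbf{y}) = \begin{cases} \dfrac{\beta |\mathbf{x}|}{|\mathbf{y}|} & \text{if } \beta |\mathbf{x}| < |\mathbf{y}| \\[2mm] \dfrac{|\mathbf{y}|}{\beta |\mathbf{x}|} & \text{otherwise} \end{cases}$$
where $\beta$ is a hyper-parameter for penalizing too long or too short translations. For example, to convey the same meaning, an English sentence is usually about 1.2 times longer than a Chinese sentence. As a result, we can set $\beta = 1.2$. If the length of a Chinese sentence $|\mathbf{x}|$ is 10 and the length of an English sentence $|\mathbf{y}|$ is 12, then $\phi_{\mathrm{LR}}(\mathbf{x}, \mathbf{y}) = 1$. If the translation is too long (e.g., $|\mathbf{y}| = 100$), then the feature value is 0.12. If the translation is too short (e.g., $|\mathbf{y}| = 6$), the feature value is 0.5.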
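The worked numbers above can be checked with a short sketch. The piecewise form used here is inferred from those three examples, and the function name is hypothetical. Both lengths are assumed to be positive.

```python
def length_ratio(src_len, tgt_len, beta):
    """Ratio between the expected and actual target length, always in (0, 1]."""
    expected = beta * src_len
    if expected < tgt_len:
        return expected / tgt_len   # translation longer than expected
    return tgt_len / expected       # translation shorter than (or equal to) expected

for tgt_len in (12, 100, 6):
    print(tgt_len, round(length_ratio(10, tgt_len, beta=1.2), 4))
```

The feature peaks at 1 when the translation has exactly the expected length and decays symmetrically in ratio terms on either side, so both over-long and over-short outputs are penalized.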

Training
In training, our goal is to find a set of model parameters that maximizes the posterior regularized likelihood:
$$\hat{\theta}, \hat{\gamma} = \mathop{\mathrm{argmax}}_{\theta, \gamma} \big\{ J(\theta, \gamma) \big\}.$$
Note that unlike the original version of posterior regularization (Ganchev et al., 2010) that relies on a minorization-maximization algorithm to optimize model parameters, our training objective is differentiable with respect to model parameters. Therefore, it is easy to use standard stochastic gradient descent algorithms to train our model.
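However, a major difficulty in calculating gradients is that the algorithm needs to sum over all candidate translations in an exponential search space for the KL divergence. For example, the partial derivative of $J(\theta, \gamma)$ with respect to $\gamma$ is given by
$$\frac{\partial J(\theta, \gamma)}{\partial \gamma} = -\lambda_2 \sum_{n=1}^{N} \frac{\partial}{\partial \gamma} \mathrm{KL}\big(Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big).$$
The KL divergence is defined as
$$\mathrm{KL}\big(Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big) = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(n)})} Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \log \frac{Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma)}{P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)},$$
where $\mathcal{Y}(\mathbf{x}^{(n)})$ is the set of all possible candidate translations for the source sentence $\mathbf{x}^{(n)}$.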
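To alleviate this problem, we follow previous work and approximate the full search space $\mathcal{Y}(\mathbf{x}^{(n)})$ with a sampled sub-space $\mathcal{S}(\mathbf{x}^{(n)})$. Therefore, the KL divergence can be approximated as
$$\mathrm{KL}\big(Q(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \,\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)\big) \approx \sum_{\mathbf{y} \in \mathcal{S}(\mathbf{x}^{(n)})} \tilde{Q}(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) \log \frac{\tilde{Q}(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma)}{\tilde{P}(\mathbf{y} \mid \mathbf{x}^{(n)}; \theta)}, \tag{16}$$
where $\tilde{P}$ and $\tilde{Q}$ denote the P and Q distributions approximated on the sub-space. Note that the Q distribution is also approximated on the sub-space:
$$\tilde{Q}(\mathbf{y} \mid \mathbf{x}^{(n)}; \gamma) = \frac{\exp\big(\gamma \cdot \phi(\mathbf{x}^{(n)}, \mathbf{y})\big)}{\sum_{\mathbf{y}' \in \mathcal{S}(\mathbf{x}^{(n)})} \exp\big(\gamma \cdot \phi(\mathbf{x}^{(n)}, \mathbf{y}')\big)}.$$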
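The sampled approximation can be sketched as below. Both distributions are renormalized over the sampled candidates only. Scaling the model log-probabilities by an exponent `alpha` before renormalizing is an assumption of this sketch, borrowed from common practice in sampling-based training; the text does not spell out how the model distribution is renormalized. All names are hypothetical.

```python
import math

def log_normalize(log_weights):
    """Normalize unnormalized log-weights into log-probabilities (log-sum-exp)."""
    m = max(log_weights)
    log_z = m + math.log(sum(math.exp(w - m) for w in log_weights))
    return [w - log_z for w in log_weights]

def approximate_kl(sample_log_probs, sample_features, gamma, alpha=1.0):
    """Approximate KL(Q || P) on a sampled sub-space of candidate translations.

    sample_log_probs[k]: model log-probability of the k-th sampled candidate.
    sample_features[k]:  feature vector phi(x, y_k) of the same candidate.
    alpha:               sharpness exponent for the model distribution (assumption).
    """
    log_p = log_normalize([alpha * lp for lp in sample_log_probs])
    log_q = log_normalize(
        [sum(g * f for g, f in zip(gamma, phi)) for phi in sample_features]
    )
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

# Three sampled candidates with toy log-probabilities and one feature each.
print(approximate_kl([-1.0, -2.0, -3.0], [[1.0], [0.5], [0.0]], gamma=[2.0], alpha=0.2))
```

Restricting the sum to a sample keeps the cost linear in the number of sampled candidates rather than exponential in sentence length, at the price of a biased estimate.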

Search
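Given learned model parameters $\hat{\theta}$ and $\hat{\gamma}$, the decision rule for translating an unseen source sentence $\mathbf{x}$ is given by
$$\hat{\mathbf{y}} = \mathop{\mathrm{argmax}}_{\mathbf{y}} \big\{ \log P(\mathbf{y} \mid \mathbf{x}; \hat{\theta}) \big\}.$$
The search process can be factorized at the word level:
$$\hat{y}_j = \mathop{\mathrm{argmax}}_{y \in \mathcal{V}_y} \big\{ \log P(y \mid \mathbf{x}, \hat{\mathbf{y}}_{<j}; \hat{\theta}) \big\},$$
where $\mathcal{V}_y$ is the target language vocabulary. Although this decision rule shares the same efficiency and simplicity with standard NMT (Bahdanau et al., 2015), it does not involve prior knowledge in decoding. Previous studies reveal that incorporating prior knowledge in decoding also significantly boosts translation performance (Arthur et al., 2016).
As directly incorporating prior knowledge into the decoding process of NMT depends on both model structure and the locality of features, we resort to a coarse-to-fine approach instead to keep the architecture transparency of our approach. Given a source sentence $\mathbf{x}$ in the test set, we first use the neural translation model $P(\mathbf{y} \mid \mathbf{x}; \hat{\theta})$ to generate a k-best list of candidate translations $C(\mathbf{x})$. Then, the algorithm decides on the most probable candidate translation using the following decision rule:
$$\hat{\mathbf{y}} = \mathop{\mathrm{argmax}}_{\mathbf{y} \in C(\mathbf{x})} \big\{ \log P(\mathbf{y} \mid \mathbf{x}; \hat{\theta}) + \hat{\gamma} \cdot \phi(\mathbf{x}, \mathbf{y}) \big\}. \tag{21}$$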
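The coarse-to-fine step amounts to rescoring a k-best list with the sum of the model log-probability and the weighted feature values. A minimal sketch, with a hypothetical function name and toy candidates:

```python
def rerank(candidates, gamma):
    """Pick the best candidate from a k-best list.

    candidates: list of (translation, model_log_prob, feature_vector) tuples.
    gamma:      learned feature weights, same length as each feature vector.
    """
    def score(candidate):
        _, log_prob, phi = candidate
        return log_prob + sum(g * f for g, f in zip(gamma, phi))
    return max(candidates, key=score)[0]

# Toy 2-best list with a single length-ratio-style feature per candidate.
k_best = [
    ("short translation", -1.0, [0.5]),
    ("complete translation", -1.2, [1.0]),
]
print(rerank(k_best, gamma=[1.0]))
```

With a zero weight vector the rule falls back to the model's own ranking, so the features only change the outcome when they disagree with the model strongly enough.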

Setup
We evaluate our approach on Chinese-English translation. The evaluation metric is case-insensitive BLEU calculated by the multi-bleu.perl script. Our training set consists of 1.25M sentence pairs with 27.9M Chinese words and 34.5M English words. We use the NIST 2002 dataset as the validation set and the NIST 2003, 2004, 2005, 2006, and 2008 datasets as test sets.
In the experiments, we compare our approach with the following baseline approaches. RNNSEARCH is a standard attention-based NMT system (Bahdanau et al., 2015) that does not incorporate prior knowledge. CPR extends RNNSEARCH by introducing coverage penalty refinement (Eq. (11)) in decoding. POSTREG extends RNNSEARCH with posterior regularization (Ganchev et al., 2010), which uses constraint features to represent prior knowledge and a constrained posterior set to denote the desired distribution. Note that POSTREG cannot use the CP feature (Section 3.2.3) because it is hard to bound the feature value appropriately. On top of RNNSEARCH, our approach also exploits posterior regularization to incorporate prior knowledge but uses a log-linear model to denote the desired distribution. All results of this work are significantly better than RNNSEARCH (p < 0.01).
For RNNSEARCH, we use an in-house attention-based NMT system that achieves translation performance comparable to GROUNDHOG (Bahdanau et al., 2015), which serves as a baseline approach in our experiments. We limit vocabulary size to 30K for both languages. The word embedding dimension is set to 620. The dimension of the hidden layer is set to 1,000. In training, the batch size is set to 80. We use the AdaDelta algorithm (Zeiler, 2012) for optimizing model parameters. In decoding, the beam size is set to 10.
For CPR, we follow previous work and incorporate the coverage penalty into the beam search algorithm of RNNSEARCH.
For POSTREG, we adapt the original version of posterior regularization (Ganchev et al., 2010) to NMT on top of RNNSEARCH. Following Ganchev et al. (2010), we use a ten-step projected gradient descent algorithm to search for an approximate desired distribution in the E step and a one-step gradient descent for the M step.
Our approach extends RNNSEARCH by incorporating prior knowledge. For each source sentence, we sample 80 candidate translations to approximate the P and Q distributions. The hyper-parameter α is set to 0.2. The batch size is 1. The hyper-parameters λ1 and λ2 are set to 8 × 10^-5 and 2.5 × 10^-4, respectively. Note that they not only balance the preference between likelihood and posterior regularization, but also keep the values of gradients in a reasonable range for optimization.
We construct the bilingual dictionary and phrase table automatically. First, we run the statistical machine translation system MOSES (Koehn and Hoang, 2007) to obtain a probabilistic bilingual dictionary and phrase table. For the bilingual dictionary, we retain entries with probabilities higher than 0.1 in both source-to-target and target-to-source directions. For the phrase table, we first remove phrase pairs that occur fewer than 10 times and then retain entries with probabilities higher than 0.5 in both directions. As a result, both the bilingual dictionary and the phrase table contain high-quality translation correspondences. We estimate the length ratio on Chinese-English data and set the hyper-parameter β to 1.236. By default, both POSTREG and our approach use reranking to search for the most probable translations (Section 3.4).

Main Results
Table 1 shows the BLEU scores obtained by RNNSEARCH, POSTREG, and our approach on the Chinese-English datasets.
We find POSTREG achieves significant improvements over RNNSEARCH by adding features that encode prior knowledge. The most effective single feature for POSTREG seems to be the length ratio (LR) feature, suggesting that it is important for NMT to control the length of translation to improve translation quality. Note that POSTREG is unable to include the coverage penalty (CP) feature because the feature value varies significantly over different sentences. It is hard to specify an appropriate bound b for constraining the expected feature value. We observe that a loose bound often makes the training process very unstable and causes it to fail to converge. Combining features obtains further modest improvements.
Our approach outperforms both RNNSEARCH and POSTREG significantly. The bilingual dictionary (BD) feature turns out to contribute the most. Compared with CPR, which imposes the coverage penalty during decoding, our approach using a single CP feature obtains a significant improvement (i.e., 30.76 over 29.72), suggesting that incorporating prior knowledge sources in modeling might be more beneficial than in decoding.
We find that combining features only results in modest improvements for our approach. One possible reason is that the bilingual dictionary and phrase table features overlap on single word pairs.
Table 2 shows the effect of reranking on translation quality. We find that using prior knowledge features to rescore the k-best list produced by the neural translation model usually leads to improvements. This finding confirms that adding prior knowledge is beneficial for NMT, either in the training or decoding process.

Training Speed
Initialized with the best RNNSEARCH model trained for 300K iterations, our model converges after about 100K iterations. For each iteration, our approach is 1.5 times slower than RNNSEARCH. On a single Tesla M40 GPU, it takes four days to train the RNNSEARCH model and three extra days to train our model.

Example Translations
Table 3 gives four examples to demonstrate the benefits of adding features.

Table 3: Example translations that demonstrate the effect of adding features.

Example 1
Source: lijing liang tian yu bingxue de fenzhan , 31ri shenye 23 shi 50 fen , shanghai jichang jituan yuangong yinglai le 2004nian de zuihou yige hangban .
Reference: after fighting with ice and snow for two days , staff members of shanghai airport group welcomed the last flight of 2004 at 23 : 50pm on the 31st .
RNNSEARCH: after a two -day and two -day journey , the team of shanghai 's airport in shanghai has ushered in the last flight in 2004 .
+ BD: after two days and nights fighting with ice and snow , the shanghai airport group 's staff welcomed the last flight in 2004 .

Example 2
Source: suiran tonghuopengzhang weilai ji ge yue reng jiang weizhi zai baifenzhier yishang , buguo niandi zhiqian keneng jiangdi .
Reference: although inflation will remain above 2 % for the coming few months , it may decline by the end of the year .
RNNSEARCH: although inflation has been maintained for more than two months from the year before the end of the year , it may be lower .
+ PT: although inflation will remain at more than 2 percent in the next few months , it may be lowered before the end of the year .

Example 3
Source: qian ji tian ta ganggang chuyuan , jintian jianchi lai yu lao pengyou daobie .
Reference: just discharged from the hospital a few days ago , he insisted on coming to say farewell to his old friend today .
RNNSEARCH: during the previous few days , he had just been given treatment to the old friends .
+ CP: during the previous few days , he had just been discharged from the hospital , and he insisted on goodbye to his old friend today .

Example 4
Source: ( guoji ) yiselie fuzongli fouren jihua kuojian gelan gaodi dingjudian
Reference: ( international ) israeli deputy prime minister denied plans to expand golan heights settlements
RNNSEARCH: ( world ) israeli deputy prime minister denies the plan to expand the golan heights in the golan heights
+ LR: ( international ) israeli deputy prime minister denies planning to expand golan heights

In the first example, source words "fenzhan" (fighting), "yuangong" (staff), and "yinglai" (welcomed) are untranslated in the output of RNNSEARCH. Adding the bilingual dictionary (BD) feature encourages the model to translate these words if they occur in the dictionary.
In the second example, while RNNSEARCH fails to capture phrase cohesion, adding the phrase table (PT) feature is beneficial for translating short idioms, word insertions or deletions that are sensitive to local context.
In the third example, RNNSEARCH tends to omit many source content words such as "chuyuan" (discharged from the hospital), "jianchi" (insisted on), and "daobie" (say farewell). The coverage penalty (CP) feature helps to alleviate the word omission problem.
In the fourth example, the translation produced by RNNSEARCH is too long and "the golan heights" occurs twice. The length ratio (LR) feature is capable of controlling the sentence length in a reasonable range.

Related Work
Our work is directly inspired by posterior regularization (Ganchev et al., 2010). The major difference is that we use a log-linear model to represent the desired distribution rather than a constrained posterior set. Using log-linear models not only enables our approach to incorporate arbitrary knowledge sources as real-valued features, but also keeps the training objective differentiable, so that the desired distribution can be trained jointly with neural translation models efficiently.
Our work is closely related to recent work on injecting prior knowledge into NMT (Arthur et al., 2016; Cohn et al., 2016; Tang et al., 2016; Feng et al., 2016). The major difference is that our approach aims to provide a general framework for incorporating arbitrary prior knowledge sources while keeping the neural translation model unchanged. Previous work has also proposed to combine the strengths of neural networks in learning representations and of log-linear models in encoding prior knowledge, but it treats neural translation models as a feature in the log-linear model. In contrast, we connect the two models via KL divergence to keep our approach transparent to model architectures. This enables our approach to be easily applied to other neural models in NLP.

Conclusion
We have presented a general framework for incorporating prior knowledge into end-to-end neural machine translation based on posterior regularization (Ganchev et al., 2010). The basic idea is to guide NMT models towards desired behavior using a log-linear model that encodes prior knowledge. Experiments show that incorporating prior knowledge leads to significant improvements over both standard NMT and posterior regularization using constrained posterior sets.