Reference Network for Neural Machine Translation

Neural Machine Translation (NMT) has achieved notable success in recent years. However, such a framework usually generates translations in isolation. In contrast, human translators often refer to reference data, either rephrasing intricate sentence fragments with common terms in the source language, or directly accessing the golden translation. In this paper, we propose a Reference Network to incorporate the referring process into the translation decoding of NMT. To construct a reference book, an intuitive way is to store the detailed translation history in extra memory, which is computationally expensive. Instead, we employ Local Coordinate Coding (LCC) to obtain global context vectors containing monolingual and bilingual contextual information for NMT decoding. Experimental results on Chinese-English and English-German tasks demonstrate that our proposed model effectively improves translation quality at a lightweight computational cost.


Introduction
Neural Machine Translation (NMT) has enjoyed impressive success in most large-scale translation tasks (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014). The typical NMT model to date is a single end-to-end trained deep neural network that encodes the source sentence into a fixed-length vector and generates the words in the target sentence sequentially. The alignment relationship between the source and target sentences is learned by the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015).
Though the framework has achieved significant success, one critical concern is that NMT generates translations in isolation, which leads to translation inconsistency and ambiguity arising from a single source sentence. Recently, there have been a few attempts to model the semantic information across sentences. The basic ideas are to store a handful of previous source or target sentences with context vectors (Jean et al., 2017; Wang et al., 2017a) or memory components (Maruf and Haffari, 2018). However, these methods have several limitations. First, the very short view of the previous sentences (usually one or two sentences) is not sufficient to capture long-term dependencies across paragraphs, and storing the detailed translation history is computationally expensive. Second, in real-world scenarios such as Google Translate, the input to an MT application is often isolated sentences, where no cross-sentence contexts are provided. Moreover, translations generated by such document-level NMT models are unstable, affected by the sentences surrounding the current one to translate.

* Corresponding author: Jianling Sun.
To address these limitations, we model the semantic information across sentences by mimicking the human translation process. In real scenarios, there will always be sentences or fragments whose meaning the translator understands but cannot directly write down translations for. The obstacle could be an unfamiliar collocation, descriptions in specific language habits, or slang. The usual solutions for humans are: (1) paraphrasing the sentence in another way, with simpler and more colloquial terms in the source language, and (2) directly referring to the standard translations of the intricate sentence fragments. For example, in Table 1, the Chinese word "zaiyu" is not a common expression. A reference can either provide simple Chinese terms such as "daizhe rongyu" or directly offer the corresponding English translation "with honor". Therefore, a good-quality reference book which covers various translation scenes can definitely improve the performance of human translators.

Table 1: An example sentence pair.
source: canjia dongaohui de faguo yundongyuan zaiyu fanhui bali.
translation: French athletes participating in winter olympics returned to paris with honors.

To be specific, the motivation of this work can be summarized as two aspects corresponding to the two kinds of human reference processes. First, we aim to provide the machine translator with a reference during decoding, which contains all possible source sentence fragments that are semantically similar to the current one. If the system finds it hard to translate a source fragment, it can instead translate the fragments in the reference. Second, we intend to offer the oracle translations of the current sentence fragments to translate.
In this paper, we propose a novel model, namely the Reference Network, that incorporates the referring process into the translation decoding of NMT. Instead of storing the detailed sentences or translation history, we propose to generate representations containing global monolingual and bilingual contextual information with Local Coordinate Coding (LCC) (Yu et al., 2009). Specifically, for solution (1), the hidden states of the NMT encoder are coded by a linear combination of a set of anchor points in an unsupervised manner. The anchors are capable of covering the entire latent space of the source language seamlessly. For solution (2), we employ local codings to approximate the mapping from source and target contexts to the current target word with a supervised regression function. The local coding is then fed to the decoder to modify the update of the decoder hidden state. In this way, translation decoding can be improved by offering the representation of a common paraphrase (Figure 1) or a golden target translation (Figure 2).
We conduct experiments on NIST Chinese-English (Zh-En) and WMT English-German (En-De) translation tasks. The experimental results indicate that the proposed method can effectively exploit the global information and improve the translation quality. The two proposed models significantly outperform the strong NMT baselines while adding only 9.3% and 19.6% parameters respectively.

Neural Machine Translation
Our model is built on the RNN-based NMT (Bahdanau et al., 2015). However, since the recurrent architecture is not necessary for our approach, the idea can also be applied to ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017); we leave this for future work. Formally, let x = (x_1, ..., x_m) be a given source sentence and y = (y_1, ..., y_T) the corresponding target sentence. NMT generates the target words sequentially by maximizing the conditional log-likelihood of the translation given the source sentence:

log p(y|x) = Σ_{t=1}^{T} log p(y_t | x, y_{<t}).   (1)

At each timestep, the generation probability is computed as

p(y_t | x, y_{<t}) = softmax(g(e(y_{t−1}), s_t, c_t)),   (2)

where g is a transformation function that outputs a vocabulary-sized vector, e(y_{t−1}) is the embedding of the previous target word y_{t−1}, c_t is the source context vector obtained by the attention mechanism, and s_t is the t-th hidden state of the NMT decoder, computed as

s_t = f_d(s_{t−1}, e(y_{t−1}), c_t),   (3)

where f_d is a nonlinear activation. The source context c_t is typically a weighted sum of the encoder hidden states:

c_t = Σ_{i=1}^{m} α_{ti} h_i,   (4)

where the attention score α_{ti} is the alignment weight between the i-th source word x_i and the t-th target word y_t:

α_{ti} = softmax(v_α^T tanh(W_α s_{t−1} + U_α h_i)),   (5)

where W_α, U_α and v_α are trainable matrices or vectors, and h_i is the annotation of x_i computed by the NMT encoder.

Figure 1: Framework of NMT with M-RefNet. x^i represents the i-th source sentence in the training corpus and |x^i| is the length of the sentence. The global context vector c^G_t can be regarded as a paraphrase of the current source context c_t.
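As an illustration, the attention computation in Eqs. (4)-(5) can be sketched in a few lines of numpy. The parameter names (W_a, U_a, v_a) follow the notation above, but this is a simplified single-timestep sketch with illustrative shapes, not the actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Compute the source context c_t as an attention-weighted sum of the
    encoder annotations H (one row per source word), as in Eqs. (4)-(5)."""
    # score_i = v_a^T tanh(W_a s_{t-1} + U_a h_i), computed for all i at once
    scores = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (m,)
    alpha = softmax(scores)                              # alignment weights
    c_t = alpha @ H                                      # weighted sum of annotations
    return c_t, alpha
```

Given random parameters, the returned weights form a distribution over source positions and c_t lives in the annotation space.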
According to the above formulations, conventional NMT models translate sentences independently. However, human translators usually tend to seek reference materials when in trouble. Motivated by such common human behaviors, we propose the Reference Network to provide global information as a reference book in two ways. First, the model utilizes all source hidden states to paraphrase the current source sentence. Second, the model directly provides the target word ỹ_t according to the rest of the translation samples in the training corpus. Since it is impossible to store all the information directly, we leverage local coordinate coding (LCC) to compress the semantics into a latent manifold.

Local Coordinate Coding
With the assumption that data usually lies on a lower-dimensional manifold of the input space, the manifold approximation of a high-dimensional input x can be defined as a linear combination of surrounding anchor points:

x ≈ γ(x) = Σ_{v∈C} γ_v(x) v,

where v is an anchor point from the anchor set C and γ_v(x) is the corresponding weight. According to these definitions, it is proved in (Yu et al., 2009) that if the anchor points are localized enough, any (l_α, l_β)-Lipschitz smooth function f(x) defined on a lower-dimensional manifold M can be globally approximated by a linear combination of the function values of the set of anchors C:

f(x) ≈ Σ_{v∈C} γ_v(x) f(v),

with the upper bound of the approximation error:

|f(x) − Σ_{v∈C} γ_v(x) f(v)| ≤ l_α ‖x − γ(x)‖ + l_β Σ_{v∈C} |γ_v(x)| ‖v − γ(x)‖².


Reference Network

In this section, we present our proposed Reference Network (RefNet).
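The LCC approximation can be illustrated with a minimal numpy sketch. Note that this uses a simplified, unconstrained least-squares coding over the k nearest anchors rather than the exact formulation of Yu et al. (2009); it only shows how a function value is recovered from the function values at the anchors.

```python
import numpy as np

def lcc_weights(x, anchors, k=3):
    """Coefficients gamma_v reconstructing x from its k nearest anchors
    (a simplified least-squares variant of LCC coding)."""
    dists = np.linalg.norm(anchors - x, axis=1)
    idx = np.argsort(dists)[:k]
    V = anchors[idx]                                  # (k, d) local anchor set
    # solve min || sum_j gamma_j v_j - x ||^2 over the k local coefficients
    gamma, *_ = np.linalg.lstsq(V.T, x, rcond=None)
    w = np.zeros(len(anchors))
    w[idx] = gamma
    return w

def lcc_approx(f, x, anchors, k=3):
    """Approximate f(x) by the LCC surrogate sum_v gamma_v(x) f(v)."""
    w = lcc_weights(x, anchors, k)
    return sum(wv * f(v) for wv, v in zip(w, anchors) if wv != 0.0)
```

For a linear function, the surrogate is exact whenever the selected anchors span the input, which mirrors the intuition behind the Lipschitz bound above.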

Overview
We propose two models which explore the global information from the training data in different manners, as illustrated by Figure 1 and Figure 2. The monolingual reference network (M-RefNet) provides a global source context vector to paraphrase the current context c_t based on all other source sentences. To be specific, we train a set of unsupervised anchors as the bases of the semantic space of source contexts, and each source sentence in the training corpus can be represented by a weighted sum of the anchors.
The bilingual reference network (B-RefNet) generates a referable target embedding according to all sentence pairs in the training corpus to guide output sequence generation. Concretely, we formulate the translation process as a mapping from the source and target contexts (c_t and s_{t−1}) to the current target word embedding e(y_t). B-RefNet learns this mapping with a supervised regression function derived from LCC. It should be noted that the corpus from which the reference vectors (c^G_t or f_s(q_t)) are learned can be any monolingual or bilingual data, and the generated translations are accordingly affected by the quality of that corpus. In this work, we constrain it to the training corpus for convenience and a fair comparison with the related work.

Monolingual Reference Network
In this section, we seek to improve NMT by rephrasing the source sentence. Instead of storing all source contexts, we regenerate the source contexts from a learned manifold with a combination of a fixed number of anchor points. Formally, given any source sequence x with length m in the training samples, let h = (h_1, ..., h_m) denote the hidden states generated by the NMT encoder. We first obtain the representation of the source sentence h_M via a mean-pooling operation. According to the definition of LCC, it can be assumed that

h_M ≈ Σ_j γ_j(h_M) v_j.

Here, v_j is the j-th anchor point, and the coefficient γ_j(h_M) measures the weight of anchor point v_j in the coding γ(h_M). In conventional manifold learning methods, γ_j(h_M) is generally computed with a distance measure, and to achieve localization, the coefficients corresponding to anchor points outside the neighborhood of h_M are set to zero. However, this is hard to train in a deep neural network using stochastic gradient methods. Inspired by the attention mechanism (Bahdanau et al., 2015), we propose to employ an attention layer to obtain the weights:

γ_j(h_M) = softmax(s(h_M, v_j)),

where s(·) is a score function. Here, we propose a tri-nonlinear score function which has proven especially effective in the experiments:

s(h_M, v_j) = v_s^T tanh(W_s h_M + U_s v_j + V_s (h_M • v_j)),

where W_s, U_s, V_s and v_s are trainable parameters, • is element-wise multiplication, and the dimension of any anchor point must be the same as that of h_M. To find the optimal anchor points, the localization measure of (Yu et al., 2009) is employed as the optimization objective, minimizing the reconstruction error ‖h_M − Σ_j γ_j(h_M) v_j‖² over the training samples. Since any source sentence representation h_M can be represented by a linear combination of the anchors, the trained anchor points can be regarded as the bases of the latent space of all source annotations, containing the global contextual information. Therefore, during translation decoding, we can drop the coefficients γ and rephrase the source sentence only with the anchor points.
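A sketch of the attention-based coding weights and the resulting manifold reconstruction, assuming the tri-nonlinear score form given above; all parameter shapes are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def anchor_weights(h_M, anchors, W_s, U_s, V_s, v_s):
    """gamma_j(h_M): soft LCC coefficients over the anchor set, using the
    tri-nonlinear score s(h_M, v) = v_s^T tanh(W_s h_M + U_s v + V_s (h_M * v))."""
    scores = np.array([v_s @ np.tanh(W_s @ h_M + U_s @ v + V_s @ (h_M * v))
                       for v in anchors])
    return softmax(scores)

def reconstruct(h_M, anchors, W_s, U_s, V_s, v_s):
    """Manifold approximation of h_M as the gamma-weighted sum of anchors."""
    gamma = anchor_weights(h_M, anchors, W_s, U_s, V_s, v_s)
    return gamma @ anchors
```

Because the weights are a softmax, they are always a valid convex combination, which is what makes the layer trainable with standard stochastic gradients.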
Specifically, we apply an attention mechanism between the current local contextual information and each anchor point v_j to get the global context:

c^G_t = Σ_j α^G_{tj} v_j,

where α^G_{tj} is the attention score between the current local context and the j-th anchor, computed with the tri-nonlinear score function as:

α^G_{tj} = softmax(s(c_t, v_j)).

Once the global context c^G_t is obtained, we feed it to the decoder states:

s_t = f_d(s_{t−1}, e(y_{t−1}), c_t, c^G_t),

where c_t encodes the local contextual information and c^G_t contains the global monolingual information from all source sentences in the training corpus. When the model has trouble translating some words or sentence fragments, it can refer to c^G_t to gain richer source contextual information.
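To show how the global context enters the decoder update, here is a hypothetical f_d implemented as a plain tanh recurrence over the concatenated inputs. The real model would use a gated RNN unit, so this is only a shape-level sketch.

```python
import numpy as np

def decoder_step(s_prev, e_prev, c_t, c_g, W, U, b):
    """s_t = f_d(s_{t-1}, e(y_{t-1}), c_t, c_t^G), with f_d sketched as a
    single tanh layer over the concatenation [e(y_{t-1}); c_t; c_t^G]."""
    z = np.concatenate([e_prev, c_t, c_g])
    return np.tanh(W @ s_prev + U @ z + b)
```

The only change relative to the baseline decoder of Eq. (3) is the extra c_t^G input, so the added parameter cost is a wider input projection U.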

Bilingual Reference Network
The bilingual model is proposed to improve NMT by providing a golden translation reference based on the rest of the samples in the training corpus. To be specific, once the source context c_t and target context s_{t−1} are obtained, we hope to provide the decoder with a referable prediction e(ỹ_t) of the current target word embedding e(y_t) according to the other sentence pairs in the training data.
The functionality of the NMT decoder during translation (Eq. 2 and Eq. 3) is essentially a function that maps the source context c_t, target context s_{t−1} and last target word y_{t−1} to the current target word y_t. NMT treats this as a classification problem, using tanh or other gated RNN units to implement the function. In this work, we propose a model with much stronger expressive power, which regards the problem as regression:

e(ỹ_t) = f_s(q_t) = W_{q_t} g(q_t) + b_{q_t},

where g is a transformation function that transforms q_t to an anchor-sized vector, and W_{q_t} and b_{q_t} are the weight matrix and bias vector of the regression function. The weight and bias are allowed to vary according to the input q_t, which makes the function capable of mapping each q_t to the corresponding e(y_t) precisely. However, it is impossible to store the weight and bias for every q_t computed within the training data. Therefore, we approximate the weight and bias functions using local coordinate coding as:

W_{q_t} = Σ_{v_j∈C} γ_j(q_t) W_{v_j},   b_{q_t} = Σ_{v_j∈C} γ_j(q_t) b_{v_j},

where v_j ∈ C is an anchor point, W_{v_j} and b_{v_j} are trainable parameters corresponding to v_j, and γ_j(q_t) is the weight function, computed as:

γ_j(q_t) = softmax(s(q_t, v_j)).
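The input-dependent regression can be sketched as follows: the per-input weight matrix and bias are interpolated from anchor-specific parameters with the coding weights. The score function and the transformation g are passed in as callables, since their exact forms are defined elsewhere; all names and shapes here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_regression(q, g, anchors, Ws, bs, score_fn):
    """e(y~_t) = W_q g(q) + b_q, with
    W_q = sum_j gamma_j(q) W_{v_j} and b_q = sum_j gamma_j(q) b_{v_j}."""
    gamma = softmax(np.array([score_fn(q, v) for v in anchors]))
    W_q = np.tensordot(gamma, Ws, axes=1)   # mixture of anchor weight matrices
    b_q = gamma @ bs                        # mixture of anchor bias vectors
    return W_q @ g(q) + b_q
```

When all anchor scores are equal, the mixture degenerates to the plain average of the anchor-specific parameters, which makes the behavior easy to check.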
Similar to M-RefNet, the score s(q_t, v_j) is computed with the tri-nonlinear function, in the same form as for M-RefNet but with its own parameters. Here, f_s(q_t) can be regarded as an approximation of e(y_t) based on all the sentence pairs in the training data. Therefore, we feed the function value to the decoder state to guide sentence generation:

s_t = f_d(s_{t−1}, e(y_{t−1}), c_t, f_s(q_t)).

The optimal weight matrices and anchor points are obtained by minimizing a hinge loss for each sentence pair (x, y).

Training and Inference
Stage-wise training strategies have been proven efficient by plenty of recent work when the system is relatively complicated (Maruf and Haffari, 2018). In this work, we first pre-train a standard NMT model on a set of training examples {[x^n, y^n]}_{n=1}^{N} as initialization for training the added parameters in our proposed models. Let θ = {θ_E, θ_D} denote the parameters of the standard NMT, where θ_E and θ_D are the parameters of the standard encoder and decoder (including the attention model) respectively. For M-RefNet, the stage following NMT training is to obtain the weight vectors γ and anchor points C related to all training samples. To train B-RefNet efficiently, we fix the trained parameters of the standard NMT and only update the added parameters θ_B, including all weight matrices and biases related to local coordinate coding (Eq. 19 and Eq. 21). The training objective combines the translation likelihood with the hinge loss, where a hyper-parameter λ balances the preference between the two terms.
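The second stage can be sketched as a loop that leaves the pre-trained NMT parameters untouched and applies gradient updates only to the LCC-related parameters. The names grad_fn and the simple SGD rule are illustrative stand-ins for the actual optimizer and objective.

```python
def stagewise_train(base_params, lcc_params, batches, grad_fn, lr=0.1):
    """Stage-wise training: base_params stay frozen; only lcc_params are
    updated from the gradients of the combined objective."""
    for batch in batches:
        # grad_fn may read base_params but only returns gradients for lcc_params
        grads = grad_fn(base_params, lcc_params, batch)
        for name in lcc_params:
            lcc_params[name] = lcc_params[name] - lr * grads[name]
    return lcc_params
```

Freezing the base parameters keeps the second stage cheap, since only the small set of added parameters receives updates.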
During inference, all parameters related to LCC are fixed. Therefore, our work can be regarded as a static approach compared with conventional document-level NMT. That means the final translation is only affected by the reference corpus, not by the sentences surrounding the current one to translate. Naturally, this leaves the question of how the choice of reference corpus influences the quality of translations. We leave this for future work and only use the training corpus in this paper.

Experiments
We evaluate the reference network models on two translation tasks, NIST Chinese-English translation (Zh-En) and WMT English-German translation (En-De).

Settings
Datasets For Zh-En, we choose 1.25M sentence pairs from the LDC dataset 1 with 34.5M English words and 27.9M Chinese words. NIST MT02 is chosen as the development set, and NIST MT05/06/08 as the test sets. Sentences with more than 50 words are filtered out and the vocabulary size is limited to 30k. We use case-insensitive BLEU to evaluate Zh-En translation performance. For En-De, the training set is from (Luong et al., 2015) and contains 4.5M bilingual pairs with 116M English words and 100M German words. BPE (Sennrich et al., 2016) is employed to split the sentence pairs into subwords and we limit the vocabulary to 40k subword units. Newstest2012/2013 are chosen for development and Newstest2014 for testing. Case-sensitive BLEU 2 is employed as the evaluation metric.
Models We evaluate our RefNet with different structures on Zh-En and En-De. For Zh-En, we choose the typical attention-based recurrent NMT model (Bahdanau et al., 2015) as initialization, which consists of a bi-directional RNN-based encoder and a one-layer RNN decoder. The dimensions of the embeddings and hidden states are 620 and 1000 respectively. For En-De, the deep linear associative unit model (DeepLAU) (Wang et al., 2017b) is chosen as the base model. Both the encoder and decoder consist of 4-layer LAUs. All embeddings and hidden states are 512-dimensional vectors. Moreover, we use layer normalization (Ba et al., 2016) on all layers. For both architectures, the number of anchor points is 100 for M-RefNet and 30 for B-RefNet. The anchor dimension of B-RefNet is set to 100. The hyper-parameter λ in Eq. 25 is set to 1. The norm of the gradient is clipped to be within [−1, 1] and dropout is applied to the embedding and output layers with rates 0.2 and 0.3 respectively. When generating translations, we utilize beam search with beam size 10 on Zh-En and 8 on En-De.

Results on Chinese-English Translation
The standard attention-based NMT model is chosen as the baseline and as the initialization of our models. Moreover, we also list the results of the open-source Dl4mt and our re-implementations of the following related work for comparison:

• Cross-sentence context-aware NMT (CS-NMT) (Wang et al., 2017a): A cross-sentence NMT model that incorporates the historical representation of three previous sentences into the decoder.
• LC-NMT (Jean et al., 2017): An NMT model that concurrently encodes the previous and current source sentences as context, added to the decoder states.
• NMT augmented with a continuous cache (CC-NMT): An NMT model armed with a cache 3 which stores the recent translation history.
• Document Context NMT with Memory Networks (DC-NMT) (Maruf and Haffari, 2018): A document-level NMT model that stores all source and target sentence representations of a document to guide translation generation 4.
All the re-implemented systems share the same settings with ours for fair comparisons.

Main Results
Results on Zh-En are shown in Table 2. The baseline NMT significantly outperforms the open-source Dl4mt by 2.43 BLEU points, indicating that the baseline is strong. Our proposed M-RefNet and B-RefNet improve the baseline NMT by 2.34 and 2.69 BLEU respectively, and by up to 2.90 and 3.17 BLEU on NIST MT06, which confirms the effectiveness of our proposed reference networks. Overall, B-RefNet achieves the best performance over all test sets.

Compared with the related work which incorporates document-level information into NMT, our proposed models still have a significant advantage. Compared to the best performance achieved by the related work (CC-NMT), M-RefNet and B-RefNet outperform it over all test sets and gain improvements of 0.77 BLEU and 1.02 BLEU on average. The possible reason is that all the related work only leverages a small range of the document-level information, limited by model complexity and time consumption. In contrast, our models are capable of expressing all the information with more abstract representations. According to the results, though the information is deeply compressed in our models, it is still effective.

3 Cache size is set to 25.
4 The LDC training corpora contain natural document boundaries. However, the document ranges are not clear for the NIST test data. We use clustering and regard each class as a document. The dimension of the document context is set to 1024.

Table 3: Statistics of parameters, training speed (sentences/minute) and testing speed (words/second).

Analysis
Parameters and Speed The number of parameters and the speed of each model are listed in Table 3. It can be seen that M-RefNet only introduces 6.6M additional parameters, while B-RefNet introduces a relatively larger number of parameters (14M). Considering the training process, both M-RefNet and B-RefNet are quite efficient, and their training speeds are only slightly slower than that of CC-NMT, as the amount of added parameters is quite small compared to the baseline NMT and the related systems. In terms of decoding, neither of the proposed models slows down the translation speed noticeably, and M-RefNet achieves the fastest speed over all systems except the baseline NMT. The reason is that our models do not incorporate additional previous sentences, interact with extra memory, or require an extra process to fill the memory cells, as the related document-level systems do.
Length Analysis We follow (Luong et al., 2015) to group sentences of similar lengths and compute the BLEU score of each group, as shown in Figure 3. The reason for the drop in BLEU in the last group (>50) is that sentences longer than 50 words are removed during training. From this figure, we can see that our proposed models outperform the baseline NMT in all length ranges. Moreover, the lengths of translations generated by M-RefNet and B-RefNet are closer to those of the references compared with the baseline NMT.

Case Study Table 4 shows translation examples on Zh-En. In the first case, the Chinese word "lichang" (standpoint, position, or policy) is incorrectly interpreted as "stand on" by the baseline NMT. Both M-RefNet and B-RefNet generate legible translations, while the translation from B-RefNet is more precise. This is because the word pair ("lichang", "policy") appears somewhere in the training data and is leveraged by the systems according to the contexts. This phenomenon is similar in the second case. The translation given by the baseline NMT is not readable. In contrast, M-RefNet generates the core verb "strengthened" and B-RefNet provides a more accurate collocation, "stepped up patrols".
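The length bucketing used in the analysis above can be sketched as follows. The bucket edges are illustrative; a real evaluation would then compute BLEU separately on each bucket.

```python
def length_buckets(examples, edges=(10, 20, 30, 40, 50)):
    """Group examples by source length: one bucket per upper edge, plus a
    final bucket for sentences longer than the last edge."""
    bounds = list(edges) + [float("inf")]
    buckets = {b: [] for b in bounds}
    for src_len, example in examples:
        for b in bounds:
            if src_len <= b:
                buckets[b].append(example)
                break
    return buckets
```

Each example falls into exactly one bucket, the first whose upper edge is not exceeded by its source length.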

Results on English-German Translation
On this task, DeepLAU (Wang et al., 2017b) is chosen as the baseline and also used as the pre-trained model. We list the translation performance of our models alongside some existing NMT systems, including ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017), which have much deeper architectures with considerably more parameters. Since the reference networks do not rely on the recurrent structure, one interesting future direction is to apply our methods to such complicated models to bring further improvements.

Document-level Neural Machine Translation
There are a few works that consider document-level contextual information to improve typical NMT. Jean et al. (2017) propose to use an additional encoder to generate the latent representation of the previous sentence as extra context for the decoder, and the attention mechanism is also applied between the decoder state and the previous context to get access to word-level information of the previous sentence. Contemporaneously, Wang et al. (2017a) introduce hierarchical encoders at the word level and sentence level respectively; the last hidden states of the encoders are taken as the summarization of a previous sentence and of the group of previous sentences. Bawden et al. (2018) employ multiple encoders to summarize the antecedent and propose to combine the contexts with a gated function. However, these incorporated extra encoders bring a large amount of parameters and slow down the translation speed. Another line of work proposes to modify NMT with a light-weight key-value memory to store the translation history. However, due to the limitation of the memory size, the very short view of the previous timesteps (25 in that work) is not sufficient to model the document-level contextual information. Additionally, Maruf and Haffari (2018) propose to capture the global source and target context of an entire document with a memory network (Graves et al., 2014; Wang et al., 2016). Nevertheless, since the number of sentence pairs in a document could be enormous, storing all sentences with memory components can be very time- and space-consuming. More recently, Miculicich et al. (2018), among others, propose to improve Transformer by encoding previous sentences with extra encoders. The reference book in this work can be regarded as a special kind of document context. However, there are two major differences between our approach and the above work. First, we encode the entire corpus into a handful of anchor points, which is much more light-weight yet concentrated enough to capture the global contextual information. Second, the global contexts in this work are static.
That means that, given a sentence to translate, the final translation result depends only on the reference corpus, not on the sentences surrounding the current one.
Local Coding There are a number of works on manifold learning (Roweis and Saul, 2000; Van Gemert et al., 2008; Yu et al., 2009; Ladicky and Torr, 2011). Manifold learning methods approximate any point on the latent manifold with a linear combination of a set of localized anchor points, relying on the assumption that high-dimensional inputs usually lie on a lower-dimensional manifold. Agustsson et al. (2017) integrate local coding into deep neural networks for age prediction from images, and Cao et al. (2018) exploit LCC for GANs (Goodfellow et al., 2014) to capture the local information of data. All these works focus on applications in Computer Vision, while we apply LCC to a Natural Language Processing task. To our knowledge, this is the first attempt to incorporate local coding into NMT modeling.

Conclusion and Future Work
In this work, we propose two models to improve the translation quality of NMT, inspired by the common human behaviors of paraphrasing and consulting. The monolingual model simulates the paraphrasing process by utilizing the global source information, while the bilingual model provides a referable target word based on other sentence pairs in the training corpus. We conduct experiments on Chinese-English and English-German tasks, and the experimental results demonstrate the effectiveness and efficiency of our methods.
In the future, we would like to investigate the feasibility of our methods on non-recurrent NMT models such as Transformer (Vaswani et al., 2017). Moreover, we are also interested in incorporating discourse-level relations into our models.