CASICT-DCU Neural Machine Translation Systems for WMT17

We participated in the WMT 2017 shared news translation task on the English ↔ Chinese language pair. Our systems are based on the encoder-decoder neural machine translation model with the attention mechanism. We employ the Gated Recurrent Unit (GRU) with the linear associative connection to build a deep encoder, and address unknown words with a dictionary replacement approach. The dictionaries are extracted from the parallel training data with an unsupervised word alignment method. During decoding, the translation probabilities of the target word from different models are averaged as the ensemble strategy. In this paper, we describe our systems in detail, from data preprocessing to post-editing.


Introduction
We build the CASICT-DCU neural machine translation systems for the WMT17 English ↔ Chinese news translation task. Our systems are based on the encoder-decoder model with the attention mechanism, also known as the RNNSearch model (Bahdanau et al., 2015). To construct the deep RNN, we employ the Gated Recurrent Unit (Cho et al., 2014b) with the linear associative connection (Wang et al., 2017) to ensure fluent gradient propagation. The Adadelta algorithm (Zeiler, 2012) is used to optimize the parameters, and stochastic gradient descent with a small learning rate is used in the fine-tuning stage. We extract dictionaries from the parallel training data with an unsupervised method and use them, together with the word alignment vector, to address unknown words in the target translation. During decoding, the ensemble strategy combines the translation probabilities of the target word from different models.

System Description
The neural machine translation (NMT) model (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b) aims to capture translation knowledge by training a neural network end to end. Our systems are built on the RNNSearch neural machine translation model. Formally, given a source sentence $x = x_1, ..., x_m$ and a target sentence $y = y_1, ..., y_n$, NMT models the translation probability as
$$P(y \mid x) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, x), \tag{1}$$
where $y_{<t} = y_1, ..., y_{t-1}$. The generation probability of $y_t$ is
$$P(y_t \mid y_{<t}, x) = g(y_{t-1}, s_t, c_t), \tag{2}$$
where $g(\cdot)$ is a softmax regression function, $y_{t-1}$ is the newly translated target word and $s_t$ is the decoder hidden state, which represents the translation status. The attention $c_t$ denotes the source words relevant to generating $y_t$ and is computed as the weighted sum of the source representation $h$ under an alignment vector $\alpha_t$, as shown in Eq. (3):
$$c_t = \sum_{j=1}^{m} \alpha_{t,j} h_j, \tag{3}$$
where $\alpha_t$ is produced by the align(·) function, a feedforward network with softmax normalization.
The hidden states $s_t$ are updated as
$$s_t = f(s_{t-1}, y_{t-1}, c_t), \tag{4}$$
where $f(\cdot)$ is a recurrent function.
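The weighted sum behind the attention $c_t$ can be sketched in a few lines of plain Python. This is only an illustration: the raw scores stand in for the outputs of the align(·) network, and the names `softmax` and `context_vector` are hypothetical helpers, not part of our system.

```python
import math

def softmax(scores):
    """Normalize raw alignment scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, source_states):
    """Return (alpha_t, c_t): alignment weights and the weighted sum of h."""
    alpha = softmax(scores)
    dim = len(source_states[0])
    c_t = [sum(a * h[k] for a, h in zip(alpha, source_states))
           for k in range(dim)]
    return alpha, c_t

# Toy example: three source positions with 2-dimensional annotations h_j.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_scores = [0.0, 0.0, 0.0]  # assumed outputs of the align(.) network
alpha_t, c_t = context_vector(raw_scores, h)
```

With equal scores every source position receives the same weight, so $c_t$ is simply the mean of the annotations.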
We adopt a variant of the attention mechanism 1 in our systems, which is implemented with two recurrent functions $f_1(\cdot)$ and $f_2(\cdot)$.
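One plausible reading of this variant, assuming it follows the conditional recurrent layer in the tutorial code referenced by footnote 1, is sketched below. The intermediate state $\tilde{s}_t$, the score function $a(\cdot)$ and the argument lists are assumptions made for illustration and are not taken from the text above.

```latex
\begin{align*}
\tilde{s}_t  &= f_1(s_{t-1}, y_{t-1}) \\
e_{t,j}      &= a(\tilde{s}_t, h_j) \\
\alpha_{t,j} &= \frac{\exp(e_{t,j})}{\sum_{k=1}^{m} \exp(e_{t,k})} \\
c_t          &= \sum_{j=1}^{m} \alpha_{t,j} h_j \\
s_t          &= f_2(\tilde{s}_t, c_t)
\end{align*}
```

Under this reading, the attention is computed from an intermediate state that has already seen $y_{t-1}$, rather than from $s_{t-1}$ directly.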
To construct a deep network, we use the linear associative unit (LAU) to ensure fluent gradient propagation. In the LAU computation (Wang et al., 2017), $W_*$ are the weight matrices, $x_t$ is the input at time $t$ and $h_{t-1}$ is the hidden state at time $t-1$.
The LAU lets the input propagate forward linearly at a certain scale, which keeps gradient back-propagation fluent. It works like residual connections (He et al., 2016) and fast-forward connections (Zhou et al., 2016) and makes it possible to build deep networks. Our encoder is a 4-layer LAU network in which forward and backward LAU layers are stacked alternately. The general architecture of our systems is shown in Figure 1.
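The alternating stack can be illustrated with a minimal sketch. It assumes that the first layer reads left to right and that each layer consumes the outputs of the layer below; `toy_cell` is a hypothetical placeholder for one recurrent step and does not implement the LAU equations.

```python
import math

def toy_cell(x, h_prev):
    """Hypothetical stand-in for one recurrent step (NOT the real LAU)."""
    return [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, h_prev)]

def run_layer(inputs, reverse, cell=toy_cell):
    """Run one recurrent layer left-to-right, or right-to-left if reverse."""
    n = len(inputs)
    order = range(n - 1, -1, -1) if reverse else range(n)
    h = [0.0] * len(inputs[0])
    outputs = [None] * n
    for i in order:
        h = cell(inputs[i], h)
        outputs[i] = h  # keep outputs aligned with source positions
    return outputs

def encode(embeddings, num_layers=4):
    """Stack layers whose direction alternates: forward, backward, ..."""
    states = embeddings
    for layer in range(num_layers):
        states = run_layer(states, reverse=(layer % 2 == 1))
    return states

source = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]  # toy word embeddings
annotations = encode(source)
```

Each source position ends up with one annotation vector that has passed through two forward and two backward layers.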

Pipeline Description
We introduce the pipeline of building the translation systems, from data preprocessing to post-editing, in this section.
1 https://github.com/nyu-dl/dl4mt-tutorial/tree/master/session2

Data Preprocessing
For the English ↔ Chinese news translation task, WMT 2017 provides three parts of data: News Commentary v12, UN Parallel Corpus V1.0 and CWMT Corpus. We used all corpora to train our translation systems. English sentences are tokenized with the Moses tokenization script. Chinese sentences are segmented with our in-house word segmentor "PBCLAS". The word segmentation criterion follows the Chinese People's Daily format. We filter out duplicated sentences and sentences that are too long (more than 120 words) or too short (fewer than 5 words). The training corpus is case-sensitive.
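The filtering step can be sketched as follows. This is a minimal illustration that assumes the length limits apply to both sides of a sentence pair and that duplicates are detected on exact pair matches; `filter_corpus` is a hypothetical helper, not our actual script.

```python
def filter_corpus(pairs, min_len=5, max_len=120):
    """Drop duplicate pairs and pairs with a side outside [min_len, max_len]."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        lengths = (len(src.split()), len(tgt.split()))
        if all(min_len <= n <= max_len for n in lengths):
            kept.append((src, tgt))
    return kept

pairs = [
    ("a b c d e", "甲 乙 丙 丁 戊"),
    ("a b c d e", "甲 乙 丙 丁 戊"),  # duplicate, removed
    ("too short", "太 短"),           # fewer than 5 words, removed
]
filtered = filter_corpus(pairs)
```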

Vocabulary
Our systems are based on words rather than sub-words (Sennrich et al., 2016; Wu et al., 2016). Because our systems are trained serially on a single GPU with limited memory, the source vocabulary size is set to 100,000 and the target vocabulary size is set to 50,000. Words outside the vocabulary are represented by the "UNK" symbol.
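A frequency-based vocabulary with an "UNK" fallback can be sketched as below. The sketch assumes the vocabulary keeps the most frequent words and that "UNK" does not count toward the size limit; `build_vocab` and `apply_vocab` are hypothetical names, and the toy limit of 3 stands in for the 100,000 and 50,000 used in our systems.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words of the corpus."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return {word for word, _ in counts.most_common(max_size)}

def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary word with the UNK symbol."""
    return " ".join(w if w in vocab else UNK for w in sentence.split())

corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, max_size=3)
encoded = apply_vocab("the dog sat", vocab)
```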

Training Details
The sentence length for training is up to 120. The word embedding dimension is set to 512 and the hidden layer size is 512. Square matrices are initialized in a random orthogonal way. Non-square matrices are initialized by sampling each element from the Gaussian distribution with mean 0 and variance $0.01^2$. All biases are initialized to 0. Parameters are updated by mini-batch gradient descent, and the learning rate is controlled by the AdaDelta algorithm with the decay constant ρ = 0.95 and the denominator constant ε = 1e−6. The batch size is 80. We use stochastic gradient descent with a small learning rate of 0.0001 to fine-tune the models. Dropout (Srivastava et al., 2014) is applied to the output layer with a dropout rate of 0.5 to avoid over-fitting. Gradients of the cost function whose L2 norm is larger than a predefined threshold of 1.0 are rescaled to the threshold to avoid gradient explosion (Pascanu et al., 2013). We apply length normalization (Cho et al., 2014a) to candidate translations, and the beam size for decoding is 12.
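The gradient rescaling and the AdaDelta update described above can be sketched on plain Python lists. This is an illustrative simplification, not our training code: parameters are flattened into one list, and `clip_by_norm` and `adadelta_step` are hypothetical helpers that use the threshold 1.0, ρ = 0.95 and ε = 1e−6 stated in the text.

```python
import math

def clip_by_norm(grads, threshold=1.0):
    """Rescale the gradient to the threshold if its L2 norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads]
    return list(grads)

def adadelta_step(params, grads, state, rho=0.95, eps=1e-6):
    """One AdaDelta update (Zeiler, 2012); state holds running averages."""
    for i, g in enumerate(grads):
        state["g2"][i] = rho * state["g2"][i] + (1 - rho) * g * g
        delta = -(math.sqrt(state["dx2"][i] + eps)
                  / math.sqrt(state["g2"][i] + eps)) * g
        state["dx2"][i] = rho * state["dx2"][i] + (1 - rho) * delta * delta
        params[i] += delta
    return params

params = [0.5, -0.3]
state = {"g2": [0.0, 0.0], "dx2": [0.0, 0.0]}
grads = clip_by_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 1.0
params = adadelta_step(params, grads, state)
```

Clipping preserves the direction of the gradient and only shrinks its length, so the update still moves each parameter against its gradient.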

UNK Replace
As the vocabulary sizes are restricted, target sentences may contain "UNK" symbols, which leads to sense ambiguity. We therefore extract a dictionary to replace the "UNK" symbols in the target sentence. We use the "fast_align" word alignment tool to generate the word alignment and extract the dictionary by keeping, for each source word, the translation with the highest translation probability. We extract English → Chinese and Chinese → English dictionaries in this way.
At the decoding stage of NMT, we regard the source word with the highest alignment probability as the one that generates the target word. Once a "UNK" symbol is generated, we locate the corresponding source word and translate it with the dictionary. If the source word is not in the dictionary, it is copied into the target sentence as is.
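The two steps, dictionary extraction and UNK replacement, can be sketched as follows. The input format is an assumption: `lex_probs` is a hypothetical table of lexical translation probabilities derived from the word alignment, and `alignments[t]` is the alignment vector over source positions at target step t.

```python
def build_dictionary(lex_probs):
    """Keep, for each source word, its most probable translation.

    lex_probs maps (source_word, target_word) -> translation probability.
    """
    best = {}
    for (src, tgt), p in lex_probs.items():
        if src not in best or p > best[src][1]:
            best[src] = (tgt, p)
    return {src: tgt for src, (tgt, _) in best.items()}

def replace_unk(target_words, alignments, source_words, dictionary):
    """Replace each UNK with the translation of its most-aligned source word."""
    output = []
    for word, alpha in zip(target_words, alignments):
        if word == "UNK":
            j = max(range(len(alpha)), key=alpha.__getitem__)
            src = source_words[j]
            # Fall back to copying the source word if it has no entry.
            word = dictionary.get(src, src)
        output.append(word)
    return output

lex = {("Beijing", "北京"): 0.9, ("Beijing", "京"): 0.1}
dictionary = build_dictionary(lex)
source = ["I", "love", "Beijing", "Zxq"]
target = ["我", "爱", "UNK", "UNK"]
attention = [
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.10],
    [0.00, 0.10, 0.10, 0.80],
]
result = replace_unk(target, attention, source, dictionary)
```

In this toy case the first "UNK" is translated through the dictionary, while the second is aligned to a source word without an entry and is therefore copied.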

Model Ensemble
To increase the diversity of the systems, we train several models and combine them with the ensemble strategy. These models are initialized with different weight parameters. Each model produces a probability distribution over the target vocabulary at each decoding step. These probability distributions are averaged to give the final distribution used for beam search. For our UNK replace strategy, the word alignment vectors produced by the models are also averaged to determine the corresponding source word.
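The averaging step can be sketched as below, assuming a plain arithmetic mean with equal weight per model; `average_distributions` is a hypothetical helper. The same function applies to the alignment vectors used for UNK replacement.

```python
def average_distributions(distributions):
    """Element-wise arithmetic mean of equally sized probability vectors."""
    n = len(distributions)
    size = len(distributions[0])
    return [sum(d[k] for d in distributions) / n for k in range(size)]

# Toy example: three models, a target vocabulary of three words.
model_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]
ensemble = average_distributions(model_probs)
best_word = max(range(len(ensemble)), key=ensemble.__getitem__)
```

Because each input sums to 1, the averaged vector is again a valid probability distribution and can be passed to beam search unchanged.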

English to Chinese
We ensemble 5 models for English to Chinese translation. The performance of the system on the validation set is presented in Table 1. We find that the ensemble strategy brings an improvement of +0.86 BLEU points and the UNK replace approach provides a further +1.57 BLEU points.

Chinese to English
We ensemble 6 models for Chinese to English translation. As with English to Chinese translation, the ensemble and UNK replace approaches enhance the system performance over a single model. The ensemble strategy improves the system by +0.74 BLEU points and the UNK replace approach achieves a further gain of +0.51 BLEU points.

Conclusion
We present the CASICT-DCU neural machine translation systems for the WMT17 shared news translation task on the English ↔ Chinese language pair. The Gated Recurrent Unit (GRU) with the linear associative connection is employed to build the deep encoder. We extract dictionaries from the parallel training data with an unsupervised word alignment approach. We locate the source word that generates the "UNK" symbol in the target sentence according to the word alignment vector and translate it with the dictionary. During decoding, the translation probabilities of the target word from different models are averaged as the ensemble strategy to further improve the performance.