University of Rochester WMT 2017 NMT System Submission

We describe the neural machine translation system submitted by the University of Rochester for the Chinese-English language pair of the WMT 2017 news translation task. We applied unsupervised word and subword segmentation techniques and deep learning in order to address (i) the word segmentation problem caused by the lack of delimiters between words and phrases in Chinese and (ii) the morphological and syntactic differences between Chinese and English. We integrated promising recent developments in NMT, including back-translation, language model reranking, subword splitting, and minimum risk tuning.


Introduction
This paper presents the machine translation (MT) systems submitted by the University of Rochester to the WMT 2017 news translation task. We participated in the Chinese-to-English and Latvian-to-English news translation tasks, but focus here on describing the system submitted for the Chinese-to-English task.
Chinese-to-English is a particularly challenging language pair for corpus-based MT systems because of the need to find an optimal word segmentation for Chinese sentences, as well as other linguistic differences between Chinese and English. For example, characters may have multiple possible meanings depending on their context, and individual characters can be joined together to build compound words; both properties exacerbate the segmentation problem. Additionally, translation performance is affected by the frequent dropping of subjects and the infrequent use of function words in Chinese sentences. We used both word-level and morphological feature-based representations of Chinese to deal with data sparsity and reduce the size of the Chinese vocabulary. We experimented with both subphrase-based and character-based systems. Both RNN-based and 5-gram language models were trained on data extracted from the provided English news corpora and were used to rerank hypotheses proposed by the decoder.
The paper is organized as follows: in Section 2 we introduce our system and our preprocessing methods for the Chinese language. The training settings of our main learning framework are explained in Section 3. Our NMT, SMT, and submission results are presented in Section 4. The paper ends with some concluding remarks.

System Description
In this section we briefly introduce our preprocessing methods and the general encoder-decoder framework with attention (Sutskever et al., 2014) used in our system. We closely followed the neural machine translation model proposed by Chorowski et al. (2015).
A neural machine translation model (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) aims at building an end-to-end neural network framework, which takes as input a source sentence $X = (x_1, \ldots, x_{T_X})$ of length $T_X$ and outputs its translation $Y = (y_1, \ldots, y_{T_Y})$ of length $T_Y$, where $x_t$ and $y_t$ are the source and target language tokens, respectively. The framework is constructed as a composite of an encoder network and a decoder network.

Morphological Analyzer
Word segmentation is considered an important first step for Chinese natural language processing tasks since individual Chinese words can be composed of multiple characters with no space appearing between words.
We employed the Jieba morphological analyzer (Junyi, 2013) to segment the source Chinese sentences into words. Jieba decomposes Chinese sentences into sequences of words by constructing a graph of all possible word combinations and finding the most probable sequence based on statistics derived from training data. For unknown words, an HMM-based model is used with the Viterbi algorithm.
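The graph search described above can be illustrated with a small dynamic program over all dictionary words that start at each position. This is a minimal sketch and not Jieba's actual implementation: the function name `segment`, the toy frequency table, the `max_word_len` cap, and the pseudo-count given to unseen single characters (a crude stand-in for the HMM fallback) are all assumptions made for illustration.

```python
import math

def segment(sentence, word_freq, max_word_len=4):
    """Return the most probable segmentation under a unigram word model."""
    total = max(1, sum(word_freq.values()))
    n = len(sentence)
    # best[i] = (log-probability of the best segmentation of sentence[i:],
    #            end index of the first word in that segmentation)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_word_len) + 1):
            word = sentence[i:j]
            # Assumption: unseen single characters get a pseudo-count of 1,
            # so every sentence remains segmentable without an HMM.
            freq = word_freq.get(word, 1 if j == i + 1 else 0)
            if freq > 0:
                candidates.append((math.log(freq / total) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end indices to read off the best path.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

# Hypothetical toy counts; a real dictionary would come from training data.
toy_freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "起源": 30}
print(segment("研究生命起源", toy_freq))  # ['研究', '生命', '起源']
```

With these toy counts, the path "研究 / 生命 / 起源" scores higher than the competing path "研究生 / 命 / 起源", which shows how corpus statistics resolve an ambiguous character sequence.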

Rare-Morpheme (BPE) Algorithm
If we simply apply the Chinese morphological analyzer to segment Chinese sentences into individual words and feed those words into our encoder, overfitting will occur: some words are so rare that they only ever appear together with particular other words. We therefore applied a frequency threshold to the word vocabulary and used the byte-pair-encoding (BPE) algorithm, proposed by Gage (1994) and applied to NMT by Sennrich et al. (2016b), to further reduce the sparsity of our language data and the number of rare and out-of-vocabulary tokens.
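The merge-learning step of BPE can be sketched as follows. This is an illustrative toy version in the spirit of the procedure described by Sennrich et al. (2016b), not the code used for the submission: the function names, the `</w>` end-of-word marker, the tie-breaking rule, and the example word counts are assumptions.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freq, num_merges):
    """Learn up to `num_merges` merge operations from word counts."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freq.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
```

The number of merge operations controls the size of the resulting symbol vocabulary, so frequent words stay intact while rare words are split into smaller, more frequent units.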

Encoder
The encoder reads a sequence of source language tokens $X = (x_1, \ldots, x_{T_X})$ and outputs a sequence of hidden states $H = (h_1, \ldots, h_{T_X})$. A bidirectional recurrent neural network (BiRNN), consisting of a forward recurrent neural network (RNN) and a backward RNN, is used to give additional positional representational power to the encoder. The lower part of Figure 1 illustrates the BiRNN structure.
The forward network reads the input sentence in the forward direction. For each input token $x_t$, $i_x(\cdot) : X \to \mathbb{R}^n$ is a continuous embedding that maps the $t$-th input token to a vector $i_x(x_t)$ in a high-dimensional space $\mathbb{R}^n$. A forward recurrent activation function $\overrightarrow{\phi}_x$ updates each forward hidden state $\overrightarrow{h}_t$ using the embedded token $i_x(x_t)$ and the previous hidden state $\overrightarrow{h}_{t-1}$. Similarly, the reverse network reads the sentence in the reverse direction (right to left) and generates a sequence of backward hidden states.
The encoder utilizes information from both the forward RNN and the backward RNN to generate the hidden states $H = (h_1, \ldots, h_{T_X})$. For every input token $x_t$, we concatenate its corresponding forward hidden state vector and the backward hidden state vector to form the annotation $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.
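The bidirectional encoding described in this section can be sketched in plain Python. This is only an illustration under simplifying assumptions: a plain tanh recurrence stands in for the recurrent activation function (which the text does not specify), and the function names and weight layout are hypothetical.

```python
import math

def rnn_states(embeddings, w_in, w_rec):
    """Run a simple tanh RNN over a list of vectors; return every hidden state."""
    n = len(w_rec)
    h = [0.0] * n
    states = []
    for x in embeddings:
        h = [math.tanh(sum(w_in[i][k] * x[k] for k in range(len(x)))
                       + sum(w_rec[i][j] * h[j] for j in range(n)))
             for i in range(n)]
        states.append(h)
    return states

def birnn_encode(embeddings, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for each position."""
    forward = rnn_states(embeddings, *fwd_params)
    # The backward RNN reads right to left; reverse its output again so that
    # backward[t] lines up with forward[t].
    backward = rnn_states(embeddings[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Three embedded tokens of dimension 2, hidden size 2 in each direction.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
params = ([[0.5, -0.5], [0.25, 0.75]], [[0.1, 0.0], [0.0, 0.1]])
H = birnn_encode(emb, params, params)
print(len(H), len(H[0]))  # 3 4
```

Each annotation therefore summarizes both the tokens preceding a position and the tokens following it.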

Decoder
The upper part of Figure 1 illustrates the decoder. The decoder computes the conditional distribution over all possible translations based on the context information provided by the encoder. More specifically, the decoder RNN tries to find a sequence of tokens in the target language that maximizes the following probability:
$$p(Y \mid X) = \prod_{t=1}^{T_Y} p(y_t \mid y_1, \ldots, y_{t-1}, X) \tag{3}$$
Each hidden state $s_t$ in the decoder is updated by
$$s_t = \phi_y\big(i_y(y_{t-1}), s_{t-1}, c_t\big) \tag{4}$$
where $\phi_y$ is the recurrent activation function of the decoder and $i_y$ is the continuous embedding of a token in the target language. $c_t$ is a context vector related to the $t$-th output token, such that
$$c_t = \sum_{l=1}^{T_X} a_{tl} h_l \tag{5}$$
$$a_{tl} = \frac{\exp(e_{tl})}{\sum_{k=1}^{T_X} \exp(e_{tk})} \tag{6}$$
Here, $a_{tl}$ indicates the importance of the hidden state annotation $h_l$ with respect to the previous hidden state $s_{t-1}$ in the decoder RNN. $e_{tk}$ measures how well the input at position $k$ and the output at position $t$ match (Chorowski et al., 2015); it is defined by a soft alignment model $f_{\text{align}}$, such that
$$e_{tk} = f_{\text{align}}(s_{t-1}, h_k) \tag{7}$$
Finally, each conditional probability in Equation 3 is generated by
$$p(y_t \mid y_1, \ldots, y_{t-1}, X) = g(y_{t-1}, s_t, c_t) \tag{8}$$
for some nonlinear function $g$.

Attention Mechanism
The soft-alignment mechanism $f_{\text{align}}$ weighs each vector in the context set $C = (c_1, \ldots, c_{T_Y})$ according to its relevance given what has been translated (Sutskever et al., 2014). It is commonly implemented as a feedforward neural network with a single hidden layer. This procedure can be understood as computing the alignment probability between the $t$-th target symbol and the $k$-th source symbol. The decoder hidden state $s_t$, together with the previous target symbol $y_{t-1}$ and the context vector $c_t$, is fed into a feedforward neural network to produce the conditional distribution over the next target symbol. The whole network, consisting of the encoder, decoder, and soft-alignment mechanism, is then tuned end-to-end to minimize the negative log-likelihood using stochastic gradient descent. In our system, the source sentence $X$ is a sequence of sub-phrase and sub-word tokens extracted by the morphological analyzer and BPE algorithms, and the target sentence $Y$ is represented as a sequence of sub-words.
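A single soft-alignment step can be sketched as follows: each source annotation is scored against the previous decoder state with a one-hidden-layer feedforward network, the scores are normalized with a softmax, and the annotations are averaged with the resulting weights. The parameter names `W_s`, `W_h`, and `v` and the additive form of the scorer are assumptions made for illustration; the text only states that the scorer is a feedforward network with a single hidden layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def f_align(s_prev, h, W_s, W_h, v):
    """Hypothetical scorer: v . tanh(W_s s_prev + W_h h)."""
    hidden = [math.tanh(sum(ws * s for ws, s in zip(W_s[i], s_prev))
                        + sum(wh * x for wh, x in zip(W_h[i], h)))
              for i in range(len(v))]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def attend(s_prev, annotations, W_s, W_h, v):
    """Return the attention weights and the context vector for one decoder step."""
    scores = [f_align(s_prev, h, W_s, W_h, v) for h in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context

weights, context = attend(
    s_prev=[1.0, 0.0],
    annotations=[[1.0, 2.0], [0.5, -1.0], [0.0, 0.0]],
    W_s=[[0.1, 0.2]], W_h=[[0.3, -0.1]], v=[1.0],
)
print(weights, context)
```

Because the weights are a probability distribution over source positions, they can be read as a soft alignment between the current target token and the source tokens.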

Minimum Risk Tuning
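We applied minimum risk training (Shen et al., 2016) to tune the model parameters after the cross-entropy loss had converged, by minimizing the expected risk with respect to sentence-level BLEU, where the risk is defined to be
$$R(\theta) = \sum_{s} \sum_{y \in Y(x^{(s)})} P(y \mid x^{(s)}; \theta)\, \Delta(y, y^{(s)}) \tag{10}$$
Here $Y(x^{(s)})$ is the set of candidate translations for the source sentence $x^{(s)}$, $y^{(s)}$ is its reference translation, and $\Delta$ is the loss incurred by a candidate relative to the reference. Details regarding methods to solve this problem can be found in Shen et al. (2016).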
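For one source sentence, the expected risk over a finite set of candidates can be computed as in the sketch below. Following Shen et al. (2016), the model probabilities are renormalized over the sampled candidate set and sharpened with an exponent; the function name, the `alpha` default, and the example numbers are assumptions, and the costs stand for a loss such as one minus sentence-level BLEU.

```python
import math

def expected_risk(log_probs, costs, alpha=1.0):
    """Expected cost under model probabilities renormalized over the candidates.

    log_probs: model log-probabilities log P(y | x; theta), one per candidate
    costs:     loss Delta(y, y_ref) per candidate (lower is better)
    alpha:     sharpness of the renormalized distribution (assumed default)
    """
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)                      # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return sum(w / z * c for w, c in zip(weights, costs))

# Three hypothetical candidates with probabilities 0.5, 0.25, 0.25.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.25)]
costs = [0.2, 0.6, 1.0]
print(expected_risk(log_probs, costs))
```

Minimizing this quantity moves probability mass toward candidates with low loss, which is what ties training to the evaluation metric rather than to token-level likelihood.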

Experimental Settings
In this section, we describe the details of the experimental settings for our system.

Corpora and Preprocessing
Our model was trained on all available parallel training corpora for the ZH-EN language pair. The training data consists of approximately 2,000,000 sentence pairs. We removed sentence pairs in which the source or target side is more than 50 tokens long. A set of 50,000,000 sentences was sampled from the News Crawl 2007-15 data and used to train our target-side (English) language model. Additionally, we back-translated a subset of these sentences and used the resulting source-target sentence pairs to augment our training data. Our training and development data were lowercased and preprocessed using the Moses tokenizer script (Koehn et al., 2007), Jieba, and BPE. We set the upper bound on the target vocabulary to 30,000 sub-words, plus two additional tokens reserved for EOS and UNK. For the source vocabulary, we constrained the size of the BPE symbol vocabulary to 30,000 tokens.
Sennrich et al. (2016a) introduced the augmentation of a parallel corpus by leveraging target-side monolingual data and showed empirically that treating back-translations as additional training data reduces overfitting and increases the fluency of the translation model. We sampled monolingual sentences from the same news data used to construct our language models. Due to computation and time constraints, we were only able to augment our training data by an additional 190,000 sentence pairs. We hypothesize that increasing the number of back-translated sentences in our training set will further improve our system's performance.
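The length filter and the capped vocabulary described in this section can be sketched as below. The sketch assumes that sentences are already tokenized and joined with single spaces; the function names and the spellings `<EOS>` and `<UNK>` for the two reserved tokens are hypothetical.

```python
from collections import Counter

def filter_by_length(pairs, max_len=50):
    """Keep sentence pairs whose source and target both have at most max_len tokens."""
    return [(src, tgt) for src, tgt in pairs
            if len(src.split()) <= max_len and len(tgt.split()) <= max_len]

def build_vocab(sentences, max_size=30000, specials=("<EOS>", "<UNK>")):
    """Map the max_size most frequent tokens, plus the reserved tokens, to ids."""
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(vocab)}

pairs = [("我 喜欢 猫", "i like cats"),
         (" ".join(["字"] * 51), "too long on the source side")]
kept = filter_by_length(pairs)
target_vocab = build_vocab([tgt for _, tgt in kept])
print(len(kept), target_vocab)
```

Any token that falls outside the capped vocabulary would be mapped to the reserved unknown-word id at training time.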

Neural Baseline
Our NMT baseline is an encoder-decoder model with attention and dropout implemented with Nematus (Sennrich et al., 2017) and AmuNMT (Junczys-Dowmunt et al., 2016). This baseline system, without pre-tokenization or language model scoring, achieves 17.32 uncased BLEU on news-test2017 and 19.78 after source segmentation with the BPE algorithm.
We used beam search with a beam width of 8 to approximately find the most likely translations given a source sentence before introducing features proposed by our language models and reranking with the default Moses (Koehn et al., 2007) implementation of K-best MIRA (Cherry and Foster, 2012). Both language models were trained on the English news data. Our unigram-pruned 5-gram language model was trained with KenLM (Heafield, 2011), and our RNN-based language model was trained with RNNLM (Mikolov et al., 2011) with a hidden layer size of 300.
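Reranking a k-best list with language model features amounts to rescoring each hypothesis with a weighted sum of its feature values. The sketch below illustrates that step only; the feature names, the weights, and the scores are invented for illustration, and in the system described here the weights would come from K-best MIRA tuning rather than being set by hand.

```python
def rerank(hypotheses, weights):
    """Sort hypotheses by a linear combination of their feature scores."""
    def score(hyp):
        return sum(weights[name] * value for name, value in hyp["features"].items())
    return sorted(hypotheses, key=score, reverse=True)

# Hypothetical log-probability features for two beam-search outputs.
k_best = [
    {"text": "hypothesis a",
     "features": {"nmt": -2.0, "ngram_lm": -10.0, "rnn_lm": -9.0}},
    {"text": "hypothesis b",
     "features": {"nmt": -2.5, "ngram_lm": -6.0, "rnn_lm": -6.0}},
]
weights = {"nmt": 1.0, "ngram_lm": 0.5, "rnn_lm": 0.5}
print(rerank(k_best, weights)[0]["text"])  # hypothesis b
```

In this toy case the decoder prefers the first hypothesis, but the two language models prefer the second strongly enough to change the final ranking.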

Statistical Baseline
For our SMT baseline, we trained a standard phrase-based system on input segmented with Jieba: Berkeley Aligner (IBM Model 1 and HMM, both for 5 iterations); phrase table with up to 5 tokens per phrase, 40-best translation options per source phrase, and Good-Turing smoothing; 4-gram language model and pruning of singleton n-grams; and the default K-best MIRA reordering.
This baseline system achieves an uncased BLEU score of 7.46 on news-test2017.

Experimental Results
We compared the performance of our system to several state-of-the-art algorithms. It can be seen that our system outperformed the baselines, whether using words or subwords as the input tokens. The experiments also showed that the rare-morpheme algorithm significantly reduced potential overfitting compared to the character-level BiRNN.

Error Analysis
Error analysis on the validation set shows that the two main sources of errors produced by the baseline are missing and incorrect words. These issues are addressed in our model by applying morphological segmentation in combination with BPE and by adding new back-translated data to the training set. Our model's translation error rate (0.716) is lower than that of our baseline (0.743). We attribute this reduction in error rate to our system being able to model multi-character words in Chinese more robustly.

Conclusion
We describe the University of Rochester neural machine translation system for the WMT'17 Chinese-English news translation task, which employs recent developments in the machine translation field. Our results show that applying word- and morpheme-aware tokenization, minimum risk tuning, and language model reranking to an existing MT framework helps to improve the overall translation quality of the model. Machine translation is a dynamic area, and there are many opportunities for further exploration:
• Other objectives: Modify the encoder-decoder trainer and add secondary tasks for multi-task training (e.g. source sentence tagging) for explicit use of linguistic features.
• Sentence reordering: Reorder the training data in various ways to encourage the model to learn a more robust translation model.
• Source-side monolingual data: Leverage source-side monolingual data to improve translation performance.