NYU-MILA Neural Machine Translation Systems for WMT’16

We describe the neural machine translation system of New York University (NYU) and University of Montreal (MILA) for the translation tasks of WMT'16. The main goal of the NYU-MILA submission to WMT'16 is to evaluate a new character-level decoding approach in neural machine translation on various language pairs. The proposed neural machine translation system is an attention-based encoder–decoder with a subword-level encoder and a character-level decoder. Because characters are used as tokens, the decoder of the neural machine translation system does not require explicit segmentation. The character-level decoding approach is especially beneficial when translating a source language into morphologically rich languages.


Introduction
Word-level modelling with explicit segmentation has been a standard approach in statistical machine translation systems. This is mainly due to the issue of data sparsity, caused by the exponential growth of the state space as sequences grow longer. This issue becomes much more severe when a sequence is represented with characters. In addition to the data sparsity issue, in linguistics, words or their segmented-out lexemes are usually considered the basic units of meaning, which makes words more suitable units for solving natural language processing tasks.
There are, however, two pressing issues here. The first issue is the absence of a perfect segmentation algorithm for any single language. A perfect segmentation algorithm should be able to segment a given unsegmented sentence into a sequence of lexemes and morphemes. The other issue, which is specific to neural network approaches, is that neural machine translation systems suffer from increased complexity due to the large vocabulary size (Jean et al., 2015; Luong et al., 2015), which does not happen with character-level modelling.
Most issues of word-level modelling can be addressed to a certain extent by switching to finer-grained tokens, e.g., characters. In fact, to a neural network, each and every token in the vocabulary is an independent entity, and the semantics of tokens are simply learned so as to maximize the objective function (Chung et al., 2016). This property gives the neural machine translation system a great deal of freedom in the choice of tokens.
The NYU-MILA neural machine translation system is built on the idea of directly generating characters instead of words, which can free a machine translation system from the need for explicit segmentation as a preprocessing step, a step that is often suboptimal for the translation task. We focus on representing the target sentence as a sequence of characters, and the source sentence as a sequence of subwords (Sennrich et al., 2015). This system description paper summarizes and details the experimental procedure described in Chung et al. (2016).

System Description
In this section, we describe the details of the NYU-MILA neural machine translation system. In our system, we closely follow the neural machine translation model proposed by Bahdanau et al. (2015). A neural machine translation model (Forcada and Ñeco, 1997; Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) aims at building an end-to-end neural network that takes as input a source sentence $X = (x_1, \dots, x_{T_x})$ and outputs its translation $Y = (y_1, \dots, y_{T_y})$, where $x_t$ and $y_{t'}$ are respectively source and target tokens. The neural network is constructed as a composite of an encoder network and a decoder network.
The encoder maps the input sentence X into its continuous representation. A bidirectional recurrent neural network, which consists of two recurrent neural networks (RNNs), is used to give more representational power to the encoder. The forward network reads the input sentence in a forward direction (left to right):
$$\overrightarrow{h}_t = \overrightarrow{\phi}\left(\overrightarrow{h}_{t-1}, e_x(x_t)\right),$$
where $e_x(x_t)$ is a continuous embedding of the $t$-th input symbol, and $\phi$ is a recurrent activation function. Similarly, the reverse network reads the sentence in a reverse direction (right to left):
$$\overleftarrow{h}_t = \overleftarrow{\phi}\left(\overleftarrow{h}_{t+1}, e_x(x_t)\right).$$
At each location in the input sentence, we concatenate the hidden states from the forward and reverse RNNs to form a context set $C = \{z_1, \dots, z_{T_x}\}$, where $z_t = \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right]$. Then the decoder computes the conditional distribution over all possible translations based on this context set. This is done by first rewriting the conditional probability of a translation:
$$\log p(Y \mid X) = \sum_{t'=1}^{T_y} \log p(y_{t'} \mid y_{<t'}, X).$$
For each conditional term in the summation, the decoder RNN updates its hidden state by
$$h_{t'} = \phi\left(h_{t'-1}, e_y(y_{t'-1}), c_{t'}\right), \tag{1}$$
where $e_y$ is the continuous embedding of a target symbol. $c_{t'}$ is a context vector computed by a soft-alignment mechanism:
$$c_{t'} = f_{\text{align}}\left(e_y(y_{t'-1}), h_{t'-1}, C\right). \tag{2}$$
The soft-alignment mechanism $f_{\text{align}}$ weights each vector in the context set $C$ according to its relevance given what has been translated. The weight of each vector $z_t$ is computed by
$$\alpha_{t,t'} = \frac{1}{Z} e^{f_{\text{score}}\left(e_y(y_{t'-1}),\, h_{t'-1},\, z_t\right)}, \tag{3}$$
where $f_{\text{score}}$ is a parametric function returning an unnormalized score for $z_t$ given $h_{t'-1}$ and $y_{t'-1}$. We use a feedforward network with a single hidden layer in this paper. $Z$ is a normalization constant:
$$Z = \sum_{k=1}^{T_x} e^{f_{\text{score}}\left(e_y(y_{t'-1}),\, h_{t'-1},\, z_k\right)}.$$
This procedure can be understood as computing the alignment probability between the $t'$-th target symbol and the $t$-th source symbol.
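To make the soft-alignment mechanism concrete, the following NumPy sketch computes the weights of Eq. (3) and the context vector of Eq. (2) for one decoding step. The single-hidden-layer scoring network matches the description above, but the weight matrices W, U, V, w and their shapes are illustrative assumptions, not the trained system's parameters.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax; the division by Z in Eq. (3)."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def f_score(e_y_prev, h_prev, z_t, W, U, V, w):
    """Single-hidden-layer feedforward scoring network (Eq. (3));
    returns an unnormalized relevance score for context vector z_t."""
    return w @ np.tanh(W @ e_y_prev + U @ h_prev + V @ z_t)

def soft_alignment(e_y_prev, h_prev, C, params):
    """Attention weights alpha over the context set C and the
    resulting context vector c_{t'} (Eq. (2))."""
    scores = np.array([f_score(e_y_prev, h_prev, z, *params) for z in C])
    alpha = softmax(scores)
    c = alpha @ np.stack(C)  # weighted sum of the z_t vectors
    return alpha, c
```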
The hidden state $h_{t'}$, together with the previous target symbol $y_{t'-1}$ and the context vector $c_{t'}$, is fed into a feedforward neural network to result in the conditional distribution:
$$p(y_{t'} \mid y_{<t'}, X) \propto e^{f_{\text{out}}^{y_{t'}}\left(e_y(y_{t'-1}),\, h_{t'},\, c_{t'}\right)}. \tag{4}$$
The whole network, consisting of the encoder, decoder and soft-alignment mechanism, is then tuned end-to-end to minimize the negative log-likelihood using stochastic gradient descent. In our system, the source sentence X is a sequence of subword tokens extracted by byte-pair encoding (BPE) (Sennrich et al., 2015), and the target sentence Y is represented as a sequence of characters.
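Continuing the sketch above (and reusing its softmax helper), a hedged reading of Eq. (4) and the training objective: a readout network produces unnormalized logits over the character vocabulary, and the training loss is the sum over steps of the negative log-probability of the reference symbol. Weight shapes are again illustrative assumptions.

```python
def f_out(e_y_prev, h_t, c_t, Wo, Uo, Vo, Ko):
    """Feedforward readout with one tanh hidden layer, producing
    unnormalized logits over the character vocabulary (Eq. (4))."""
    return Ko @ np.tanh(Wo @ e_y_prev + Uo @ h_t + Vo @ c_t)

def step_nll(y_t, e_y_prev, h_t, c_t, out_params):
    """One term of the negative log-likelihood -log p(Y | X):
    the loss for the reference symbol y_t at a single step."""
    probs = softmax(f_out(e_y_prev, h_t, c_t, *out_params))
    return -np.log(probs[y_t])
```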

Experimental Settings
In this section, we describe the details of the experimental settings for our system.

Corpora and Preprocessing
We use all available parallel training corpora for four language pairs from WMT'16: En-Cs, En-De, En-Ru and En-Fi. They consist of 63.5M, 4.5M, 2.3M and 2M sentence pairs, respectively. We do not use any monolingual corpora. We only use sentence pairs in which the source side is at most 50 subword symbols long and the target side is at most 500 characters long, as in the filter sketched below. For all pairs other than En-Fi, we use newstest-2013 as a development set, and for En-Fi, we use newsdev-2015.
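A minimal sketch of the length filter, assuming a whitespace-tokenized BPE source file and a raw character-level target file; the file names in the usage comment are hypothetical.

```python
def keep_pair(src_bpe_line, tgt_line, max_src=50, max_tgt=500):
    """Keep a sentence pair only if the BPE source has at most 50
    subword symbols and the raw target at most 500 characters."""
    return (len(src_bpe_line.split()) <= max_src
            and len(tgt_line.rstrip("\n")) <= max_tgt)

# Hypothetical usage over parallel files:
# with open("train.bpe.en") as src, open("train.cs") as tgt:
#     pairs = [(s, t) for s, t in zip(src, tgt) if keep_pair(s, t)]
```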
All source corpora were preprocessed using BPE (Sennrich et al., 2015); the target corpora require no additional preprocessing. The target vocabulary consists of 300 characters plus two additional tokens reserved for EOS and UNK. For the source vocabulary, we limit the number of BPE symbols to 30,000.
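The target-side vocabulary construction could look like the following sketch: keep the 300 most frequent characters and reserve two tokens for EOS and UNK. The token spellings and index assignments here are assumptions for illustration.

```python
from collections import Counter

def build_char_vocab(target_lines, max_chars=300):
    """Map the 300 most frequent characters to indices, with two
    reserved tokens for EOS and UNK (indices are assumptions)."""
    counts = Counter(ch for line in target_lines for ch in line.rstrip("\n"))
    vocab = {"<eos>": 0, "<unk>": 1}
    for ch, _ in counts.most_common(max_chars):
        vocab[ch] = len(vocab)
    return vocab

def encode_target(line, vocab):
    """Encode a target sentence as character indices, ending with EOS;
    unseen characters map to UNK."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in line.rstrip("\n")]
    return ids + [vocab["<eos>"]]
```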

Models and Training
We use gated recurrent units (GRUs) (Cho et al., 2014) for the recurrent neural networks. The encoder has 512 hidden units for each direction (forward and reverse), and the decoder has two hidden layers with 1024 units each. The embedding layers of both the source and target sides have a dimensionality of 512 without any non-linearity. Both f_out and f_score are feedforward neural networks with an intermediate hidden layer of 512 tanh units. We train the model using stochastic gradient descent with Adam (Kingma and Ba, 2014), using the default hyperparameters introduced in the paper. Each update is computed using a minibatch of 128 sentence pairs. The norm of the gradient is rescaled with a threshold of 1 (Pascanu et al., 2013). We set the initial learning rate to 0.0001.

Ensembles
We build an ensemble model using eight independent neural machine translation models initialized with different parameters. We decode from the ensemble by taking the average of the output probabilities at each step, as sketched below.
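A sketch of one ensemble decoding step under the averaging scheme just described. The `step_probs` per-model interface, which would return the distribution of Eq. (4) as a vector, is a hypothetical name introduced here for illustration.

```python
import numpy as np

def ensemble_step(models, states, y_prev):
    """Average the per-step output distributions of eight independently
    trained models (hypothetical `step_probs` interface assumed)."""
    probs = [m.step_probs(s, y_prev) for m, s in zip(models, states)]
    return np.mean(probs, axis=0)  # averaged distribution over characters
```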
Decoding Speed of the Character-Level Decoder
We evaluate the decoding speed of the character-level decoder and compare it with a subword-level decoder on the newstest-2013 corpus (En-De) with a single Titan X GPU. The subword-level decoder generates 31.9 words per second, and the character-level decoder generates 27.5 words per second. Note that this is evaluated in an online setting, where only one sentence is translated at a time; translating in a batch setting could differ from these results.

Experimental Results
The results of the NYU-MILA system are presented in Table 1. The character-level decoding approach works well on most of the language pairs tested, achieving BLEU-c scores comparable to other approaches that use words or subwords (BPE) as tokens. Note that our system does not incorporate any extra monolingual training corpora and does not include any kind of postprocessing, e.g., reranking.

Conclusion
We present the NYU-MILA neural machine translation system for WMT'16, which has a character-level decoder on the target side. Our results show that a character-level decoder can perform comparably to state-of-the-art systems. The NYU-MILA neural machine translation system achieved second rank in En-Cs and En-Fi (constrained systems only) and third rank in En-De. To the best of our knowledge, the NYU-MILA system may be the only submitted system that directly generates characters instead of words or subwords. The biggest advantage of the character-level decoding approach is that the machine translation system no longer requires any explicit segmentation step as preprocessing.