Neural Hidden Markov Model for Machine Translation

Attention-based neural machine translation (NMT) models selectively focus on specific source positions to produce a translation, which brings significant improvements over pure encoder-decoder sequence-to-sequence models. This work investigates NMT while replacing the attention component. We study a neural hidden Markov model (HMM) consisting of neural network-based alignment and lexicon models, which are trained jointly using the forward-backward algorithm. We show that the attention component can be effectively replaced by the neural network alignment model and the neural HMM approach is able to provide comparable performance with the state-of-the-art attention-based models on the WMT 2017 German↔English and Chinese→English translation tasks.


Introduction
Attention-based neural translation models (Bahdanau et al., 2015; Luong et al., 2015) attend to specific positions on the source side to generate a translation. Using the attention component provides significant improvements over the pure encoder-decoder sequence-to-sequence approach (Sutskever et al., 2014), which uses no such attention mechanism. In this work, we aim to compare the performance of attention-based models to another baseline, namely, neural hidden Markov models.
The neural HMM has been successfully applied in the literature on top of conventional phrase-based systems (Wang et al., 2017). In this work, our purpose is to explore its application in standalone decoding, i.e. the model is used to generate and score candidates without assistance from a phrase-based system. Because translation is done standalone using only neural models, we still refer to this as NMT. In addition, while Wang et al. (2017) applied feedforward networks to model alignment and translation, the recurrent structures proposed in this work surpass the feedforward variants by up to 1.3% in BLEU.
By comparing neural HMM and attention-based NMT, we shed light on the role of the attention component. To this end, we use an alignment-based model that has a recurrent bidirectional encoder and a recurrent decoder, but uses no attention component. We replace the attention mechanism by a first-order HMM alignment model. Attention levels are deterministic normalized similarity scores that are part of the architecture design of an otherwise fully supervised classifier. HMM-style alignments, on the other hand, are discrete random variables and (unlike attention levels) must be marginalized. Once alignments are marginalized, which is tractable for a first-order HMM, parameters can be estimated to attain a local optimum of the log-likelihood of the observations as usual.
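The contrast between deterministic attention weights and marginalized alignments can be illustrated with a toy NumPy sketch. This is not the architecture of either system compared in this work: the dot-product scoring, the single linear output layer, and all names (`query`, `enc_states`, `out_proj`, `align_probs`) are simplifying assumptions chosen only to show where the sum over source positions happens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_word_dist(query, enc_states, out_proj):
    """Attention (toy version): mix the encoder states first, then predict once.

    query:      (D,)   decoder state (hypothetical)
    enc_states: (J, D) one vector per source position
    out_proj:   (V, D) output projection (hypothetical)
    """
    weights = softmax(enc_states @ query)   # (J,) deterministic similarity scores
    context = weights @ enc_states          # (D,) weighted sum of encoder states
    return softmax(out_proj @ context)      # (V,) a single word distribution

def marginalized_word_dist(align_probs, enc_states, out_proj):
    """HMM-style (toy version): predict per source position, then sum out the alignment.

    align_probs: (J,) probability of each source position being aligned
    """
    per_position = softmax(enc_states @ out_proj.T, axis=1)  # (J, V): p(e | j)
    return align_probs @ per_position                        # (V,): sum_j p(j) * p(e | j)
```

In the first function the source positions are combined inside the network before a single prediction is made; in the second, each position yields its own word distribution and the alignment variable is summed out afterwards.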

Motivation
In attention-based approaches, the alignment distribution is used to select the positions in the source sentence that the decoder attends to during translation. Thus the alignment model can be considered an implicit part of the translation model. On the other hand, separating the alignment model from the lexicon model has its own advantages. First, it leads to more flexibility in modeling and training: the models can not only be trained separately, but they can also have different model types, such as neural models, count-based models, etc. Second, the separation avoids propagating errors from one model to another. In attention-based systems, the translation score is based on the alignment distribution, so errors can be propagated from the alignment part to the translation part. Third, a probabilistic treatment of alignments in NMT typically implies an extended degree of interpretability (e.g. one can inspect posteriors) and control over the model (e.g. one can impose priors over alignments and lexical distributions).

Neural Hidden Markov Model
Given a source sentence $f_1^J = f_1 \ldots f_j \ldots f_J$ and a target sentence $e_1^I = e_1 \ldots e_i \ldots e_I$, where $j = b_i$ is the source position aligned to the target position $i$, we model translation using an alignment model and a lexicon model:
$$p(e_1^I \mid f_1^J) = \sum_{b_1^I} \prod_{i=1}^{I} \underbrace{p(e_i \mid b_1^i, e_0^{i-1}, f_1^J)}_{\text{lexicon model}} \cdot \underbrace{p(b_i \mid b_1^{i-1}, e_0^{i-1}, f_1^J)}_{\text{alignment model}}$$
Instead of predicting the absolute source position $b_i$, we use an alignment model that predicts the jump $\Delta_i = b_i - b_{i-1}$.
Wang et al. (2017) applied feedforward neural networks for modeling the lexicon and alignment probabilities. In this work, we model these distributions using recurrent neural networks (RNNs). RNNs have been shown to outperform feedforward variants in language and translation modeling, mainly because an RNN can handle arbitrary input lengths and thus include unbounded context information. Unfortunately, a recurrent hidden layer cannot easily be applied in the neural hidden Markov model, since it would significantly complicate the computation of forward-backward messages when running Baum-Welch. Nevertheless, we can apply the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) structure to the source and target word embeddings. With this technique we keep the essence of the LSTM RNN without breaking any sequential generative model assumptions.
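Our models are close in structure to the model proposed in Luong et al. (2015), where we have a component that encodes the source sentence, and another that encodes the target sentence. As shown in Figure 1, we use a source-side bidirectional LSTM embedding $h_j$ and a target-side LSTM embedding $s_{i-1}$. Here $h_j$, $s_{i-1}$ and $s_{i-2}$ are vectors, and $W$, $V$ and $U$ are weight matrices. Before the non-linear hidden layers, there is a projection layer which concatenates $h_j$, $s_{i-1}$ and $e_{i-1}$. Then the neural network-based lexicon model is given by
$$p(e_i \mid b_1^i, e_0^{i-1}, f_1^J) := p(e_i \mid h_j, s_{i-1}, e_{i-1})$$
and the neural network-based alignment model by
$$p(b_i \mid b_1^{i-1}, e_0^{i-1}, f_1^J) := p(\Delta_i \mid h_{j'}, s_{i-1}, e_{i-1}),$$
where $j = b_i$, $j' = b_{i-1}$ and $\Delta_i = j - j'$ is the jump between consecutive aligned positions.
The training criterion is the logarithm of sentence posterior probabilities over training sentence pairs $(F_r, E_r)$, $r = 1, \ldots, R$:
$$\hat{\theta} = \arg\max_{\theta} \sum_{r=1}^{R} \log p_\theta(E_r \mid F_r)$$
The derivative for a single sentence pair $(F, E) = (f_1^J, e_1^I)$ is
$$\frac{\partial}{\partial\theta} \log p_\theta(E \mid F) = \sum_{j', j} \sum_{i} p_i(j', j \mid f_1^J, e_1^I; \theta) \cdot \frac{\partial}{\partial\theta} \log p(j, e_i \mid j', e_0^{i-1}, f_1^J; \theta)$$
with HMM posterior weights $p_i(j', j \mid f_1^J, e_1^I; \theta)$, which can be computed using the forward-backward algorithm.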
The entire training procedure can be summarized as backpropagation in an EM framework:
1. compute:
   • the posterior HMM weights
   • the local gradients (backpropagation)
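The posterior HMM weights can be obtained with a standard forward-backward pass. The following is a minimal NumPy sketch under stated assumptions, not the implementation used in this work: `lex[i, j]` stands in for the lexicon probability of the target word at position i given source position j, `jump[i, jp, j]` for the alignment probability of moving from source position jp to j, and `init` for a distribution over the first aligned position. All three names and the uniform treatment of the first position are hypothetical, and the sketch works in probability space without scaling, so it is only suitable for short sentences.

```python
import numpy as np

def hmm_posteriors(lex, jump, init):
    """Forward-backward for a first-order alignment HMM.

    lex:  (I, J) array, lex[i, j] ~ p(e_i | source position j)
    jump: (I, J, J) array, jump[i, jp, j] ~ p(j | jp) at target position i
          (jump[0] is unused; init plays its role)
    init: (J,) array, distribution over the first aligned position (assumption)
    Returns (likelihood, gamma, xi) with
      gamma[i, j]   = p(b_i = j | e, f)
      xi[i, jp, j]  = p(b_{i-1} = jp, b_i = j | e, f) for i >= 1
    """
    I, J = lex.shape
    alpha = np.zeros((I, J))
    beta = np.ones((I, J))

    # Forward pass: alpha[i, j] = p(e_0..e_i, b_i = j)
    alpha[0] = init * lex[0]
    for i in range(1, I):
        alpha[i] = (alpha[i - 1] @ jump[i]) * lex[i]

    # Backward pass: beta[i, j] = p(e_{i+1}..e_{I-1} | b_i = j)
    for i in range(I - 2, -1, -1):
        beta[i] = jump[i + 1] @ (lex[i + 1] * beta[i + 1])

    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood

    # Pairwise posteriors: the weights that multiply the local gradients
    xi = np.zeros((I, J, J))
    for i in range(1, I):
        xi[i] = (alpha[i - 1][:, None] * jump[i]
                 * (lex[i] * beta[i])[None, :]) / likelihood
    return likelihood, gamma, xi
```

In a training loop, `xi` would be treated as constant weights on the log-probabilities of the lexicon and alignment networks, so that ordinary backpropagation yields the gradient of the sentence log-likelihood.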

Decoding
In the decoding stage we still calculate the sum over alignments and apply a target-synchronous beam search for the target string. The auxiliary quantity for each unknown partial string $e_0^i$ is specified as $Q(i, j; e_0^i)$. During search, the partial hypothesis is extended from $e_0^{i-1}$ to $e_0^i$:
$$Q(i, j; e_0^i) = \sum_{j'} p(j, e_i \mid j', e_0^{i-1}, f_1^J) \cdot Q(i-1, j'; e_0^{i-1})$$
The decoder is shown in Algorithm 1. In the innermost loop (lines 11-13), alignments are hypothesized and used to calculate the auxiliary quantity $Q(i, j; e_0^i)$. Then for each source position $j$, the lexical distribution over the full target vocabulary is computed (line 14). The distributions are accumulated ($Q(i; e_0^i) = \sum_j Q(i, j; e_0^i)$, line 16), then sorted (line 18), and the best candidate translations ($\arg\max_{e_i} Q(i; e_0^i)$) lying within the beam are used to expand the partial hypotheses (lines 19-23). cache is a two-dimensional list of size $J \times |V_{src}|$ (source vocabulary size), which is used to cache the current quantities.
Whenever a partial hypothesis in the beam ends with the sentence end symbol (<EOF>), the counter is increased by 1 (lines 26-28). The translation is terminated if the counter reaches the beam size or the hypothesis sentence length reaches three times the source sentence length (line 6). If a hypothesis stops but its score is worse than that of other hypotheses, it is eliminated from the beam, but it still competes with non-terminated hypotheses. During comparison the scores are normalized by hypothesis sentence length. Note that we have no explicit coverage constraints. This means that a source position can be revisited many times, thereby creating one-to-many alignment cases. This also allows unaligned source words.
In the neural HMM decoder, word alignments are estimated and scored according to the distribution calculated by the neural network alignment model, so alignment decisions become part of the beam search. The search space consists of both alignment and translation decisions. In contrast, the search space in attention-based decoding consists only of translation decisions.
The decoding complexity is $O(J^2 \cdot I)$ ($J$ = source sentence length, $I$ = target sentence length) compared to $O(J \cdot I)$ for attention-based models. These are theoretical complexities of decoding on a CPU, considering only source and target sentence lengths. In practice, the size of the neural network must also be taken into account, and there are some optimized matrix multiplications for decoding on a GPU. Overall, decoding with our model is about 3 times slower than with a standard attention model (1.07 sentences per second vs. 3.00 sentences per second) on a single GPU. This is still an initial decoder, and we have not yet spent much time on accelerating it. Optimizing our decoder is promising future work.
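A single extension step of the auxiliary quantity can be sketched as follows. This is a simplified illustration for one partial hypothesis, not Algorithm 1 itself: the caching, the loop over all hypotheses in the beam, and the network calls are omitted, and the argument names (`q_prev`, `jump`, `lex_dist`) are hypothetical stand-ins for the quantities described in the text.

```python
import numpy as np

def extend_hypothesis(q_prev, jump, lex_dist, beam_size):
    """One target-synchronous extension step for a single partial hypothesis.

    q_prev:   (J,) array, Q(i-1, j'; e_0^{i-1}) for every source position j'
    jump:     (J, J) array, jump[jp, j] ~ alignment probability of moving jp -> j
    lex_dist: (J, V) array, lex_dist[j, e] ~ lexical probability of word e at position j
    Returns a list of (word_id, score, q_new) for the beam_size best words,
    where q_new is the (J,) vector Q(i, j; e_0^i) for that word.
    """
    # Sum over previous positions j': hypothesize alignments
    reach = q_prev @ jump                      # (J,)
    # Q(i, j; e_0^i) for every candidate word e_i at once
    q_all = reach[:, None] * lex_dist          # (J, V)
    # Accumulate over source positions: Q(i; e_0^i) = sum_j Q(i, j; e_0^i)
    scores = q_all.sum(axis=0)                 # (V,)
    # Sort and keep the best candidates within the beam
    best = np.argsort(-scores)[:beam_size]
    return [(int(e), float(scores[e]), q_all[:, e]) for e in best]
```

Each returned `q_new` vector would serve as `q_prev` for the next target position, which is how the sum over alignments is carried along with the hypothesis.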
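The termination test and the length-normalized comparison can be written down in a few lines. This is a sketch under assumptions: the text states only that scores are normalized by hypothesis length, so dividing the log-probability by the number of target tokens is one plausible reading, and the function and parameter names are hypothetical.

```python
import math

def normalized_score(prob, length):
    """Length-normalized score (assumption: log-probability per target token)."""
    return math.log(prob) / max(length, 1)

def should_stop(num_finished, beam_size, hyp_length, source_length, max_ratio=3):
    """Stop when enough hypotheses have finished, or when the hypothesis
    length has reached max_ratio times the source sentence length."""
    return num_finished >= beam_size or hyp_length >= max_ratio * source_length
```

Under this reading, a longer hypothesis with a lower raw probability can still outrank a shorter one, which is the purpose of normalizing before finished and unfinished hypotheses are compared.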
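To put the two complexity figures side by side, consider a worked example with an assumed sentence pair of length $J = I = 30$ (these lengths are illustrative only and do not come from the experiments):

```latex
\begin{align*}
\text{neural HMM:}\quad & J^2 \cdot I = 30^2 \cdot 30 = 27{,}000 \\
\text{attention:}\quad & J \cdot I = 30 \cdot 30 = 900 \\
\text{theoretical ratio:}\quad & \frac{J^2 \cdot I}{J \cdot I} = J = 30 \\
\text{measured ratio:}\quad & \frac{3.00}{1.07} \approx 2.8
\end{align*}
```

The measured slowdown is much smaller than the theoretical factor $J$, consistent with the remark that network size and GPU matrix multiplications dominate in practice.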

Experiments
The experiments are conducted on the WMT 2017 German↔English and Chinese→English translation tasks, which consist of 5M and 23M parallel sentence pairs respectively. Translation quality is measured with case-sensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) on newstest 2017, which contains 3004 (German↔English) and 2001 (Chinese→English) sentence pairs.
For German and English preprocessing, we use the Moses tokenizer with hyphen splitting, and perform truecasing with Moses scripts (Koehn et al., 2007). For German↔English subword segmentation, we use 20K joint BPE operations. For the Chinese data, we segment it using the Jieba¹ segmenter. We then learn a BPE model on the segmented Chinese, also using 20K merge operations. During training, sentences with a length greater than 50 subwords are filtered out.

Attention-Based System
The attention-based systems are trained with Sockeye (Hieber et al., 2017), which implements an attentional encoder-decoder with small modifications to the model in Bahdanau et al. (2015). The encoder and decoder word embeddings are of size 620. The encoder consists of a bidirectional layer with 1000 LSTMs with peephole connections to encode the source side. We use Adam (Kingma and Ba, 2015) as optimizer with a learning rate of 0.001, and a batch size of 50. The network is trained with 30% dropout for up to 500K iterations and evaluated every 10K iterations on the development set with BLEU. Decoding is done using beam search with a beam size of 12. In the end the four best models are averaged as described in the beginning of Junczys-Dowmunt et al. (2016).
¹ https://github.com/fxsjy/jieba

Neural Hidden Markov Model
The entire neural hidden Markov model is implemented in TensorFlow (Abadi et al., 2016). The feedforward models have three hidden layers of sizes 1000, 1000 and 500 respectively, with a 5-word source window and a 3-gram target history. 200 nodes are used for word embeddings.
The output layer of the neural lexicon model consists of around 25K nodes for all subword units, while the neural alignment model has a small output layer with 201 nodes, which reflects that the aligned position can jump within the scope from −100 to 100.
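The 201 output nodes correspond to one class per jump width from −100 to 100. A minimal sketch of that mapping is given below; the helper names are hypothetical, and clipping jumps that fall outside the range is an assumption, since the text does not say how such jumps are handled.

```python
MAX_JUMP = 100  # from the text: jumps range from -100 to 100

def jump_to_class(prev_pos, cur_pos, max_jump=MAX_JUMP):
    """Map the jump cur_pos - prev_pos to an output index in [0, 2 * max_jump].

    Out-of-range jumps are clipped to the nearest boundary (assumption).
    """
    delta = cur_pos - prev_pos
    delta = max(-max_jump, min(max_jump, delta))
    return delta + max_jump

def class_to_jump(index, max_jump=MAX_JUMP):
    """Inverse mapping: output index back to a signed jump width."""
    return index - max_jump
```

With training sentences limited to 50 subwords, all jumps seen in training fall well inside this range, so the clipping branch would only matter for longer test sentences.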
Apart from the basic projection layer, we also apply LSTM layers to the source and target word embeddings. The embedding layers have 350 nodes and the size of the projection layer is 800 (400 + 200 + 200, Figure 1). We use Adam as optimizer with a learning rate of 0.001. Neural lexicon and alignment models are trained with 30% dropout, and the norm of the gradient is clipped with a threshold of 1 (Pascanu et al., 2014). In decoding we use a beam size of 12, and the element-wise average of all weights of the four best models also results in better performance.
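Gradient-norm clipping and element-wise checkpoint averaging can both be sketched in a few lines of NumPy. This is an illustration rather than the TensorFlow code used for the experiments: clipping by the global norm over all parameters is an assumption (the text gives only the threshold), and representing a checkpoint as a dictionary from parameter name to array is a hypothetical simplification.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most threshold.

    Returns (clipped_grads, original_norm). Using the global norm over all
    parameters is an assumption; per-tensor clipping is another possibility.
    """
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

def average_checkpoints(checkpoints):
    """Element-wise average of several checkpoints.

    checkpoints: list of dicts mapping parameter name -> np.ndarray,
                 all with identical keys and shapes.
    """
    names = checkpoints[0].keys()
    return {n: np.mean([c[n] for c in checkpoints], axis=0) for n in names}
```

For the setup described here, `average_checkpoints` would be called with the four best models selected on the development set.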

Results
We compare the neural HMM approach (Subsection 5.2) with the state-of-the-art attention-based approach (Subsection 5.1) on different translation tasks. The results are presented in Table 1. Compared to the model presented in Wang et al. (2017), switching to LSTM models has a clear advantage, improving the FFNN-based system by up to 1.3% BLEU and 1.8% TER. It seems that the HMM model benefits from richer features, such as LSTM states, which are very similar to what an attention mechanism would require. We actually expected it to do with less, the reason being that alignment distributions get refined a posteriori and so they do not have to be as strong a priori. We can also observe that the performance of our approach is comparable with the state-of-the-art attention-based system with 25M more parameters on all three tasks.

Alignment Analysis
We show an example from the German→English newstest 2017 in Figure 2, along with the attention and alignment matrices. We can observe that the neural network-based HMM generates a clearer alignment path than the attention weights. In this example, it estimates the alignment positions for the words "wanted" and "of" exactly.

Discussion
We described a novel formulation for a neural network-based machine translation system, which applies neural networks to the conventional hidden Markov model. The training is end-to-end, and the model is monolithic and can be used as a standalone decoder. This results in a more modern and efficient way to use HMMs in machine translation and enables neural networks to benefit from HMMs. Experiments show that replacing attention with alignment does not improve the translation performance of NMT significantly. One possible reason is that alignment may fail to capture relevant contexts as attention does. While alignment aims to identify translation equivalents between two languages, attention is designed to find relevant context for predicting the next target word. Source words with high attention weights are not necessarily translation equivalents of the target word. Although using alignment does not lead to significant improvements in terms of BLEU over attention, we think alignment-based NMT models are still useful for automatic post-editing and for developing coverage-based models. These might be interesting future directions to explore.