A Comparison between Count and Neural Network Models Based on Joint Translation and Reordering Sequences

We propose a conversion of bilingual sentence pairs and the corresponding word alignments into novel linear sequences. These are uniquely defined joint translation and reordering (JTR) sequences, combining interdepending lexical and alignment dependencies on the word level into a single framework. They are constructed in a simple manner while capturing multiple alignments and empty words. JTR sequences can be used to train a variety of models. We investigate the performance of n-gram models with modified Kneser-Ney smoothing, feed-forward and recurrent neural network architectures when estimated on JTR sequences, and compare them to the operation sequence model (Durrani et al., 2013b). Evaluations on the IWSLT German→English, WMT German→English and BOLT Chinese→English tasks show that JTR models improve state-of-the-art phrase-based systems by up to 2.2 BLEU.


Introduction
Standard phrase-based machine translation (Och et al., 1999; Zens et al., 2002; Koehn et al., 2003) uses relative frequencies of phrase pairs to estimate a translation model. The phrase table is extracted from a bilingual text aligned on the word level, using e.g. GIZA++ (Och and Ney, 2003). Although the phrase pairs capture internal dependencies between the source and target phrases aligned to each other, they fail to model dependencies that extend beyond phrase boundaries. Phrase-based decoding involves concatenating target phrases. The burden of ensuring that the result is linguistically consistent falls on the language model (LM).
This work proposes word-based translation models that are potentially capable of capturing long-range dependencies. We do this in two steps: First, given bilingual sentence pairs and the associated word alignments, we convert the information into uniquely defined linear sequences. These sequences encode both word reordering and translation information. Thus, they are referred to as joint translation and reordering (JTR) sequences. Second, we train an n-gram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) on the resulting JTR sequences. This yields a model that fuses interdepending reordering and translation dependencies into a single framework.
Although JTR n-gram models are closely related to the operation sequence model (OSM) (Durrani et al., 2013b), there are three main differences. To begin with, the OSM employs minimal translation units (MTUs), which are essentially atomic phrases. As the MTUs are extracted sentence-wise, a word can potentially appear in multiple MTUs. In order to avoid overlapping translation units, we define the JTR sequences on the level of words. Consequently, JTR sequences have smaller vocabulary sizes than OSM sequences and lead to models with less sparsity. Moreover, we argue that JTR sequences offer a simpler reordering approach than operation sequences, as they handle reorderings without the need to predict gaps. Finally, when used as an additional model in the log-linear framework of phrase-based decoding, an n-gram model trained on JTR sequences introduces only one single feature to be tuned, whereas the OSM additionally uses 4 supportive features (Durrani et al., 2013b). Experimental results confirm that this simplification does not make JTR models less expressive, as their performance is on par with the OSM.
Due to data sparsity, increasing the n-gram order of count-based models beyond a certain point becomes useless. To address this, we resort to neural networks (NNs), as they have been successfully applied to machine translation recently (Sundermeyer et al., 2014; Devlin et al., 2014). They are able to score any word combination without requiring additional smoothing techniques. We experiment with feed-forward and recurrent translation networks, benefiting from their smoothing capabilities. To this end, we split the linear sequence into two sequences for the neural translation models to operate on. This is possible due to the simplicity of the JTR sequence. We show that the count and NN models perform well on their own, and that combining them yields even better results.
In this work, we apply n-gram models with modified Kneser-Ney smoothing during phrase-based decoding and neural JTR models in rescoring. However, using a phrase-based system is not required by the model; it is only the initial step to demonstrate the strength of JTR models, which can be applied independently of the underlying decoding framework. While the focus of this work is on the development and comparison of the models, the long-term goal is to decode using JTR models without the limitations introduced by phrases, in order to exploit the full potential of JTR models. The JTR models are estimated on word alignments, which we obtain using GIZA++ in this paper. The future aim is to also generate improved word alignments by a joint optimization of both the alignments and the models, similar to the training of IBM models (Brown et al., 1990; Brown et al., 1993). In the long run, we intend to achieve consistency between decoding and training using the introduced JTR models.

Previous Work
In order to address the downsides of the phrase translation model, various approaches have been taken. Mariño et al. (2006) proposed a bilingual language model (BILM) that operates on bilingual n-grams, with its own n-gram decoder requiring monotone alignments. The lexical reordering model introduced in (Tillmann, 2004) was integrated into phrase-based decoding. Crego and Yvon (2010) adapted the approach to BILMs. The bilingual n-grams are further advanced in (Niehues et al., 2011), where they operate on non-monotone alignments within a phrase-based translation framework. Compared to our JTR models, their BILMs treat jointly aligned source words as minimal translation units, ignore unaligned source words and do not include reordering information. Durrani et al. (2011) developed the OSM, which combined dependencies on bilingual word pairs and reordering information into a single framework. It used its own decoder that was based on n-grams of MTUs and predicted single translation or reordering operations. This was further advanced in (Durrani et al., 2013a) by a decoder that was capable of predicting whole sequences of MTUs, similar to a phrase-based decoder. In (Durrani et al., 2013b), a slightly enhanced version of OSM was integrated into the log-linear framework of the Moses system (Koehn et al., 2007). Both the BILM (Stewart et al., 2014) and the OSM (Durrani et al., 2014) can be smoothed using word classes. Guta et al. (2015) introduced the extended translation model (ETM), which operates on the word level and augments the IBM models by an additional bilingual word pair and a reordering operation. It is implemented into the log-linear framework of a phrase-based decoder and shown to be competitive with a 7-gram OSM.
The JTR n-gram models proposed within this work can be seen as an extension of the ETM. Nevertheless, JTR models utilize linear sequences of dependencies and combine the translation of bilingual word pairs and reorderings into a single model. The ETM, however, features separate models for the translation of individual words and reorderings and provides an explicit treatment of multiple alignments. As they operate on linear sequences, JTR count models can be implemented using existing toolkits for n-gram language models, e.g. the KenLM toolkit (Heafield et al., 2013).
An HMM approach for word-to-phrase alignments was presented in (Deng and Byrne, 2005), showing performance similar to IBM Model 4 on the task of bitext alignment.  propose several models which rely only on the information provided by the source side and predict reorderings. Contrastingly, JTR models incorporate target information as well and predict both translations and reorderings jointly in a single framework. Zhang et al. (2013) explore different Markov chain orderings for an n-gram model on MTUs in rescoring. Feng and Cohn (2013) present another generative word-based Markov chain translation model which exploits a hierarchical Pitman-Yor process for smoothing, but it is only applied to induce word alignments. Their follow-up work (Feng et al., 2014) introduces a Markov-model on MTUs, similar to the OSM described above.
Recently, neural machine translation has emerged as an alternative to phrase-based decoding, where NNs are used as standalone models to decode source input. In (Sutskever et al., 2014), a recurrent NN was used to encode a source sequence, and output a target sentence once the source sentence was fully encoded in the network. The network did not have any explicit treatment of alignments. Bahdanau et al. (2015) introduced soft alignments as part of the network architecture. In this work, we make use of hard alignments instead, where we encode the alignments in the source and target sequences, requiring no modifications of existing feed-forward and recurrent NN architectures. Our feed-forward models are based on the architectures proposed in (Devlin et al., 2014), while the recurrent models are based on (Sundermeyer et al., 2014). Further recent research on applying NN models for extended context was carried out in (Le et al., 2012;Auli et al., 2013;Hu et al., 2014). All of these works focus on lexical context and ignore the reordering aspect covered in our work.

JTR Sequences
The core idea of this work is the interpretation of a bilingual sentence pair and its word alignment as a linear sequence of $K$ joint translation and reordering (JTR) tokens $g_1^K$. Formally, the sequence $g_1^K(f_1^J, e_1^I, b_1^I)$ is a uniquely defined interpretation of a given source sentence $f_1^J$, its translation $e_1^I$ and the inverted alignment $b_1^I$, where $b_i$ denotes the ordered sequence of source positions $j$ aligned to target position $i$. We drop the explicit mention of $(f_1^J, e_1^I, b_1^I)$ for better readability. Each JTR token is either an aligned bilingual word pair $\langle f, e \rangle$ or a reordering class $\Delta_{j'j}$. Unaligned words on the source and target side are processed as if they were aligned to the empty word $\varepsilon$. Hence, an unaligned source word $f$ generates the token $\langle f, \varepsilon \rangle$, and an unaligned target word $e$ the token $\langle \varepsilon, e \rangle$.
Each word of the source and target sentences is to appear in the corresponding JTR sequence exactly once. For multiply-aligned target words $e$, the first source word $f$ that is aligned to $e$ generates the token $\langle f, e \rangle$. All other source words $f'$ that are also aligned to $e$ are processed as if they were aligned to the artificial word $\sigma$. Thus, each of these $f'$ generates a token $\langle f', \sigma \rangle$. The same approach is applied to multiply-aligned source words. Similar to Feng and Cohn (2013), we classify the reordered source positions $j'$ and $j$ by a reordering class $\Delta_{j'j}$. The reordering classes are illustrated in Figure 1.
Figure 1: Overview of the different reordering classes in JTR sequences.
Algorithm 1 JTR Conversion Algorithm
 2: $g_1^K \leftarrow \emptyset$
 4: $j' \leftarrow 0$ // last visited source position
 6: if $e_i$ is unaligned then
 7:   // align $e_i$ to the empty word $\varepsilon$
 8:   append $\langle \varepsilon, e_i \rangle$ to $g_1^K$
 9:   continue
10: // $e_i$ is aligned to at least one source word
11: $j \leftarrow$ first source position in $b_i$
12: if $j = j'$ then
13:   // $e_i$ is aligned to the same $f_j$ as $e_{i-1}$
14:   append $\langle \sigma, e_i \rangle$ to $g_1^K$
15:   continue
16: if $j \neq j' + 1$ then
17:   // alignment step is non-monotone
19: // 1-to-1 translation: $f_j$ is aligned to $e_i$
20: append $\langle f_j, e_i \rangle$ to $g_1^K$
22: // generate all other $f_j$ that are also
23: // aligned to the current target word $e_i$
24: for all remaining $j$ in $b_i$ do
27: // check last alignment step at sentence end
28: if $j' \neq J$ then
29:   // last alignment step is non-monotone
41: if $j_0 \neq j' + 1$ then
42:   // non-monotone: add reordering class
43:   append the reordering class to $g_1^K$
44: // translate unaligned predecessors by $\varepsilon$
45: for $f \leftarrow f_{j_0}$ to $f_{j-1}$ do
48: // non-monotone: add reordering class
49: append the reordering class to $g_1^K$

Sequence Conversion
Algorithm 1 presents the formal conversion of a bilingual sentence pair and its alignment into the corresponding JTR sequence $g_1^K$. At first, $g_1^K$ is initialized by an empty sequence (line 2). For each target position $i = 1, \dots, I$ it is extended by at least one token. During the generation process, we store the last visited source position $j'$ (line 4). If a target word $e_i$ is
• unaligned, we align it to the empty word $\varepsilon$ and append $\langle \varepsilon, e_i \rangle$ to the current $g_1^K$ (line 8),
• aligned to the same $f_j$ as $e_{i-1}$, we only add $\langle \sigma, e_i \rangle$ (line 14),
• otherwise we append $\langle f_j, e_i \rangle$ (line 20) and
• in case there are more source words aligned to $e_i$, we additionally append $\langle f_j, \sigma \rangle$ for each of these (line 24).
Before a token $\langle f_j, e_i \rangle$ is generated, we have to check whether the alignment step from $j'$ to $j$ is monotone (line 16). In case it is not, we have to deal with reorderings (line 34). We define that a token $\langle f_{j-1}, \varepsilon \rangle$ is to be generated right before the generation of the token containing $f_j$. Thus, if $f_{j-1}$ is not aligned, we first determine the contiguous sequence of unaligned predecessors $f_{j_0}^{j-1}$ (line 38). Next, if the step from $j'$ to $j_0$ is not monotone, we add the corresponding reordering class (line 43). Afterwards we append all tokens $\langle f_{j_0}, \varepsilon \rangle$ to $\langle f_{j-1}, \varepsilon \rangle$. If $f_{j-1}$ is aligned, we do not have to process unaligned source words and only append the corresponding reordering class (line 49).
Figure 2 illustrates the generation steps of a JTR sequence, whose result is presented in Table 1. The alignment steps are denoted by the arrows connecting the alignment points. The first dashed alignment point indicates the ⟨ε, ,⟩ token that is generated right after the ⟨Feld, field⟩ token. The second dashed alignment point indicates the ⟨ein, ε⟩ token, which corresponds to the unaligned source word ein. Note that the ⟨ein, ε⟩ token has to be generated right before ⟨., .⟩ is generated. Therefore, there is no forward jump from ⟨Code, code⟩ to ⟨., .⟩, but a monotone step to ⟨ein, ε⟩ followed by ⟨., .⟩.
Figure 2: This example illustrates the JTR sequence $g_1^K$ for a German→English sentence pair including the word-to-word alignment.
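The conversion steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the string spellings of ε and σ, and in particular the three reordering class names and their boundaries (step backward, jump forward, jump backward) are assumptions made for the example. Word pairs are represented as tuples and reordering classes as plain strings. The sketch does not attempt to handle a source word that is aligned to non-adjacent target words.

```python
from typing import List, Sequence, Set, Tuple, Union

EPS = "<eps>"      # empty word (assumed spelling)
SIGMA = "<sigma>"  # artificial word for additional alignment links (assumed spelling)
Token = Union[Tuple[str, str], str]  # bilingual word pair or reordering class


def reordering_class(j_prev: int, j: int) -> str:
    """Classify a non-monotone step from j_prev to j (assumed class inventory)."""
    if j == j_prev - 1:
        return "<step_backward>"
    if j > j_prev + 1:
        return "<jump_forward>"
    if j < j_prev - 1:
        return "<jump_backward>"
    raise ValueError("monotone or repeated step has no reordering class")


def add_reorderings(f: Sequence[str], aligned: Set[int], g: List[Token],
                    j_prev: int, j: int) -> None:
    """Handle a non-monotone step to source position j (1-based)."""
    if j - 1 >= 1 and (j - 1) not in aligned:
        # Find the start j0 of the contiguous unaligned predecessors of f_j.
        j0 = j - 1
        while j0 - 1 >= 1 and (j0 - 1) not in aligned:
            j0 -= 1
        if j0 != j_prev + 1:
            g.append(reordering_class(j_prev, j0))
        # Translate the unaligned predecessors by the empty word.
        for jj in range(j0, j):
            g.append((f[jj - 1], EPS))
    else:
        g.append(reordering_class(j_prev, j))


def jtr_convert(f: Sequence[str], e: Sequence[str],
                b: Sequence[Sequence[int]]) -> List[Token]:
    """Convert a sentence pair and inverted alignment b (1-based) to JTR tokens."""
    aligned = {j for b_i in b for j in b_i}
    g: List[Token] = []
    j_prev = 0  # last visited source position
    for e_i, b_i in zip(e, b):
        if not b_i:                      # unaligned target word
            g.append((EPS, e_i))
            continue
        j = b_i[0]
        if j == j_prev:                  # same source word as previous target word
            g.append((SIGMA, e_i))
            continue
        if j != j_prev + 1:              # non-monotone alignment step
            add_reorderings(f, aligned, g, j_prev, j)
        g.append((f[j - 1], e_i))
        j_prev = j
        for j in b_i[1:]:                # further source words aligned to e_i
            if j != j_prev + 1:
                add_reorderings(f, aligned, g, j_prev, j)
            g.append((f[j - 1], SIGMA))
            j_prev = j
    if j_prev != len(f):                 # check last step at sentence end
        add_reorderings(f, aligned, g, j_prev, len(f) + 1)
    return g
```

For a toy pair such as "wir kommen später zurück ." and "we come back later ." with inverted alignment `[[1], [2], [4], [3], [5]]`, the sketch yields a forward jump before the pair for "zurück", a backward step before the pair for "später", and another forward jump before the final period.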

Training of Count Models
As the JTR sequence $g_1^K$ is a unique interpretation of a bilingual sentence pair and its alignment, the probability $p(f_1^J, e_1^I, b_1^I)$ can be computed as:
$$p(f_1^J, e_1^I, b_1^I) = p(g_1^K).$$
The probability of $g_1^K$ can be factorized and approximated by an n-gram model:
$$p(g_1^K) = \prod_{k=1}^{K} p(g_k \mid g_1^{k-1}) \approx \prod_{k=1}^{K} p(g_k \mid g_{k-n+1}^{k-1}).$$
Within this work, we first estimate the Viterbi alignment for the bilingual training data using GIZA++ (Och and Ney, 2003). Secondly, the conversion presented in Algorithm 1 is applied to obtain the JTR sequences, on which we estimate an n-gram model with modified Kneser-Ney smoothing as described in (Chen and Goodman, 1998) using the KenLM toolkit (Heafield et al., 2013).
Table 1: The JTR tokens $g_k$ corresponding to Figure 2. The right side shows the source and target tokens $s_k$ and $t_k$ obtained from the JTR tokens $g_k$. They are used for the training of NNs (cf. Section 4).
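To make the n-gram view of JTR sequences concrete, the following sketch counts n-grams over token sequences and scores a sequence with unsmoothed relative frequencies. This is deliberately a simplification: it is not modified Kneser-Ney smoothing and not the KenLM toolkit, and the sentence boundary symbols and function names are assumptions chosen for the example. It only shows that each JTR token acts as one vocabulary item of an ordinary n-gram model.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple

BOS, EOS = "<s>", "</s>"  # assumed sentence boundary symbols


def ngram_counts(sequences: Iterable[Sequence[Hashable]],
                 n: int) -> Tuple[Counter, Counter]:
    """Count n-grams and their (n-1)-token contexts over token sequences."""
    ngrams: Counter = Counter()
    contexts: Counter = Counter()
    for seq in sequences:
        padded = [BOS] * (n - 1) + list(seq) + [EOS]
        for k in range(n - 1, len(padded)):
            ctx = tuple(padded[k - n + 1:k])
            ngrams[ctx + (padded[k],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts


def log_prob(seq: Sequence[Hashable], ngrams: Counter,
             contexts: Counter, n: int) -> float:
    """Unsmoothed log-probability of a sequence; -inf for unseen n-grams."""
    padded = [BOS] * (n - 1) + list(seq) + [EOS]
    total = 0.0
    for k in range(n - 1, len(padded)):
        ctx = tuple(padded[k - n + 1:k])
        count = ngrams[ctx + (padded[k],)]
        if count == 0:
            return float("-inf")  # a smoothed model would back off instead
        total += math.log(count / contexts[ctx])
    return total
```

The `-inf` result for unseen n-grams is exactly the case that smoothing is meant to handle in a count-based model.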

Integration into Phrase-based Decoding
Basically, each phrase table entry is annotated with both the word alignment information, which also makes it possible to identify unaligned source words, and the corresponding JTR sequence. The JTR model is added to the log-linear framework as an additional n-gram model. Within the phrase-based decoder, we extend each search state such that it additionally stores the JTR model history.
In comparison to the OSM, the JTR model does not predict gaps. Local reorderings within phrases are handled implicitly. On the other hand, we represent long-range reorderings between phrases by the coverage vector and limit them by reordering constraints.
Phrase pairs ending with unaligned source words at their right boundary prove to be a problem during decoding. As shown in Subsection 3.1, the conversion from word alignments to JTR sequences assumes that each token corresponding to an unaligned source word is generated immediately before the token corresponding to the closest aligned source position to its right. However, if a phrase ends with an unaligned $f_j$ as its rightmost source word, the generation of the $\langle f_j, \varepsilon \rangle$ token has to be postponed until the next word $f_{j+1}$ is to be translated or, even worse, $f_{j+1}$ has already been translated before.
To address this issue, we constrained the phrase table extraction to discard entries with unaligned source tokens at the right boundary. For IWSLT De→En, this led to a baseline weaker by 0.2 BLEU than the one described in Section 5. In order to have an unconstrained and fair baseline, we thereafter removed this constraint and forced such deletion tokens to be generated at the end of the sequence. Hence, we accept that the JTR model might compute the wrong score in these special cases.

Neural Networks
Usually, smoothing techniques are applied to count-based models to handle unseen events. A neural network does not suffer from this, as it is able to score unseen events without additional smoothing techniques. In the following, we will describe how to adapt JTR sequences to be used with feed-forward and recurrent NNs.
The first thing to notice is the vocabulary size, mainly determined by the number of bilingual word pairs, which constituted atomic units in the count-based models. NNs that compute probability values at the output layer evaluate a softmax function that produces normalized scores that sum up to unity. The softmax function is given by:
$$p(e_i \mid e_1^{i-1}) = \frac{\exp(o_{e_i})}{\sum_{w=1}^{|V|} \exp(o_w)},$$
where $o_{e_i}$ and $o_w$ are the raw unnormalized output layer values for the words $e_i$ and $w$, respectively, and $|V|$ is the vocabulary size. The output layer is a function of the context $e_1^{i-1}$. Computing the denominator is expensive for large vocabularies, as it requires computing the output for all words. Therefore, we split JTR tokens $g_k$ and use individual words as input and output units, such that the NN receives jumps, source and target words as input and outputs target words and jumps. Hence, the resulting neural model is not an LM, but a translation model with different input and output vocabularies. A JTR sequence $g_1^K$ is split into its source and target parts $s_1^K$ and $t_1^K$. The construction of the JTR source sequence $s_1^K$ proceeds as follows: Whenever a bilingual pair is encountered, the source word is kept and the target word is discarded. In addition, all jump classes are replaced by a special token $\delta$. The JTR target sequence $t_1^K$ is constructed similarly by keeping the target words and dropping the source words, while the jump classes are kept as well. Table 1 shows the JTR source and target sequences corresponding to the JTR sequence of Figure 2.
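The split into source and target sequences can be sketched as below. The representation is an assumption made for illustration: bilingual word pairs are tuples `(source, target)`, reordering classes are plain strings, and `"<delta>"` stands in for the special token δ. The empty word and the artificial word σ are simply kept as ordinary symbols on whichever side they occur.

```python
from typing import List, Sequence, Tuple, Union

Token = Union[Tuple[str, str], str]  # word pair, or a reordering class string


def split_jtr(g: Sequence[Token],
              delta: str = "<delta>") -> Tuple[List[str], List[str]]:
    """Split a JTR sequence into its source part s and target part t.

    Both parts have the same length K as the input sequence.
    """
    s: List[str] = []
    t: List[str] = []
    for token in g:
        if isinstance(token, tuple):
            source_word, target_word = token
            s.append(source_word)   # source side keeps the source word
            t.append(target_word)   # target side keeps the target word
        else:
            s.append(delta)         # source side: jump replaced by delta
            t.append(token)         # target side: jump class is kept
    return s, t
```

Because both output sequences stay position-aligned with the original tokens, a network can read $s_k$ and predict $t_k$ at the same index.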
Due to the design of the JTR sequence, producing the source and target JTR sequences is straightforward. The resulting sequences can then be used with existing NN architectures, without further modifications to the design of the networks. This results in powerful models that require little effort to implement.

Feed-forward Neural JTR
First, we will apply a feed-forward NN (FFNN) to the JTR sequence. FFNN models resemble count-based models in using a predefined limited context size, but they do not encounter the same smoothing problems. In this work, we use an FFNN similar to that proposed in (Devlin et al., 2014), defined as:
$$p(t_1^K \mid s_1^K) \approx \prod_{k=1}^{K} p(t_k \mid t_{k-n}^{k-1}, s_{k-n}^{k}). \tag{4}$$
It scores the JTR target word $t_k$ at position $k$ using the current source word $s_k$ and the history of $n$ JTR source words. In addition, the $n$ JTR target words preceding $t_k$ are used as context. The FFNN computes the score by looking up the vector embeddings of the source and target context words, concatenating them, then evaluating the rest of the network. We reduce the output layer to a shortlist of the most frequent words, and compute word class probabilities for the remaining words.
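The context window used by the feed-forward model can be illustrated with a small helper that, for each position $k$, collects the current source word plus $n$ source history words and the $n$ preceding target words. The function name and the padding symbol are assumptions for this sketch; how sentence starts are padded in the actual system is not stated in the text.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[str], List[str], str]  # (source window, target history, output)


def ffnn_inputs(s: Sequence[str], t: Sequence[str], n: int,
                pad: str = "<pad>") -> List[Example]:
    """Build one training example per position from JTR source/target sequences."""
    if len(s) != len(t):
        raise ValueError("JTR source and target sequences must have equal length")
    examples: List[Example] = []
    for k in range(len(t)):
        # n history words plus the current source word: n + 1 inputs.
        source_window = [s[i] if i >= 0 else pad for i in range(k - n, k + 1)]
        # n preceding target words: n inputs.
        target_history = [t[i] if i >= 0 else pad for i in range(k - n, k)]
        examples.append((source_window, target_history, t[k]))
    return examples
```

With `n = 8`, each example has 9 source and 8 target inputs, 17 in total, which would be consistent with a projection layer of 17 × 100 nodes for a 9-gram model.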

Recurrent Neural JTR
Unlike feed-forward NNs, recurrent NNs (RNNs) enable the use of unbounded context. Following (Sundermeyer et al., 2014), we use bidirectional recurrent NNs (BRNNs) to capture the full JTR source side. The BRNN uses the JTR target side as well as the full JTR source side as context, and it is given by:
$$p(t_1^K \mid s_1^K) = \prod_{k=1}^{K} p(t_k \mid t_1^{k-1}, s_1^K). \tag{5}$$
This equation is realized by a network that uses forward and backward recurrent layers to capture the complete source sentence. By a forward layer we mean a recurrent hidden layer that processes a given sequence from left to right, while a backward layer does the processing backwards, from right to left. The source sentence is basically split at a given position $k$, then past and future representations of the sentence are recursively computed by the forward and backward layers, respectively. To include the target side, we provide the forward layer with the target input $t_{k-1}$ as well, that is, we aggregate the embeddings of the input source word $s_k$ and the input target word $t_{k-1}$ before they are fed into the forward layer. Due to recurrency, the forward layer encodes the parts $(t_1^{k-1}, s_1^k)$, and the backward layer encodes $s_k^K$, and together they encode $(t_1^{k-1}, s_1^K)$, which is used to score the output target word $t_k$. For the sake of comparison to FFNN and count models, we also experiment with a recurrent model that does not include future source information. This is obtained by replacing the term $s_1^K$ with $s_1^k$ in Eq. 5. It will be referred to as the unidirectional recurrent neural network (URNN) model in the experiments.
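The following sketch lists, for every position $k$, what the forward and backward layers get to read before $t_k$ is scored. It is only a bookkeeping illustration, not a network: the function name, the start symbol standing in for the target input at the first position, and the `bidirectional` switch are assumptions for the example.

```python
from typing import List, Sequence, Tuple

Context = Tuple[List[Tuple[str, str]], List[str], str]


def brnn_contexts(s: Sequence[str], t: Sequence[str], bos: str = "<s>",
                  bidirectional: bool = True) -> List[Context]:
    """For each position, return (forward inputs, backward inputs, output word).

    Forward inputs are the pairs (s_1, t_0), ..., (s_k, t_{k-1}) read left to
    right; backward inputs are s_K, ..., s_k read right to left.
    """
    if len(s) != len(t):
        raise ValueError("JTR source and target sequences must have equal length")
    contexts: List[Context] = []
    for k in range(len(t)):
        previous_targets = [bos] + list(t[:k])
        forward = list(zip(s[:k + 1], previous_targets))
        # Without the backward layer, no future source words are visible.
        backward = list(reversed(s[k:])) if bidirectional else []
        contexts.append((forward, backward, t[k]))
    return contexts
```

Setting `bidirectional=False` corresponds to the unidirectional variant, where only the source prefix up to position $k$ is available.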
Note that the JTR source and target sides include jump information, therefore, the RNN model described above explicitly models reordering. In contrast, the models proposed in (Sundermeyer et al., 2014) do not include any jumps, and hence do not provide an explicit way of including word reordering. In addition, the JTR RNN models do not require the use of IBM-1 lexica to resolve multiply-aligned words. As discussed in Section 3, these cases are resolved by aligning the multiply-aligned word to the first word on the opposite side.
The integration of the NNs into the decoder is not trivial, due to the dependence on the target context. In the case of RNNs, the context is unbounded, which would affect state recombination and lead to less variety in the beam used to prune the search space. Therefore, the RNN scores are computed using approximations instead (Auli et al., 2013; Alkhouli et al., 2015). In (Alkhouli et al., 2015), it is shown that approximate RNN integration into the phrase-based decoder has only a slight advantage over n-best rescoring. Therefore, we apply RNNs in rescoring in this work, and to allow for a direct comparison between FFNNs and RNNs, we apply FFNNs in rescoring as well.

Evaluation
We perform experiments on the large-scale IWSLT 2013 (Cettolo et al., 2014) German→English, WMT 2015 German→English and the DARPA BOLT Chinese→English tasks. The statistics for the bilingual corpora are shown in Table 2. Word alignments are generated with GIZA++ (Och and Ney, 2003). We use a standard phrase-based translation system (Koehn et al., 2003). The decoding process is implemented as a beam search. All baselines contain phrasal and lexical smoothing models for both directions, word and phrase penalties, a distance-based reordering model, enhanced low frequency features (Chen et al., 2011), a hierarchical reordering model (HRM) (Galley and Manning, 2008), a word class LM (Wuebker et al., 2013) and an n-gram LM. The lexical and phrase translation models of all baseline systems are trained on all provided bilingual data. The log-linear feature weights are tuned with minimum error rate training (MERT) (Och, 2003) on BLEU (Papineni et al., 2001). All systems are evaluated with MultEval (Clark et al., 2011). The reported BLEU scores are averaged over three MERT optimization runs.
All LMs, OSMs and count-based JTR models are estimated with the KenLM toolkit (Heafield et al., 2013). The OSM and the count-based JTR model are implemented in the phrasal decoder. NNs are used only in rescoring. The 9-gram FFNNs are trained with two hidden layers. The short lists contain the 10k most frequent words, and all remaining words are clustered into 1000 word classes. The projection layer has 17 × 100 nodes, the first hidden layer 1000 and the second 500. The RNNs have LSTM architectures. The URNN has 2 hidden layers while the BRNN has one forward, one backward and one additional hidden layer. All layers have 200 nodes, while the output layer is class-factored using 2000 classes. For the count-based JTR model and OSM we tuned the n-gram size on the tuning set of each task. For the full data, 7-grams were used for the IWSLT and WMT tasks, and 8-grams for BOLT. When using in-domain data, smaller n-gram sizes were used. All rescoring experiments used 1000-best lists without duplicates.
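Rescoring an n-best list in a log-linear framework can be sketched as follows. This is a generic illustration rather than the setup used in the experiments: the feature names, the dictionary-based representation and the function name are hypothetical, and the weights would in practice come from tuning (e.g. MERT). Duplicate hypotheses are skipped, mirroring the use of n-best lists without duplicates.

```python
from typing import Dict, Iterable, Tuple


def rescore(nbest: Iterable[Tuple[str, Dict[str, float]]],
            weights: Dict[str, float]) -> Tuple[str, float]:
    """Return the best hypothesis under a weighted sum of feature scores.

    Each n-best entry is (hypothesis string, {feature name: log score}).
    Features without a weight contribute nothing.
    """
    seen = set()
    best_hypothesis = None
    best_score = float("-inf")
    for hypothesis, features in nbest:
        if hypothesis in seen:          # ignore duplicate hypotheses
            continue
        seen.add(hypothesis)
        score = sum(weights.get(name, 0.0) * value
                    for name, value in features.items())
        if score > best_score:
            best_hypothesis, best_score = hypothesis, score
    if best_hypothesis is None:
        raise ValueError("empty n-best list")
    return best_hypothesis, best_score
```

A neural JTR model would enter this scheme as one additional feature per hypothesis, with its weight tuned alongside the baseline features.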

Tasks description
The domain of IWSLT consists of lecture-type talks presented at TED conferences which are also available online. All systems are optimized on the dev2010 corpus, named dev here. Some of the OSM and JTR systems are trained on the TED portions of the data containing 138K sentences. To estimate the 4-gram LM, we additionally make use of parts of the Shuffled News, LDC English Gigaword and $10^9$-French-English corpora, selected by a cross-entropy difference criterion (Moore and Lewis, 2010). In total, 1.7 billion running words are taken for LM training. The BOLT Chinese→English task is evaluated on the "discussion forum" domain. The 5-gram LM is trained on 2.9 billion running words in total. The in-domain data consists of a subset of 67.8K sentences and we used a set of 1845 sentences for tuning. The evaluation set test1 contains 1844 and test2 1124 sentences. For the WMT task, we used the target side of the bilingual data and all monolingual data to train a pruned 5-gram LM on a total of 4.4 billion running words. We concatenated the newstest2011 and newstest2012 corpora for tuning the systems.

Results
We start with the IWSLT 2013 German→English task, where we compare the different JTR and OSM models. The results are shown in Table 3. When comparing the in-domain n-gram JTR model trained using Kneser-Ney smoothing (KN) to OSM, we observe that the n-gram KN JTR model improves the baseline by 1.4 BLEU on both test and eval11. The OSM model performs similarly, with a slight disadvantage on eval11. In comparison, the FFNN of Eq. (4) improves the baseline by 0.7-0.9 BLEU, while the URNN performs slightly better. The difference between the FFNN and the URNN is that the latter captures the unbounded source and target history that extends until the beginning of the sentences, giving it an advantage over the FFNN. The performance of the URNN can be improved by including the future part of the source sentence, as described in Eq. (5), resulting in the BRNN model.
Next, we explore whether the models are additive. When rescoring the n-gram KN JTR output with the BRNN, an additional improvement of 0.6 BLEU is obtained. There are two reasons for this: The BRNN includes the future part of the source input when scoring target words. This information is not used by the KN model. Moreover, the BRNN is able to score word combinations unseen in training, while the KN model uses backing off to score unseen events.
When training the KN, FFNN, and OSM models on the full data, we observe smaller gains in comparison to in-domain data training. However, combining the KN models trained on in-domain and full data gives additional gains, which suggests that although the in-domain model is more adapted to the task, it can still gain from out-of-domain data. Adding the FFNN on top improves the combination. Although the FFNN sees the same information as the KN model, it differs in two respects. First, the NN operates on the word level rather than the word-pair level. Second, the FFNN is able to handle unseen sequences by design, without the need for the backing off workaround.
The BRNN improves the combination more than the FFNN, as the model captures an unbounded source and target history in addition to an unbounded future source context. Combining the KN, FFNN and BRNN JTR models leads to an overall gain of 2.2 BLEU on both dev and test.
Next, we present the BOLT Chinese→English results, shown in Table 4. Comparing n-gram KN JTR and OSM trained on the in-domain data shows they perform equally well on test1, improving the baseline by 0.7 BLEU, with a slight advantage for the JTR model on test2. The feed-forward and the recurrent in-domain networks yield the same results. Training the OSM and JTR models on the full data yields slightly worse results than in-domain training. However, combining the two types of training improves the results. This is shown when adding the in-domain KN JTR model on top of the model trained on full data, improving it by up to 0.4 BLEU. Rescoring with the feed-forward and the recurrent network improves this even further, supporting the previous observation that the n-gram KN JTR and NNs complement each other. The combination of the 4 models yields an overall improvement of 1.2-1.4 BLEU.
Finally, we compare KN JTR and OSM models on the WMT German→English task in Table 5.

Analysis
To investigate the effect of including jump information in the JTR sequence, we trained a BRNN using jump classes and another excluding them. The BRNNs were used in rescoring. Below, we demonstrate the difference between the systems:
source: wir kommen später noch auf diese Leute zurück .
reference: We'll come back to these people later .
Note the German verb "zurückkommen", which is split into "kommen" and "zurück". German places "kommen" at the second position and "zurück" towards the end of the sentence. Unlike German, the corresponding English phrase "come back" has the words adjacent to each other. We found that the system including jumps prefers the correct translation of the verb, as shown in Hypothesis 1 above. The system translates "kommen" to "come", jumps forward to "zurück", translates it to "back", then jumps back to continue translating the word "später". In contrast, the system that excludes jump classes is blind to this separation of words. It favors Hypothesis 2 which is a strictly monotone translation of the German sentence. This is also reflected by the BLEU scores, where we found the system including jump classes outperforming the one without by up to 0.8 BLEU.

Conclusion
We introduced a method that converts bilingual sentence pairs and their word alignments into joint translation and reordering (JTR) sequences. They combine interdepending lexical and alignment dependencies into a single framework. A main advantage of JTR sequences is that a variety of models can be trained on them. Here, we have estimated n-gram models with modified Kneser-Ney smoothing, FFNN and RNN architectures on JTR sequences.
We compared our count-based JTR model to the OSM, both used in phrase-based decoding, and showed that the JTR model performed at least as well as the OSM, with a slight advantage for JTR. In comparison to the OSM, the JTR model operates on words, leading to a smaller vocabulary size. Moreover, it utilizes simpler reordering structures without gaps and only requires one log-linear feature to be tuned, whereas the OSM needs 5. Due to the flexibility of JTR sequences, we can also apply them to FFNNs and RNNs. Utilizing two count models and applying both networks in rescoring yields the overall highest improvement over the phrase-based system, by up to 2.2 BLEU on the German→English IWSLT task. The combination outperforms OSM by up to 1.2 BLEU on the BOLT Chinese→English tasks.
The JTR models are not dependent on the phrase-based framework, and one of the long-term goals is to perform standalone decoding with the JTR models independently of phrase-based systems. Without the limitations introduced by phrases, we believe that JTR models could perform even better. In addition, we aim to use JTR models to obtain the alignment, which would then be used to train the JTR models in an iterative manner, with the aim of achieving consistency and improved models.