Forest-Based Neural Machine Translation

Tree-based neural machine translation (NMT) approaches, although they have achieved impressive performance, suffer from a major drawback: they use only the 1-best parse tree to direct the translation, which potentially introduces translation mistakes due to parsing errors. For statistical machine translation (SMT), forest-based methods have proven effective in solving this problem, while for NMT this kind of approach has not been attempted. This paper proposes a forest-based NMT method that translates a linearized packed forest under a simple sequence-to-sequence framework (i.e., a forest-to-sequence NMT model). The BLEU score of the proposed method is higher than those of the sequence-to-sequence NMT, tree-based NMT, and forest-based SMT systems.


Introduction
NMT has witnessed promising improvements recently. Depending on the types of input and output, these efforts can be divided into three categories: string-to-string systems; tree-to-string systems (Eriguchi et al., 2016, 2017); and string-to-tree systems (Aharoni and Goldberg, 2017; Nadejde et al., 2017). Compared with string-to-string systems, tree-to-string and string-to-tree systems (henceforth, tree-based systems) offer some attractive features. They can use more syntactic information, and can conveniently incorporate prior knowledge. Because of these advantages, tree-based methods have become the focus of much recent NMT research.
* Contribution during internship at National Institute of Information and Communications Technology.
† Corresponding author
Based on how trees are represented, tree-based NMT methods fall into two main categories: those that represent trees with a tree-structured neural network (Eriguchi et al., 2016; Zaremoodi and Haffari, 2017), and those that represent trees by linearization (Vinyals et al., 2015; Dyer et al., 2016; Ma et al., 2017). Compared with the former, the latter has a relatively simple model structure, so a larger corpus can be used for training and the model can be trained within reasonable time; it is therefore preferred from the viewpoint of computation. We thus focus on this kind of method in this paper.
In spite of the impressive performance of tree-based NMT systems, they suffer from a major drawback: they use only the 1-best parse tree to direct the translation, which potentially introduces translation mistakes due to parsing errors (Quirk and Corston-Oliver, 2006). For SMT, forest-based methods have employed a packed forest to address this problem (Huang, 2008), which represents exponentially many parse trees rather than just the 1-best one. But for NMT, (computationally efficient) forest-based methods are still being explored.
Because of the structural complexity of forests, the lack of an appropriate topological ordering, and the hyperedge-attachment nature of weights (see Section 3.1 for details), it is not trivial to linearize a forest. This hinders the development of forest-based NMT to some extent.
Inspired by the tree-based NMT methods based on linearization, we propose an efficient forest-based NMT approach (Section 3), which can encode the syntactic information of a packed forest on the basis of a novel weighted linearization method for a packed forest (Section 3.1), and can decode the linearized packed forest under the simple sequence-to-sequence framework (Section 3.2). Experiments demonstrate the effectiveness of our method (Section 4).

Preliminaries
We first review the general sequence-to-sequence model (Section 2.1), then describe tree-based NMT systems based on linearization (Section 2.2), and finally introduce the packed forest, through which exponentially many trees can be represented in a compact manner (Section 2.3).

Sequence-to-sequence model
Current NMT systems usually resort to a simple framework, i.e., the sequence-to-sequence model. Given a source sequence $(x_0, \ldots, x_T)$, in order to find a target sequence $(y_0, \ldots, y_{T'})$ that maximizes the conditional probability $p(y_0, \ldots, y_{T'} \mid x_0, \ldots, x_T)$, the sequence-to-sequence model uses one RNN to encode the source sequence into a fixed-length context vector $c$ and a second RNN to decode this vector and generate the target sequence. Formally, the probability of the target sequence can be calculated as follows:
$$p(y_0, \ldots, y_{T'} \mid x_0, \ldots, x_T) = \prod_{t=0}^{T'} p(y_t \mid c, y_0, \ldots, y_{t-1}), \qquad (1)$$
where
$$p(y_t \mid c, y_0, \ldots, y_{t-1}) = g(y_{t-1}, s_t, c), \qquad (2)$$
$$s_t = f(s_{t-1}, y_{t-1}, c), \qquad (3)$$
$$c = q(h_0, \ldots, h_T), \qquad (4)$$
$$h_t = f(e_t, h_{t-1}). \qquad (5)$$
Here, $g$, $f$, and $q$ are nonlinear functions; $h_t$ and $s_t$ are the hidden states of the source-side RNN and target-side RNN, respectively, $c$ is the context vector, and $e_t$ is the embedding of $x_t$.
An attention mechanism was later introduced to deal with the issues related to long sequences. Instead of encoding the source sequence into a fixed vector $c$, the attention model uses a different $c_i$ when calculating the target-side output $y_i$ at time step $i$:
$$c_i = \sum_{j=0}^{T} \alpha_{ij} h_j, \qquad (6)$$
$$\alpha_{ij} = \frac{\exp(a(s_{i-1}, h_j))}{\sum_{k=0}^{T} \exp(a(s_{i-1}, h_k))}. \qquad (7)$$
The function $a(s_{i-1}, h_j)$ can be regarded as representing the soft alignment between the target-side RNN hidden state $s_{i-1}$ and the source-side RNN hidden state $h_j$. By changing the format of the source/target sequences, this framework can be regarded as a string-to-string NMT system, a tree-to-string NMT system, or a string-to-tree NMT system (Aharoni and Goldberg, 2017).
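As a rough illustration of the attention step described above, the following minimal sketch normalizes the soft-alignment scores a(s_{i-1}, h_j) with a softmax and uses the resulting weights to form a context vector. The function names, the plain-list vector type, and the dot-product alignment function are assumptions made for this example only; the model itself only requires a to be some alignment function.

```python
import math
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, used here as a stand-in alignment function (an assumption)."""
    return sum(x * y for x, y in zip(u, v))


def attention_weights(s_prev: Vector, hs: List[Vector],
                      a: Callable[[Vector, Vector], float]) -> List[float]:
    """Softmax over the alignment scores a(s_prev, h_j) for all source states h_j."""
    scores = [a(s_prev, h) for h in hs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alphas: List[float], hs: List[Vector]) -> Vector:
    """Weighted sum of the source hidden states."""
    dim = len(hs[0])
    return [sum(alpha * h[d] for alpha, h in zip(alphas, hs)) for d in range(dim)]


# Toy usage: two source hidden states and one previous decoder state.
source_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_state = [1.0, 0.0]
weights = attention_weights(decoder_state, source_states, dot)
context = context_vector(weights, source_states)
```

Because the weights sum to one, the context vector is a convex combination of the source hidden states; the source state that aligns best with the decoder state contributes most.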

Linear-structured tree-based NMT systems
Regarding the linearization adopted for tree-to-string NMT (i.e., linearization of the source side), Sennrich and Haddow (2016) encoded the sequence of dependency labels and the sequence of words simultaneously, partially utilizing the syntax information, while another approach traversed the constituent tree of the source sentence and combined this with the word sequence, utilizing the syntax information completely. Regarding the linearization used for string-to-tree NMT (i.e., linearization of the target side), Nadejde et al. (2017) used a CCG supertag sequence as the target sequence, while Aharoni and Goldberg (2017) applied a linearization method in a top-down manner, generating a sequence ensemble for the annotated tree in the Penn Treebank (Marcus et al., 1993). Wu et al. (2017) used transition actions to linearize a dependency tree, and employed the sequence-to-sequence framework for NMT.
As can be seen, all current tree-based NMT systems use only one tree for encoding or decoding. In contrast, we aim to utilize multiple trees (i.e., a forest). This is not trivial, on account of the lack of a fixed traversal order and the need for a compact representation.

Packed forest
The packed forest gives a representation of exponentially many parsing trees, and can compactly encode many more candidates than the n-best list (Huang, 2008). Figure 1a shows a packed forest, which can be unpacked into two constituent trees (Figure 1b and Figure 1c). Formally, a packed forest is a pair $\langle V, E \rangle$, where $V$ is the set of nodes and $E$ is the set of hyperedges. Each $v \in V$ can be represented as $X_{i,j}$, where $X$ is a constituent label and $i, j \in [0, n]$ are indices of words, showing that the node spans the words ranging from $i$ (inclusive) to $j$ (exclusive). Here, $n$ is the length of the input sentence. Each $e \in E$ is a three-tuple $\langle head(e), tails(e), score(e) \rangle$, where $head(e) \in V$ is similar to the head node in a constituent tree, and $tails(e) \in V^*$ is similar to the set of child nodes in a constituent tree. $score(e) \in \mathbb{R}$ is the logarithm of the probability, calculated by the parser, that $tails(e)$ represents the tails of $head(e)$. Based on $score(e)$, the score of a constituent tree $T$ can be calculated as follows:
$$score(T) = \lambda \cdot n + \sum_{e \in E(T)} score(e),$$
where $E(T)$ is the set of hyperedges appearing in tree $T$, and $\lambda$ is a regularization coefficient for the sentence length 2.
Figure 1: (a) Packed forest. Note that the terminal nodes (i.e., words in the sentence) in the packed forest are shown only for illustration, and they do not belong to the packed forest.
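To make the definition above concrete, here is a minimal sketch of a packed forest as a set of nodes and scored hyperedges, together with a tree score. The class and function names are hypothetical, the toy labels and scores are invented for illustration, and reading the "regularization coefficient for the sentence length" as an additive term λ·n is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Node:
    """A forest node X_{i,j}: label X spanning words i (inclusive) to j (exclusive)."""
    label: str
    start: int
    end: int


@dataclass(frozen=True)
class Hyperedge:
    """A hyperedge <head, tails, score>; score is a log-probability from the parser."""
    head: Node
    tails: Tuple[Node, ...]
    score: float


def tree_score(tree_edges: Iterable[Hyperedge], n: int, lam: float) -> float:
    """Score of one tree: length term lam * n plus the sum of its hyperedge scores.

    The additive length term is an assumption of this sketch.
    """
    return lam * n + sum(edge.score for edge in tree_edges)


# Toy forest over a three-word sentence: the same head node X_{0,3} has two
# incoming hyperedges, so the forest packs two alternative analyses.
x = Node("X", 0, 3)
a, y = Node("A", 0, 1), Node("Y", 1, 3)
z, c = Node("Z", 0, 2), Node("C", 2, 3)
left_split = Hyperedge(x, (a, y), -0.4)
right_split = Hyperedge(x, (z, c), -0.9)

score_left = tree_score([left_split], n=3, lam=0.1)
score_right = tree_score([right_split], n=3, lam=0.1)
```

Sharing the head node between the two hyperedges is what keeps the representation compact: common sub-structures are stored once, while each combination of hyperedge choices corresponds to a different tree.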
2 Following the configuration of Charniak and Johnson (2005), for all the experiments in this paper, we fixed λ to $\log_2 600$.

Forest-based NMT
We first propose a linearization method for the packed forest (Section 3.1), then describe how to encode the linearized forest (Section 3.2), which can then be translated by the conventional decoder (see Section 2.1).

Forest linearization
Recently, several studies have focused on linearization methods for a syntax tree, both in the area of tree-based NMT (Section 2.2) and in the area of parsing (Vinyals et al., 2015; Dyer et al., 2016; Ma et al., 2017). Basically, these methods follow a fixed traversal order (e.g., depth-first), which does not exist for the packed forest (a directed acyclic graph (DAG)). Furthermore, the weights are attached to the edges of a packed forest instead of the nodes, which further increases the difficulty.
Topological ordering algorithms for DAGs (Kahn, 1962; Tarjan, 1976) are not good solutions, because the output ordering is not always optimal for machine translation. In particular, a topological ordering could ignore "word sequential information" and "parent-child information" in the sentences. For example, for the packed forest in Figure 1a, although "[10]→[1]→[2]→···→[9]→[11]" is a valid topological ordering, the sequential information of the words (e.g., "John" should be located ahead of the period), which is fairly crucial for translating languages with a fixed pragmatic word order such as Chinese or English, is lost.
As another example, for the packed forest in Figure 1a, a valid topological ordering may likewise lose the parent-child information.
Algorithm 1 Linearization of a packed forest
    ...
 8:     if v has no parent then
 9:       return v
10: procedure EXPANDSEQ(v, r, ⟨V, E⟩, w)
11:   for e ∈ E do
12:     if head(e) = v then
13:       if tails(e) ≠ ∅ then
14:         for t ∈ SORT(tails(e)) do        ▷ Sort tails(e) by word indices.
    ...
18:         l ← ⓒ LINEARIZEEDGES(tails(e), w)        ▷ ⓒ is a unary operator.
19:         r.append(⟨l, σ(score(e))⟩)
    ...
To address the above two problems, we propose a novel linearization algorithm for a packed forest (Algorithm 1). The algorithm linearizes the packed forest from the root node (Line 2) to leaf nodes by calling the EXPANDSEQ procedure (Line 15) recursively, while preserving the word order in the sentence (Line 14). In this way, word sequential information is preserved. Within the EXPANDSEQ procedure, once a hyperedge is linearized (Line 16), the tails are also linearized immediately (Line 18). In this way, parent-child information is preserved. Intuitively, different parts of constituent trees should be combined in different ways; therefore, we define different operators (ⓒ, ⊗, ⊕, or ⊛) to represent the relationships between different parts, so that the representations of these parts can be combined in different ways (see Section 3.2 for details). Words are concatenated with each other by the operator "⊛", a word and a constituent label are concatenated by the operator "⊗", the linearization results of child nodes are concatenated with each other by the operator "⊕", while the unary operator "ⓒ" is used to indicate that the node is the child node of the previous part. Furthermore, each token in the linearized sequence is associated with a score, representing the confidence of the parser.
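The recursive procedure described above can be sketched as follows. This is a reading of the prose rather than a transcription of Algorithm 1: the helper names, the tuple-based node and hyperedge types, the treatment of leaf nodes as hyperedges with empty tails, and the neutral score σ(0) given to tokens that carry no hyperedge score are all assumptions of this sketch, as is the glyph "⊛" chosen for the word-joining operator.

```python
import math
from typing import List, Tuple

Node = Tuple[str, int, int]                   # (label, start, end), end exclusive
Edge = Tuple[Node, Tuple[Node, ...], float]   # (head, tails, log-probability score)
Token = Tuple[str, float]                     # (symbol, score)

CHILD, LABEL_WORD, SIBLING, WORD_WORD = "ⓒ", "⊗", "⊕", "⊛"


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linearize_node(node: Node, words: List[str]) -> str:
    """Join a constituent label with the words it spans."""
    label, start, end = node
    return label + LABEL_WORD + WORD_WORD.join(words[start:end])


def linearize_nodes(nodes: List[Node], words: List[str]) -> str:
    """Join the linearizations of sibling nodes."""
    return SIBLING.join(linearize_node(v, words) for v in nodes)


def find_root(nodes: List[Node], edges: List[Edge]) -> Node:
    """The root is the node that is not a tail of any hyperedge."""
    children = {t for _, tails, _ in edges for t in tails}
    return next(v for v in nodes if v not in children)


def expand_seq(v: Node, out: List[Token], edges: List[Edge], words: List[str]) -> None:
    for head, tails, score in edges:
        if head != v:
            continue
        if tails:
            ordered = sorted(tails, key=lambda n: (n[1], n[2]))  # keep word order
            for t in ordered:
                expand_seq(t, out, edges, words)
            # Head token: neutral score sigmoid(0.0) is an assumption of this sketch.
            out.append((linearize_node(head, words), sigmoid(0.0)))
            # Child token carries the squashed hyperedge score.
            out.append((CHILD + linearize_nodes(ordered, words), sigmoid(score)))
        else:
            out.append((linearize_node(head, words), sigmoid(0.0)))


def linearize_forest(nodes: List[Node], edges: List[Edge], words: List[str]) -> List[Token]:
    out: List[Token] = []
    expand_seq(find_root(nodes, edges), out, edges, words)
    return out


# Toy input (a single tree, i.e., a forest without ambiguity).
words = ["John", "runs"]
nnp, np_, vbz, vp, s = ("NNP", 0, 1), ("NP", 0, 1), ("VBZ", 1, 2), ("VP", 1, 2), ("S", 0, 2)
edges = [(s, (np_, vp), -0.1), (np_, (nnp,), -0.2), (vp, (vbz,), -0.3),
         (nnp, (), 0.0), (vbz, (), 0.0)]
sequence = linearize_forest([nnp, np_, vbz, vp, s], edges, words)
```

In this sketch a node with several incoming hyperedges contributes one head token and one child token per hyperedge, and a node shared by several parents is expanded once per parent, which is one reason the linearized forest is longer than the word sequence.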
The linearization result of the packed forest in Figure 1a is shown in Figure 2. Tokens in the linearized sequence are separated by slashes. Each token in the sequence is composed of different types of symbols combined by different operators. We can see that word sequential information is preserved. For example, "NNP⊗John" (linearization result of node [1]) is in front of "VBZ⊗has" (linearization result of node [3]), which is in front of "DT⊗a" (linearization result of node [4]). Moreover, parent-child information is also preserved. For example, "NP⊗John" (linearization result of node [2]) is followed by "ⓒNNP⊗John" (linearization result of node [1], the child of node [2]).
Note that the packed forest cannot be fully recovered from our linearization. Our aim is not to propose a fully recoverable linearization method, but to encode as much syntax information as possible, so that the performance of NMT can be improved. As will be shown in Section 4, this goal is achieved.
Also note that there is one more advantage of our linearization method: the linearized sequence is a weighted sequence, while all the previous studies ignored the weights during linearization.
As will be shown in Section 4, the weights are actually important not only for the linearization of a packed forest, but also for the linearization of a single tree. By preserving only the nodes and hyperedges in the 1-best tree and removing all others, our linearization method can be regarded as a tree-linearization method. Compared with other tree-linearization methods, our method combines several different kinds of information within one symbol, retaining the parent-child information, and incorporating the confidence of the parser in the sequence. We examine whether the weights can be useful not only for linear-structured tree-based NMT but also for our forest-based NMT.
Furthermore, although our method is non-reversible for packed forests, it is reversible for constituent trees, in that the linearization proceeds exactly in the depth-first traversal order and all necessary information in the tree nodes is encoded. As far as we know, there is no previous work on the linearization of packed forests.

Encoding the linearized forest
The linearized packed forest forms the input of the encoder, which has two major differences from the input of a sequence-to-sequence NMT system. First, the input sequence of the encoder consists of two parts: the symbol sequence and the score sequence. Second, each symbol in the symbol sequence consists of several parts (words and constituent labels), which are combined by certain operators (ⓒ, ⊗, ⊕, or ⊛). Based on these observations, we propose two new frameworks, which are illustrated in Figure 3.
Formally, the input layer receives the sequence $(\langle l_0, \xi_0 \rangle, \ldots, \langle l_T, \xi_T \rangle)$, where $l_i$ denotes the $i$-th symbol and $\xi_i$ its score. Then, the sequence is fed into the score layer and the symbol layer, which output the score sequence $\boldsymbol{\xi} = (\xi_0, \ldots, \xi_T)$ and the symbol sequence $\mathbf{l} = (l_0, \ldots, l_T)$, respectively. Any item $l \in \mathbf{l}$ in the symbol layer has the form
$$l = o_0 x_1 o_1 x_2 \cdots o_{m-1} x_m,$$
where each $x_k$ ($k = 1, \ldots, m$) is a word or a constituent label, $m$ is the total number of words and constituent labels in a symbol, $o_0$ is "ⓒ" or empty, and each $o_k$ ($k = 1, \ldots, m-1$) is either "⊗", "⊕", or "⊛". Then, in the node/operator layer, the $x$'s and $o$'s are separated and rearranged as $\mathbf{x} = (x_1, \ldots, x_m, o_0, \ldots, o_{m-1})$, which is fed to the pre-embedding layer. The pre-embedding layer generates a sequence $\mathbf{p} = (p_1, \ldots, p_m, \ldots, p_{2m})$, which is calculated as follows:
$$\mathbf{p} = W_{emb}[I(\mathbf{x})].$$
Here, the function $I(\mathbf{x})$ returns a list of the indices in the dictionary for all the elements in $\mathbf{x}$, which consist of words, constituent labels, or operators. In addition, $W_{emb}$ is the embedding matrix of size $(|w_{word}| + |w_{label}| + 4) \times d_{word}$, where $|w_{word}|$ and $|w_{label}|$ are the total numbers of words and constituent labels, respectively, $d_{word}$ is the dimension of the word embedding, and there are four possible operators: "ⓒ", "⊗", "⊕", and "⊛". Note that $\mathbf{p}$ is a list of $2m$ vectors, and the dimension of each vector is $d_{word}$.
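The separation performed in the node/operator layer can be illustrated with a short sketch that splits one symbol into its words/labels and its operators and then rearranges them in the order described above. The function names are hypothetical, the operator glyphs (in particular "⊛" for joining words) are assumptions of this example, and an empty leading operator is represented here by an empty string.

```python
from typing import List, Tuple

CHILD = "ⓒ"                      # unary operator, may only appear first
BINARY_OPS = {"⊗", "⊕", "⊛"}     # binary operators between words/labels


def split_symbol(symbol: str) -> Tuple[List[str], List[str]]:
    """Split a symbol into (x_1..x_m) and (o_0..o_{m-1}).

    o_0 is the child marker if present, otherwise an empty string.
    """
    ops = [""]
    rest = symbol
    if symbol.startswith(CHILD):
        ops[0] = CHILD
        rest = symbol[len(CHILD):]

    items: List[str] = []
    current = ""
    for ch in rest:
        if ch in BINARY_OPS:
            items.append(current)   # close the current word or label
            ops.append(ch)
            current = ""
        else:
            current += ch
    items.append(current)
    return items, ops


def rearrange(symbol: str) -> List[str]:
    """Return (x_1, ..., x_m, o_0, ..., o_{m-1}) as one list of length 2m."""
    items, ops = split_symbol(symbol)
    return items + ops


child_token = rearrange("ⓒNP⊗John⊕VP⊗runs")
root_token = rearrange("S⊗John⊛runs")
```

Each rearranged list has exactly 2m entries for a symbol with m words and labels, which matches the length of the pre-embedding sequence for that symbol.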
Because the length of the sequence of the input layer is $T + 1$, there are $T + 1$ different $\mathbf{p}$'s in the pre-embedding layer, which we denote by $P = (\mathbf{p}_0, \ldots, \mathbf{p}_T)$. Depending on where the score layer is incorporated, we propose two frameworks: Score-on-Embedding (SoE) and Score-on-Attention (SoA). In SoE, the $k$-th element $e_k$ of the embedding layer is calculated from the vectors in $\mathbf{p}_k$ together with the score $\xi_k$, while in SoA, $e_k$ is calculated from the vectors in $\mathbf{p}_k$ alone, where $k = 0, \ldots, T$. Note that $e_k \in \mathbb{R}^{d_{word}}$. In this manner, the proposed forest-to-string NMT framework is connected with the conventional sequence-to-sequence NMT framework. After calculating the embedding vectors in the embedding layer, the hidden vectors are calculated using Equation 5. When calculating the context vectors $c_i$, SoE and SoA differ from each other. For SoE, the $c_i$ are calculated using Equations 6 and 7, while for SoA, the scores are incorporated into the calculation of the $\alpha_{ij}$ that are used to compute the $c_i$. Then, using the decoder of the sequence-to-sequence framework, the sentence of the target language can be generated.
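The difference between the two frameworks can be sketched as follows. The exact formulas are not reproduced here, so this example rests on two loudly stated assumptions: the 2m pre-embedding vectors of a symbol are pooled by averaging, and the score acts multiplicatively (on the pooled embedding in SoE, on the alignment score inside the softmax in SoA). Function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def pool(p: List[Vector]) -> Vector:
    """Average the pre-embedding vectors of one symbol (pooling choice is an assumption)."""
    dim = len(p[0])
    return [sum(vec[d] for vec in p) / len(p) for d in range(dim)]


def embed_soe(p: List[Vector], xi: float) -> Vector:
    """Score-on-Embedding: the score rescales the pooled embedding (assumed form)."""
    return [xi * value for value in pool(p)]


def embed_soa(p: List[Vector]) -> Vector:
    """Score-on-Attention: the embedding itself ignores the score."""
    return pool(p)


def soa_attention(align_scores: List[float], xis: List[float]) -> List[float]:
    """Score-on-Attention: scores scale the alignment values before the softmax (assumed form)."""
    scaled = [xi * a for xi, a in zip(xis, align_scores)]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


pre_embedding = [[1.0, 3.0], [3.0, 5.0]]
e_soe = embed_soe(pre_embedding, xi=0.5)
e_soa = embed_soa(pre_embedding)
alphas = soa_attention([math.log(3.0), 5.0], xis=[1.0, 0.0])
```

Under these assumptions, setting every score to 1.0 makes both variants collapse to the same unweighted model, which is consistent with how a score-free configuration would behave.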

Setup
We evaluate the effectiveness of our forest-based NMT systems on English-to-Chinese and English-to-Japanese translation tasks 3. The statistics of the corpora used in our experiments are summarized in Table 1. The packed forests of English sentences are obtained by the constituent parser proposed by Huang (2008) 4. We filtered out the sentences for ... For Japanese sentences, we followed the preprocessing steps recommended in WAT 2017. We implemented our framework based on nematus (Sennrich et al., 2017). For optimization, we used the Adadelta algorithm (Zeiler, 2012). In order to avoid overfitting, we used dropout (Srivastava et al., 2014) on the embedding layer and hidden layer, with the dropout probability set to 0.2. We used the gated recurrent unit as the recurrent unit of the RNNs, which are bi-directional, with one hidden layer.
3 English is commonly chosen as the target language. We chose English as the source language because a high-performance forest parser is not available for other languages.
4 http://web.engr.oregonstate.edu/ huanlian/software/forest-reranker/ forest-charniak-v0.8.tar.bz2
Based on the tuning result, we set the maximum length of the input sequence to 300, the hidden layer size to 512, the dimension of the word embedding to 620, and the batch size for training to 40. We pruned the packed forest using the algorithm of Huang (2008), with a threshold of 5. If the linearization of the pruned forest is still longer than 300, then we linearize the 1-best parsing tree instead of the forest. During decoding, we used beam search, and fixed the beam size to 12. For the case of Forest (SoA), with 1 core of Tesla K80 GPU and the LDC corpus as the training data, training took about 10 days, and the decoding speed was about 10 sentences per second.
Table 2: English-Chinese experimental results (character-level BLEU). "FS," "TN," and "FN" denote forest-based SMT, tree-based NMT, and forest-based NMT systems, respectively. We performed the paired bootstrap resampling significance test (Koehn, 2004).
Table 3: English-Japanese experimental results (character-level BLEU).

Experimental results
Tables 2 and 3 summarize the experimental results. To avoid the effect of segmentation errors, the performance was evaluated by character-level BLEU (Papineni et al., 2002). We compare our proposed models (i.e., Forest (SoE) and Forest (SoA)) with three types of baseline: a string-to-string model (s2s), forest-based models that do not use score sequences (Forest (No score)), and tree-based models that use the 1-best parsing tree (1-best (No score, SoE, SoA)). For the 1-best models, we preserve the nodes and hyperedges that are used in the 1-best constituent tree in the packed forest, and remove all other nodes and hyperedges, yielding a pruned forest that contains only the 1-best constituent tree. For the "No score" configurations, we force the input score sequence to be a sequence of 1.0 with the same length as the input symbol sequence, so that neither the embedding layer nor the attention layer is affected by the score sequence.
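The two baseline constructions described above, restricting the forest to the 1-best tree and neutralizing the scores, can be sketched as follows. The function names and tuple-based types are hypothetical, and the set of hyperedges belonging to the 1-best tree is assumed to be supplied by the parser.

```python
from typing import List, Set, Tuple

Node = Tuple[str, int, int]                   # (label, start, end), end exclusive
Edge = Tuple[Node, Tuple[Node, ...], float]   # (head, tails, score)
EdgeKey = Tuple[Node, Tuple[Node, ...]]       # a hyperedge without its score
Token = Tuple[str, float]                     # (symbol, score)


def keep_one_best(edges: List[Edge], one_best: Set[EdgeKey]) -> List[Edge]:
    """Keep only the hyperedges used by the 1-best tree (assumed given by the parser)."""
    return [edge for edge in edges if (edge[0], edge[1]) in one_best]


def nodes_of(edges: List[Edge]) -> Set[Node]:
    """Nodes that survive the pruning: heads and tails of the kept hyperedges."""
    nodes: Set[Node] = set()
    for head, tails, _ in edges:
        nodes.add(head)
        nodes.update(tails)
    return nodes


def without_scores(sequence: List[Token]) -> List[Token]:
    """'No score' configuration: replace every score by 1.0, keeping the length."""
    return [(symbol, 1.0) for symbol, _ in sequence]


# Toy forest: the head X_{0,3} has two competing hyperedges.
x, a, y, z, c = ("X", 0, 3), ("A", 0, 1), ("Y", 1, 3), ("Z", 0, 2), ("C", 2, 3)
forest_edges = [(x, (a, y), -0.4), (x, (z, c), -0.9)]
tree_edges = keep_one_best(forest_edges, {(x, (a, y))})
tree_nodes = nodes_of(tree_edges)
neutral = without_scores([("X⊗a⊛b⊛c", 0.5), ("ⓒA⊗a⊕Y⊗b⊛c", 0.4)])
```

Because the pruned forest is still a packed forest, the same linearization and encoder can be reused unchanged for the 1-best baselines, which keeps the comparison with the forest-based models controlled.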
In addition, we also perform a comparison with some state-of-the-art tree-based systems that are publicly available, including an SMT system and the NMT systems (Eriguchi et al. (2016) 9, Chen et al. (2017) 10, and Li et al. (2017)). For the SMT system, we use the implementation of cicada 11. For Li et al. (2017), we reimplemented the "Mixed RNN Encoder" model, because of its outstanding performance on the NIST MT corpus.
We can see that for both English-Chinese and English-Japanese, compared with the s2s baseline system, both the 1-best and forest-based configurations yield better results. This indicates that the syntactic information contained in the constituent trees or forests is indeed useful for machine translation. Specifically, we observe the following facts.
First, among the three different frameworks SoE, SoA, and No-score, the SoA framework performs the best, while the No-score framework performs the worst. This indicates that the scores of the edges in constituent trees or packed forests, which reflect the confidence in the correctness of the edges, are indeed useful. In fact, for the 1-best constituent parsing tree, the score of the edge reflects the confidence of the parser. By using this information, the NMT system succeeds in learning a better attention, paying much attention to confident structures and little attention to unconfident ones, which improves the translation performance. This fact is ignored by previous studies on tree-based NMT. Furthermore, it is better to use the scores to modify the values of attention instead of rescaling the word embeddings, because modifying word embeddings carelessly may change the semantic meanings of words.
Second, compared with the cases that use only the 1-best constituent trees, using packed forests yields statistically significantly better results for the SoE and SoA frameworks. This shows the effectiveness of using more syntactic information. Compared with one constituent tree, the packed forest, which contains multiple different trees, describes the syntactic structure of the sentence from different aspects, which together increase the accuracy of machine translation. However, without using the scores, the 1-best constituent tree is preferred. This is because without the scores, all trees in the packed forest are treated equally, which makes it easy to import noise into the encoder.
9 https://github.com/tempra28/tree2seq
10 https://github.com/howardchenhd/Syntax-awared-NMT
11 https://github.com/tarowatanabe/cicada
[Source] In the Czech Republic, which was ravaged by serious floods last summer, the temperatures in its border region adjacent to neighboring Slovakia plunged to minus 18 degrees Celsius.
Compared with other types of state-of-the-art systems, our systems using only the 1-best tree (1-best (SoE, SoA)) are better than the other tree-based systems. Moreover, our NMT systems using the packed forests achieve the best performance. These results also support the usefulness of the scores of the edges and packed forests in NMT.
As for the efficiency, the training time of the SoA system was slightly longer than that of the SoE system, which was about twice that of the s2s baseline. The training time of the tree-based system was about 1.5 times that of the baseline. For the case of Forest (SoA), with 1 core of Tesla P100 GPU and the LDC corpus as the training data, training took about 10 days, and the decoding speed was about 10 sentences per second. The reason for the relatively low efficiency is that the linearized sequences of packed forests were much longer than word sequences, enlarging the scale of the inputs. Despite this, the training process ended within reasonable time.

Qualitative analysis
Figure 4 illustrates the translation results of an English sentence using several different configurations: the s2s baseline, using only the 1-best tree (SoE), and using the packed forest (SoE). This is a sentence from NIST MT 03, and the training corpus is the LDC corpus.
For the s2s case, no syntactic information is utilized, and therefore the output of the system is not a grammatical Chinese sentence. In the output, the attributive modifying "Czech border region" is a complete sentence, which is not allowed in Chinese.
For the case of using the 1-best constituent tree, the output is a grammatical Chinese sentence. However, the phrase "adjacent to neighboring Slovakia" is completely ignored in the translation result. After analyzing the constituent tree, we found that this phrase was incorrectly parsed as an "adverb phrase", so that the NMT system paid little attention to it, because of the low confidence given by the parser.
In contrast, for the case of the packed forest, we can see this phrase was not ignored and was translated correctly. Actually, besides "adverb phrase", this phrase was also correctly parsed as an "adjective phrase", and covered by multiple different nodes in the forest, making it difficult for the encoder to ignore the phrase.
We also noticed that our method performed better at learning attention. For the example in Figure 4, we observed that for the s2s model, the decoder paid attention to the word "Czech" twice, which caused the output sentence to contain the Chinese translation of "Czech" twice. On the other hand, for our forest model, by using the syntax information, the decoder paid attention to the phrase "In the Czech Republic" only once, so that the decoder generated the correct output.

Related work
Incorporating syntactic information into NMT systems is attracting widespread attention nowadays. Compared with conventional string-to-string NMT systems, tree-based systems demonstrate a better performance with the help of constituent trees or dependency trees.
The first noteworthy study is Eriguchi et al. (2016), which used a Tree-structured LSTM (Tai et al., 2015) to encode the HPSG syntax tree of the source-side sentence in a bottom-up manner. Then, Chen et al. (2017) enhanced the encoder with a top-down tree encoder.
As a simple extension of Eriguchi et al. (2016), very recently, Zaremoodi and Haffari (2017) proposed a forest-based NMT method by representing the packed forest with a forest-structured neural network. However, their method was evaluated in small-scale MT settings (each training dataset consists of under 10k parallel sentences). In contrast, our proposed method is effective in a large-scale MT setting, and we present qualitative analysis regarding the effectiveness of using forests in NMT.
Although these methods obtained good results, the tree-structured network used by the encoder made training and decoding relatively slow, which restricts the scope of application.
Other attempts at encoding syntactic trees have also been proposed. Eriguchi et al. (2017) combined the Recurrent Neural Network Grammar (Dyer et al., 2016) with NMT systems, while other work linearized the constituent tree and encoded it using RNNs. The training of these methods is fast, because of the linear structures of RNNs. However, all these syntax-based NMT systems used only the 1-best parsing tree, making the systems sensitive to parsing errors.
Instead of using trees to represent syntactic information, some studies use other data structures to represent the latent syntax of the input sentence. For example, Hashimoto and Tsuruoka (2017) proposed translating using a latent graph. However, such systems do not enjoy the benefit of handcrafted syntactic knowledge, because they do not use a parser trained from a large treebank with human annotations.
Compared with these related studies, our framework utilizes a linearized packed forest, meaning the encoder can encode exponentially many trees in an efficient manner. The experimental results demonstrated these advantages.

Conclusion and future work
We proposed a new NMT framework, which encodes a packed forest for the source sentence using linear-structured neural networks, such as RNNs. Compared with conventional string-to-string NMT systems and tree-to-string NMT systems, our framework can utilize exponentially many linearized parsing trees during encoding, without significantly decreasing the efficiency. This represents the first attempt at using a forest under the string-to-string NMT framework. The experimental results demonstrate the effectiveness of our framework.
As future work, we plan to design some more elaborate structures to incorporate the score layer in the encoder. Further improvement in the translation performance is expected to be achieved for the forest-based NMT system. We will also apply the proposed linearization method to other tasks.