Towards Bidirectional Hierarchical Representations for Attention-based Neural Machine Translation

This paper proposes a hierarchical attentional neural translation model which focuses on enhancing source-side hierarchical representations by covering both local and global semantic information using a bidirectional tree-based encoder. To maximize the predictive likelihood of target words, a weighted variant of an attention mechanism is used to balance the attentive information between lexical and phrase vectors. Using a tree-based rare word encoding, the proposed model is extended to sub-word level to alleviate the out-of-vocabulary (OOV) problem. Empirical results reveal that the proposed model significantly outperforms sequence-to-sequence attention-based and tree-based neural translation models in English-Chinese translation tasks.


Introduction
Neural machine translation (NMT) automatically learns the abstract features of, and the semantic relationship between, the source and target sentences, and has recently given state-of-the-art results for various translation tasks (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). The most widely used model is the encoder-decoder framework (Sutskever et al., 2014), in which the source sentence is encoded into a dense representation, followed by a decoding process which generates the target translation. By exploiting the attention mechanism (Bahdanau et al., 2015), the generation of target words is conditional on the source hidden states, rather than on the context vector alone. From a model architecture perspective, prior studies of the attentive encoder-decoder translation model are mainly divided into two types.
The sequence-to-sequence model treats a sentence as a sequence of tokens. The most fundamental approaches transform the source sentence sequentially into a fixed-length context vector, and the annotation vector of each word summarizes the preceding words (Sutskever et al., 2014;Cho et al., 2014b). Although Bahdanau et al. (2015) used a bidirectional recurrent neural network (RNN) (Schuster and Paliwal, 1997) to consider preceding and following words jointly, these sequential representations are insufficient to fully capture the semantics of a sentence, due to the fact that they do not account for the syntactic interpretations of sentence structure (Eriguchi et al., 2016;Tai et al., 2015). By incorporating additional features into a sequential model,  and Stahlberg et al. (2016) suggest that a greater amount of linguistic information can improve the translation performance.
The tree-to-sequence model encodes a source sentence according to a given syntactic tree over the sentence. The existing tree-based encoders (Tai et al., 2015; Eriguchi et al., 2016; Zhou et al., 2016) recursively generate phrase (sentence) representations in a bottom-up fashion, whereby the annotation vector of each phrase is derived from its constituent sub-phrases. As a result, the learned representations are limited to local information and fail to capture the global meaning of a sentence. As illustrated in Figure 1, the phrases "take up" and "a position" have different meanings in different contexts. However, in composing the representations $h_{VP_3}$ and $h_{NP_7}$ for phrases $VP_3$ and $NP_7$, the current approaches do not account for these differences in meaning, because they ignore both the neighboring context and the remote context, i.e. $h_{NP_7} \leftarrow h_{PP_8}$ (sibling) and $h_{VP_3} \leftarrow h_{NP_7}$ (child of sibling). More specifically, at encoding step $t$, the generated phrase is based on the results of the previous time steps $h_{t-1}$ and $h_{t-2}$, but has no information about the parent phrases $h_{t'}$ for $t' > t$.
To address the above problems, we propose a novel architecture, a bidirectional hierarchical encoder, which extends the existing attentive tree-structured models (Eriguchi et al., 2016). In contrast to the model of Eriguchi et al. (2016), we first use a bidirectional RNN (Schuster and Paliwal, 1997) at the lexical level to concatenate the forward and backward states as the hidden states of source words, to capture the preceding and following contexts (described in Section 3.1). Secondly, we propose a bidirectional tree-based encoder (described in Section 3.2), in which the original bottom-up encoding model is extended using an additional top-down encoding process. In the bidirectional hierarchical model, the vector representations of the sentence, phrases and words are therefore based on the global context rather than local information.
To effectively leverage hierarchical representations in generating the target words, we adopt a weighted variant of the tree-based attention mechanism (described in Section 3.4) in which a time-dependent gating scalar is used to control the proportion of conditional information between the word and phrase vectors. To alleviate the out-of-vocabulary (OOV) problem, we further extend the proposed tree-based model to the sub-word level by integrating byte-pair encoding (BPE) into the tree-based model (as described in Section 3.3). Experimental results for the NIST English-to-Chinese translation task reveal that the proposed model significantly outperforms the vanilla tree-based (Eriguchi et al., 2016) and sequential NMT models (Bahdanau et al., 2015) (Section 4.1).

Tree-Based Neural Machine Translation
A neural machine translation (NMT) system aims to use a single neural network to build a translation model, which is trained to maximize the conditional probability of sentence pairs using a parallel training corpus (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014b,a). By incorporating syntactic information, the tree-based NMT exploits an additional syntactic structure of the source sentence to improve the translation. Since most existing NMTs generate one target word at a time, given a source sentence $x = (x_1, ..., x_N)$ and its corresponding syntactic tree $tr$, the conditional probability of a target sentence $y = (y_1, ..., y_M)$ is formally expressed as:
$$p(y \mid x, tr; \theta) = \prod_{j=1}^{M} p(y_j \mid y_{<j}, x, tr; \theta)$$
where $\theta$ represents the model parameters and $y_{<j}$ denotes the previously generated target words. A tree-based NMT consists of a tree-based encoder and a decoder.

Tree-Based Encoder
In a tree-based encoder, the source sentence $x$ is encoded according to a given syntactic structure $tr$ of the sentence. As shown in Figure 2, Eriguchi et al. (2016) employed a forward Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) recurrent neural network (RNN) to encode the lexical nodes and a tree-LSTM (Tai et al., 2015) to generate the phrase representations in a bottom-up fashion. In the present study, we utilize the gated recurrent unit (GRU) (Cho et al., 2014b) instead of an LSTM, in view of its comparable performance (Chung et al., 2014) and since it yields even better results for certain tasks (Józefowicz et al., 2015). The lexical annotation vectors $(h^l_1, ..., h^l_N)$ are sequentially generated by using a GRU. The $i$-th leaf node vector is calculated as:
$$h^l_i = f_{GRU}(x_i, h^l_{i-1}) \quad (1)$$
where $x_i$ is the $i$-th source word embedding and $h^l_{i-1}$ denotes the previous hidden state. The parent hidden state $h^\uparrow_{i,j}$ summarizes its left child $h^\uparrow_{i,k}$ and right child $h^\uparrow_{k+1,j}$ ($i < k < j$) by applying the tree-GRU (Zhou et al., 2016), in which $z^\uparrow_{i,j}$ is the update gate; $r^\uparrow_{i,k}$ and $r^\uparrow_{k+1,j}$ are the reset gates for the left and right children; $\tilde{h}^\uparrow_{i,j}$ denotes the candidate activation; $U^{L}_{(\cdot)}$ and $U^{R}_{(\cdot)}$ represent weight matrices; $b^\uparrow_{(\cdot)}$ denote bias vectors; $\sigma$ is the logistic sigmoid function; and the operator $\odot$ denotes element-wise multiplication between vectors. The phrase representations are recursively built in an upward direction.
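
As a concrete illustration of the bottom-up composition, the following NumPy sketch computes a parent state from two child states using an update gate, two reset gates and a candidate activation, as defined above. It is a minimal sketch rather than the implementation used in the experiments: the parameter layout (`U_L`, `U_R`, `b`), the helper names, the random initialization and, in particular, the final interpolation between the candidate activation and the sum of the two child states are assumptions made for illustration.

```python
import numpy as np

GATES = ("z", "r_left", "r_right", "h")


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_tree_gru(dim, seed=0):
    """Random parameters for one tree-GRU cell (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    return {
        "U_L": {g: rng.normal(0.0, 0.1, (dim, dim)) for g in GATES},
        "U_R": {g: rng.normal(0.0, 0.1, (dim, dim)) for g in GATES},
        "b": {g: np.zeros(dim) for g in GATES},
    }


def tree_gru(p, h_left, h_right):
    """Compose a parent state from its left and right child states."""
    def affine(gate, left, right):
        return p["U_L"][gate] @ left + p["U_R"][gate] @ right + p["b"][gate]

    z = sigmoid(affine("z", h_left, h_right))              # update gate
    r_left = sigmoid(affine("r_left", h_left, h_right))    # reset gate, left child
    r_right = sigmoid(affine("r_right", h_left, h_right))  # reset gate, right child
    cand = np.tanh(affine("h", r_left * h_left, r_right * h_right))
    # Assumed combination: interpolate between the candidate and the children.
    return z * cand + (1.0 - z) * (h_left + h_right)


def encode_bottom_up(p, tree, leaf_states):
    """`tree` is a leaf index (int) or a pair (left_subtree, right_subtree)."""
    if isinstance(tree, int):
        return leaf_states[tree]
    left, right = tree
    return tree_gru(p,
                    encode_bottom_up(p, left, leaf_states),
                    encode_bottom_up(p, right, leaf_states))


params = init_tree_gru(dim=4)
leaf_states = [np.full(4, 0.1 * (i + 1)) for i in range(3)]
root = encode_bottom_up(params, ((0, 1), 2), leaf_states)  # tree ((x1, x2), x3)
print(root.shape)  # (4,)
```

The recursion mirrors the upward construction of phrase representations: each call returns the state of the sub-tree root, so the outermost call returns the representation of the whole sentence.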

Decoding with a Tree-Based Attention Mechanism
In generating the target words, we employ a sequential decoder with an input-feeding method (Luong et al., 2015) and an attention mechanism (Bahdanau et al., 2015). The conditional probability of the $j$-th target word $y_j$ is calculated using a non-linear function $f_{softmax}$:
$$p(y_j \mid y_{<j}, x, tr) = f_{softmax}(c_j)$$
where $c_j$ is the composite hidden state, which is computed from a target hidden state $s_j$ and a context vector $d_j$. Given the previous target word $y_{j-1}$ and the concatenation of the previous hidden state $s_{j-1}$ and the previous composite hidden state $c_{j-1}$ (input-feeding) (Luong et al., 2015), $s_j$ is calculated using a standard sequential GRU network. The context vector $d_j$ is computed using an attention model, which softly summarizes the attended part of the source-side representations. Eriguchi et al. (2016) adopted a tree-based attention mechanism to consider both the word and phrase vectors:
$$d_j = \sum_{i=1}^{N} \alpha_j(i)\, h^l_i + \sum_{k=1}^{N-1} \alpha_j(k)\, h^p_k \quad (2)$$
where $h^l_i$ is the $i$-th hidden state of the source word at leaf level, and $h^p_k$ is the $k$-th hidden state of the source phrase (a binary tree over $N$ words has $N-1$ phrase nodes). The weight $\alpha_j(t)$ of node $t$ is computed from the hidden state $h_t$ of the node and the target hidden state, with $V_a$, $U_a$, $W_a$ and $b_a$ as the model parameters.

Figure 3: A top-down encoding process updates the hidden states recursively from root to leaf nodes. The red and blue lines denote the use of different learning parameters.
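
The tree-based attention described in this section can be sketched as follows. This is an illustrative sketch only: the additive scoring form, the assignment of `U_a` to the target state and `W_a` to the source node, and the softmax normalization over all leaf and phrase nodes are assumptions in the style of Bahdanau et al. (2015), since only the parameter names are given in the text.

```python
import numpy as np


def attention_weights(s_j, nodes, V_a, U_a, W_a, b_a):
    """Softmax-normalized weights over all source nodes (leaves and phrases)."""
    scores = np.array(
        [V_a @ np.tanh(U_a @ s_j + W_a @ h_t + b_a) for h_t in nodes]
    )
    scores -= scores.max()  # subtract the maximum for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def tree_context(s_j, leaf_states, phrase_states, V_a, U_a, W_a, b_a):
    """Context vector d_j: attention-weighted sum of leaf and phrase states."""
    nodes = list(leaf_states) + list(phrase_states)
    alpha = attention_weights(s_j, nodes, V_a, U_a, W_a, b_a)
    d_j = sum(a * h for a, h in zip(alpha, nodes))
    return d_j, alpha


rng = np.random.default_rng(0)
dim, att_dim = 4, 3
leaf_states = [rng.normal(size=dim) for _ in range(3)]    # N = 3 words
phrase_states = [rng.normal(size=dim) for _ in range(2)]  # N - 1 phrase nodes
s_j = rng.normal(size=dim)
V_a, b_a = rng.normal(size=att_dim), np.zeros(att_dim)
U_a = rng.normal(size=(att_dim, dim))
W_a = rng.normal(size=(att_dim, dim))
d_j, alpha = tree_context(s_j, leaf_states, phrase_states, V_a, U_a, W_a, b_a)
```

Because a single softmax covers both node types, the lexical and phrase weights compete for the same probability mass.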

Bidirectional Leaf-Node Encoding
As discussed in Section 1, a unidirectional recurrent neural network reads an input sequence in order, from the first symbol to the last. In order to generate leaf node annotation vectors which jointly take into account both preceding and following annotations, we exploit a bidirectional RNN encoder (Bahdanau et al., 2015). The hidden state of the $i$-th leaf node $h^l_i$ is the concatenation of the forward and backward vectors:
$$h^l_i = [\overrightarrow{h}^l_i ; \overleftarrow{h}^l_i]$$
where $\overrightarrow{h}^l_i$ is obtained by a rightward GRU, as shown in Equation 1, and a leftward GRU calculates $\overleftarrow{h}^l_i$, as follows:
$$\overleftarrow{h}^l_i = \overleftarrow{f}_{GRU}(x_i, \overleftarrow{h}^l_{i+1})$$
where $\overleftarrow{h}^l_{i+1}$ is the previous hidden state of the leftward pass, i.e. that of the following word.
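
A minimal NumPy sketch of the bidirectional leaf-node encoding is given below. The GRU parameterization (`W`, `U`, `b` per gate), the orientation of the update gate, the zero initial states and the helper names are assumptions for illustration. The dimensions follow the experimental settings reported later in the paper (620-dimensional embeddings, 500 units per direction).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(in_dim, hid_dim, rng):
    """Random GRU parameters: one (W, U, b) triple per gate (hypothetical layout)."""
    return {gate: (rng.normal(0.0, 0.1, (hid_dim, in_dim)),
                   rng.normal(0.0, 0.1, (hid_dim, hid_dim)),
                   np.zeros(hid_dim))
            for gate in ("z", "r", "h")}


def gru_step(p, x, h_prev):
    """One GRU step; the (1 - z) / z orientation is an assumed convention."""
    W_z, U_z, b_z = p["z"]
    W_r, U_r, b_r = p["r"]
    W_h, U_h, b_h = p["h"]
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)            # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)            # reset gate
    cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)   # candidate activation
    return (1.0 - z) * h_prev + z * cand


def encode_leaves(embeddings, fwd, bwd, hid_dim):
    """Concatenate rightward and leftward GRU states for every word."""
    n = len(embeddings)
    forward, backward = [None] * n, [None] * n
    h = np.zeros(hid_dim)
    for i in range(n):                # rightward pass: depends on words 1..i
        h = gru_step(fwd, embeddings[i], h)
        forward[i] = h
    h = np.zeros(hid_dim)
    for i in reversed(range(n)):      # leftward pass: depends on words i..N
        h = gru_step(bwd, embeddings[i], h)
        backward[i] = h
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]


rng = np.random.default_rng(0)
fwd, bwd = init_gru(620, 500, rng), init_gru(620, 500, rng)
embeddings = [rng.normal(size=620) for _ in range(5)]
states = encode_leaves(embeddings, fwd, bwd, 500)
print(states[0].shape)  # (1000,)
```

Each concatenated state has 1,000 dimensions, matching the size of the other hidden states.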

Bidirectional Tree-Node Encoding
Since the hidden states of leaf nodes are derived in a sequential, context-sensitive way, generating phrase annotations in a bottom-up fashion propagates the sequential context to the tree nodes. However, the learned annotation vectors still fail to capture global information from the upper nodes. To enhance the representations with global semantic information, we propose to use a standard GRU recurrent network to update representations in a top-down fashion, as shown in Figure 3. The annotation vectors, which are learned by the previous encoding steps, are fed to the updating process. First, we treat the bottom-up hidden state of the root $h^\uparrow_{root}$, which covers the global meaning as well as the syntactic information of the source sentence, as the initial state of the top-down GRU network:
$$h^\downarrow_{root} = h^\uparrow_{root}$$
Given an updated hidden state of the parent node $h^\downarrow_{i,j}$, the hidden states of the left and right children $h^\downarrow_{i,k}$ and $h^\downarrow_{k+1,j}$ are calculated as:
$$h^\downarrow_{i,k} = f^{ld}_{GRU}(h^\uparrow_{i,k}, h^\downarrow_{i,j})$$
$$h^\downarrow_{k+1,j} = f^{rd}_{GRU}(h^\uparrow_{k+1,j}, h^\downarrow_{i,j})$$
where $h^\uparrow_{i,k}$ and $h^\uparrow_{k+1,j}$ are the left and right child annotation vectors generated via the bottom-up tree-GRU network. In contrast to the similar top-down encoding for sentiment classification (Kokkinos and Potamianos, 2017), which uses the same weighting parameters to handle both left and right child nodes, $f^{ld}_{GRU}$ and $f^{rd}_{GRU}$ with different parameters are applied in the proposed model to distinguish the left and right structural information. According to the definition of a GRU (Cho et al., 2014b), $f^{ld}_{GRU}$ uses an update gate $z^\downarrow_{i,k}$, a reset gate $r^\downarrow_{i,k}$ and a candidate activation $\tilde{h}^\downarrow_{i,k}$ to generate $h^\downarrow_{i,k}$, where $W^{ld}_{(\cdot)}$ and $U^{ld}_{(\cdot)}$ represent weight matrices, and $b^{ld}_{(\cdot)}$ denote bias vectors. $f^{rd}_{GRU}$ is defined in a similar way.
From a linguistic point of view, in the top-down GRU network, the reset gate is able to retain the useful global information and drop irrelevant information from the parent state $h^\downarrow_{i,j}$, while the proportions of the global context from the top-down state $h^\downarrow_{i,j}$ and the local context from the bottom-up state $h^\uparrow_{i,k}$ are controlled by the update gate. As it covers both the partial meaning of the phrase and the whole meaning of the sentence, $h^\downarrow_{i,k}$ is regarded as the final representation of node $(i,k)$. With the propagation of information from root to leaf nodes, the representation of the $i$-th leaf node is likewise updated by the top-down GRU. As each source-side hidden state of the leaf nodes and tree nodes carries the hierarchical information of the sentence, we interpret such an encoded state as a hierarchical representation.
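
The top-down update can be sketched as follows, under stated assumptions: each child takes its own bottom-up state as the GRU input and the top-down state of its parent as the previous hidden state, with separate parameters for left and right children. The `Node` class, the parameter layout and the orientation of the update gate are hypothetical choices made for this illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(dim, rng):
    """Random GRU parameters: one (W, U, b) triple per gate (hypothetical layout)."""
    return {gate: (rng.normal(0.0, 0.1, (dim, dim)),
                   rng.normal(0.0, 0.1, (dim, dim)),
                   np.zeros(dim))
            for gate in ("z", "r", "h")}


def gru_step(p, x, h_prev):
    """x: bottom-up state of the child; h_prev: top-down state of the parent."""
    W_z, U_z, b_z = p["z"]
    W_r, U_r, b_r = p["r"]
    W_h, U_h, b_h = p["h"]
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)            # mixes global and local
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)            # filters the parent state
    cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)
    return (1.0 - z) * h_prev + z * cand                 # assumed convention


class Node:
    def __init__(self, up, left=None, right=None):
        self.up = up        # bottom-up annotation vector
        self.down = None    # top-down state, used as the final representation
        self.left = left
        self.right = right


def encode_top_down(root, gru_left, gru_right):
    """Propagate information from the root down to every leaf."""
    root.down = root.up                      # initial state of the top-down GRU
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, gru in ((parent.left, gru_left), (parent.right, gru_right)):
            if child is not None:
                child.down = gru_step(gru, child.up, parent.down)
                stack.append(child)
    return root


rng = np.random.default_rng(0)
dim = 4
leaves = [Node(rng.normal(size=dim)) for _ in range(3)]
inner = Node(rng.normal(size=dim), leaves[0], leaves[1])
root = Node(rng.normal(size=dim), inner, leaves[2])
gru_left, gru_right = init_gru(dim, rng), init_gru(dim, rng)
encode_top_down(root, gru_left, gru_right)
```

After the call, every node, including each leaf, holds a `down` state that depends on the root and therefore on the whole sentence.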

Handling Out-of-Vocabulary: Tree-Based Rare Word Encoding
In NMT, the translation of rare words and unknown words is an open problem, since the computational cost increases with the size of the vocabulary. A simple and effective approach to handling out-of-vocabulary words is to represent rare words as sequences of sub-word units, which are segmented using byte-pair encoding (BPE) (Gage, 1994). We propose a variant, tree-based rare word encoding, which extends the tree-based model to the sub-word level. Sub-word units are encoded following an additional binary lexical tree. For a sentence $x = (x_1, ..., x_i, ..., x_N)$, BPE segments the word $x_i$ into a sequence of sub-word units $(x^1_i, ..., x^n_i)$. The binary lexical tree is simply built by composing two nodes at a time in a rightwards fashion, $(((x^1_i, x^2_i), x^3_i), \ldots, x^n_i)$, as shown in Figure 4. From the $i$-th leaf node, the original syntactic tree is extended downwards using the binary lexical tree, and the set of leaf nodes becomes $x = (x_1, ..., x^1_i, x^2_i, ..., x^n_i, ..., x_N)$. Sub-word units can therefore be regarded as leaf nodes, and can be encoded using the proposed encoder, as illustrated in Figure 5. The experimental results in Section 4.1 demonstrate the effectiveness of this simple approach.
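
The construction of the binary lexical tree can be illustrated with a short Python sketch. The nested-tuple tree representation, the function names and the `segmentation` dictionary (mapping a word to its BPE units) are hypothetical and are used only for illustration. The segmentation "hi/k/ed" is taken from the qualitative analysis in this paper.

```python
from functools import reduce


def lexical_tree(subwords):
    """Build the binary tree (((x1, x2), x3), ..., xn) over sub-word units."""
    if not subwords:
        raise ValueError("a word must have at least one sub-word unit")
    return reduce(lambda left, right: (left, right), subwords)


def extend_tree(tree, segmentation):
    """Replace each word leaf of a syntactic tree by its binary lexical tree."""
    if isinstance(tree, tuple):
        left, right = tree
        return (extend_tree(left, segmentation), extend_tree(right, segmentation))
    return lexical_tree(segmentation.get(tree, [tree]))


def leaves(tree):
    """Leaf units of a nested-tuple tree, from left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


segmentation = {"hiked": ["hi", "k", "ed"]}
syntactic_tree = ("prices", "hiked")
extended = extend_tree(syntactic_tree, segmentation)
print(extended)          # ('prices', (('hi', 'k'), 'ed'))
print(leaves(extended))  # ['prices', 'hi', 'k', 'ed']
```

Words that are not segmented remain single leaves, so the extended tree stays binary and can be passed to the same encoder.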

Decoder with Weighted Variant of Attention Mechanism
Since each representation carries both local and global information, attending equally to the lexical and phrase representations in each decoding step may cause the problem of over-translation (repeatedly attending to and translating the same constituent of a sentence). An alternative approach is to balance the attentive information between the lexical and phrase vectors in the context vector. To effectively leverage these hierarchical representations, we propose a weighted variant of the tree-based attention mechanism (the original is defined in Equation 2). Formally, the calculation of the context vector $d_j$ at step $j$ is modified as:
$$d_j = (1 - \beta_j) \sum_{i=1}^{N} \alpha_j(i)\, h^l_i + \beta_j \sum_{k=1}^{N-1} \alpha_j(k)\, h^p_k \quad (4)$$
where $\beta_j \in [0, 1]$ is used to weight the expected importance of the representations. Inspired by work on multi-modal NMT (Calixto et al., 2017), which exploits a gating scalar (Xu et al., 2015) to weight the image context vector, we use such a scalar in our model in order to dynamically adapt the weighting. The gating scalar $\beta_j$ at step $j$ is calculated from the target composite hidden state, where $W_\beta$ and $b_\beta$ represent the model parameters. In contrast with $\alpha$, which denotes the correspondence between each source annotation and the current target hidden state, $\beta$ is determined by the target composite hidden state alone. In other words, $\beta$ is a time-dependent scalar related to the current target word, and therefore enables the attention model to explicitly quantify how far the leaf and non-leaf states contribute to the word prediction at each time step. In the proposed model, the phrase and lexical context vectors are learned by a single attention model, meaning that they are dependent, and the gating scalar weights the phrase and lexical context vectors in a complementary fashion, as shown in Equation 4. This distinguishes the model from that introduced by Calixto et al. (2017), in which the context vectors of the source sentence and image (bi-modal) are measured using two independent attention models and the gating scalar is merely used to weight the image context vector.
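
The weighted variant can be sketched in a few lines of NumPy. The complementary weighting, with the gating scalar on the phrase context and one minus the gating scalar on the lexical context, follows the description above. The sigmoid form of the gate, the use of the previous composite state as its input, and all function and variable names are assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gating_scalar(c, W_beta, b_beta):
    """Scalar gate in (0, 1) computed from a target composite state (assumed form)."""
    return float(sigmoid(W_beta @ c + b_beta))


def weighted_context(alpha_leaf, leaf_states, alpha_phrase, phrase_states, beta):
    """Complementary weighting of the lexical and phrase context vectors."""
    lexical = sum(a * h for a, h in zip(alpha_leaf, leaf_states))
    phrase = sum(a * h for a, h in zip(alpha_phrase, phrase_states))
    return (1.0 - beta) * lexical + beta * phrase


rng = np.random.default_rng(0)
dim = 4
leaf_states = [rng.normal(size=dim) for _ in range(3)]
phrase_states = [rng.normal(size=dim) for _ in range(2)]
alpha_leaf = np.array([0.30, 0.20, 0.10])   # weights from a single attention
alpha_phrase = np.array([0.25, 0.15])       # model; all five sum to one
c_prev = rng.normal(size=dim)               # hypothetical composite state
beta = gating_scalar(c_prev, rng.normal(size=dim), 0.0)
d_j = weighted_context(alpha_leaf, leaf_states, alpha_phrase, phrase_states, beta)
```

Fixing `beta` to 0.0 or 1.0 makes the context vector depend on the leaf states only or on the phrase states only, while 0.5 weights both equally.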

Data
Table 1: Number of sentences in each dataset.

Split      Dataset      Sentences
Training   LDC En-Zh    1,435,575
Dev        mt08         1,357
Test       mt04         1,788
Test       mt05         1,082
Test       mt06         1,664

We evaluate the proposed model on an English-to-Chinese translation task. For reasons of computational efficiency, we extracted 1.4M sentence pairs, in which the maximum length of the sentence was 40, from the LDC parallel corpus as our training data. The models were developed using NIST mt08 data and were examined using NIST mt04, mt05, and mt06 data. The number of sentences in each dataset is shown in Table 1. On the English side, we used the constituent parser (Zeng et al., 2014, 2015) to produce a binary syntactic tree for each sentence, in contrast to the use of the HPSG parser by Eriguchi et al. (2016). On the Chinese side, the sentences are segmented using the Chinese word segmentation toolkit of NiuTrans (Xiao et al., 2012).
To avoid data sparsity, words referring to time, date and number, which are low in frequency, are generalized as '$time', '$date' and '$number'. In addition, as described in Section 3.3, the vocabularies are further compressed by segmenting the rare words into sub-word units using BPE.

Experimental Settings
As shown in Table 2, which gives the statistics of the token types, we limit the source and target vocabulary size to 40,000, in order to cover all the English and Chinese tokens. The dimensions of the word embedding and hidden layer are set to 620 and 1,000, respectively. Due to the concatenation in the bidirectional leaf-node encoding, the dimensions of the forward and backward vectors, which are half of those of the other hidden states, are set to 500. In order to prevent over-fitting, the training data is shuffled following each epoch. Moreover, the model parameters are optimized using AdaDelta (Zeiler, 2012), due to its capability for dynamically adapting the learning rate. We set the mini-batch size to 16 and the beam search size to 5. The accuracy of the translation relative to a reference is assessed using the BLEU metric (Papineni et al., 2002). In order to give an equitable comparison, all the NMT models used for comparison are implemented or re-implemented using GRU in our code, based on dl4mt.

Enhanced Hierarchical Representations
Firstly, the effectiveness of the enhanced hierarchical representations is evaluated through a set of experiments, the results of which are summarized in Table 3. Compared with the original tree-based encoder (Eriguchi et al., 2016), the model with bidirectional leaf-node encoding (described in Section 3.1) shows better performance. This also reveals that the future context at leaf level can contribute to word prediction. Secondly, although the representations of leaf nodes are learned in a sequential, context-sensitive way, the translation quality is further improved by considering the global semantic information in the top-down encoding (Section 3.2).
By incorporating the above enhancements into the model, the proposed hierarchical encoder yields significant improvements over both the sequential and the tree-based models. The problem of OOV is alleviated by further extending the tree-based model to sub-word level (Section 3.3). In addition, we evaluate our tree-based rare word encoding method against the conventional rare word encoding using the sequential encoder (Bahdanau et al., 2015). The empirical results confirm that our proposed tree-based BPE method achieves performance comparable to that of the standard BPE in the sequential model, but is applicable to the tree-based NMT model. Overall, the proposed hierarchical encoder has demonstrated the ability to effectively model source-side representations from both the sequential and structural context. The NMT systems based on the proposed model significantly outperform those of conventional models using the sequential encoder and the tree-based encoder.

Table 3: Translation results for the various models. The first column shows the models; the second column indicates whether the corresponding experiment uses BPE data. The number of parameters (M = millions) in each model is given in the third column. The remaining columns are the translation accuracies for the test sets and development set, evaluated using BLEU scores (%). "↑ / ⇑" indicates that the hierarchical encoder is significantly better than the vanilla tree-based encoder (p < 0.05 / p < 0.01).

Weighted Attention Model
As discussed in Section 3.4, in order to effectively leverage hierarchical representations in generating the target word, we adopt a weighted variant of the tree-based attention mechanism which incorporates a scalar to control the proportion of conditional information between the word and phrase vectors. By manually or automatically varying the weight β, the utilization of the weighted attention model is assessed for four cases:
• β = 0.0: We manually set the weight of phrase vectors to 0.0; in other words, the decoder is forced to ignore the phrase vectors. The final translation is therefore generated by merely summarizing the leaf vectors.
• β = 0.5: The representations of non-leaf nodes and leaf nodes participate equally in the translation process. The decoder of this case therefore employs the same attention mechanism as that of the original model (Section 2.2).
• β = 1.0: In the reverse of the first case, the weight of the leaf nodes is manually set to 0.0. Thus, only the phrase vectors are used to predict the target words.
• Gating scalar (GS): A gating scalar is used for dynamically learning to control the proportion in which the lexical and phrase contexts contribute to the generation of the target words (Section 3.4).
The experimental results are shown in Table 4. The model which attends only to lexical annotation vectors (β = 0.0) gives slightly better performance than that which uses equal weights for lexical and phrase vectors (β = 0.5). The use of global information contributes to distinguishing the differences between word meanings, although the similar semantic information in the lexical and phrase representations aggravates the over-translation problem observed in the translation results. However, we found that the model which attends only to phrase representations tends to generate shorter translations, with an average length of 21.13 words, as shown in the last column of the first row of Table 4. Furthermore, the model that neglects the leaf representations (β = 1.0) is likely to underperform the others, which are also conditioned on the leaf nodes. Even though the phrase representations are derived from the lexical level via a bottom-up encoding, we believe that they are unable to fully capture the lexical information of the source sentence. Through the use of the gating scalar, the hierarchical model achieves progressive improvements, as shown in Tables 3 and 4, and the problem of over-translation is also alleviated. The representations of non-leaf nodes can be regarded as supplements in the translation process.

Figure 6: Translations of an English sentence output using the NMT models with bidirectional hierarchical model (our), sequential encoder (seq-enc) and original tree-based encoder (tr-enc). Ref indicates the reference Chinese sentence. The attention scores (α), which are noted over the source-side syntactic tree, are output by the bidirectional hierarchical model at the step where the fourth target word "在" is translated. The sequence of scores β denotes the value of the gating scalar at each translation step.

Qualitative Analysis
Figure 6 shows an English sentence and its binary tree representation, together with the corresponding Chinese translations produced by the different NMT models. All the models successfully give the correct Chinese translation "该 组 织 不 会" for the first three words of the English sentence, "the organization wouldn't". Differences appear in the translation of the fourth word, and these lead to markedly different meanings. The translation "使用 其 成员国 以外 的 武装力量" output by the sequential model means "use the armed forces other than its member states", where "other than its member states" is incorrectly interpreted as a complement to "armed forces". This is caused by an intrinsic limitation of the sequential model: it is unable to properly interpret the syntactic relationships between words. By explicitly incorporating the syntactic information, both the proposed hierarchical model and the tree-based model can accurately attend to the dashed section of Figure 6, and the translations can be correctly generated to reflect the meaning of the source sentence. The translations produced by the original tree-based model and our hierarchical model differ in the interpretation of the words "areas outside". The tree-based model renders them as "境 外 (outside)", while our model correctly translates them as "以外 的 地区 (areas outside)". We believe that, with the help of global and local contextual information, our model is able to capture both short-range and long-range dependencies.
We conducted an in-depth analysis of the BPE-segmented units of rare words. It was observed that the sub-word units could be categorized into three groups. The first group of units involves the phonetic Romanization (Pinyin) of Chinese. In translation, these are simply transliterated into the corresponding Chinese characters. As shown in the second row of Table 5, "Liu/jing/min" is a person's name. The segmented units are the phonetic representations. Both models can successfully transliterate this into the Chinese equivalent, "刘/敬/民". The second group of sub-word units are likely to represent word morphemes. The words are segmented into sub-word units which are, to some extent, close to linguistic word stems and suffixes. For example, the word "adventurer" is segmented into "adventur/er", which is correctly translated as "探险/家" and "探险/者" by the hierarchical and sequential models, respectively. The third group of sub-word units offers no linguistic interpretation. Because the BPE algorithm identifies sub-word units merely on the basis of their frequency in the training data, not all units are well-formed linguistic morphemes. However, an interesting finding arises regarding the translation of these segmented units. In the sequential model, the word is incorrectly translated; however, it can be correctly translated by the hierarchical model. Taking "hi/k/ed" as an example, the sequential model gives an incorrect translation, "发生 (happened)", while the hierarchical model translates it as "上升 (rise)", which is a synonym of "hiked". This result indicates that in our hierarchical model, the parent node of the hierarchical representation for the sub-word units "hi/k/ed" is better able to capture the meaning of the word as a whole; this cannot be captured independently by the sequential model.

Table 5: Translation examples of sub-words, where '/' indicates a separation between sub-word units. The first two columns show the segmented words and their Chinese references. The last two columns report the translations given by the hierarchical and sequential models respectively.

Conclusion
In this paper, we propose an improved NMT system with a novel bidirectional hierarchical encoder, which enhances the source-side representations of a sentence, that is, both phrases and words, with local and global context information. By introducing a tree-based rare word encoding, the hierarchical model is extended to sub-word level in order to alleviate the problem of OOVs. To effectively leverage the enhanced hierarchical representations, we also propose a weighted variant of the attention model which dynamically adjusts the proportion of conditional information between the lexical and phrase annotation vectors. Experimental results for NIST English-Chinese translation tasks demonstrate that the proposed model significantly outperforms the vanilla tree-based and sequential NMT models.