Lattice-Based Transformer Encoder for Neural Machine Translation

Neural machine translation (NMT) takes deterministic sequences for source representations. However, at both the word level and the subword level, a source sequence can be segmented in multiple ways, depending on the word segmenter or the subword vocabulary size. We hypothesize that the diversity in segmentations may affect NMT performance. To integrate different segmentations with the state-of-the-art NMT model, Transformer, we propose lattice-based encoders to explore effective word or subword representations automatically during training. We propose two methods: 1) lattice positional encoding and 2) lattice-aware self-attention. These two methods can be used together and are complementary to each other, further improving translation performance. Experimental results show that lattice-based encoders outperform the conventional Transformer encoder with both word-level and subword-level representations.


Introduction
Neural machine translation (NMT) has achieved great progress with the evolution of model structures under an encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2014). Recently, the self-attention based Transformer model has achieved state-of-the-art performance on multiple language pairs (Vaswani et al., 2017; Marie et al., 2018). Representations of both source and target sentences in NMT can be factorized at the character (Costa-Jussa and Fonollosa, 2016), word (Sutskever et al., 2014), or subword (Sennrich et al., 2015) level. However, using only the 1-best segmentation as input limits the ability of NMT encoders to represent source sequences sufficiently and reliably. Many East Asian languages, including Chinese, are written without explicit word boundaries, so their sentences need to be segmented into words first (Zhao et al., 2019; Cai et al., 2017; Cai and Zhao, 2016; Zhao et al., 2013; Zhao and Kit, 2011). With different segmenters, each sentence can be segmented into multiple forms, as shown in Figure 1. Even for alphabetical languages with clear word boundaries, like English, there is still the issue of selecting a proper subword vocabulary size, which determines the segmentation granularity for word representation.
In order to handle this problem, Morishita et al. (2018) used hierarchical subword features to represent sequences with different subword granularities. Su et al. (2017) proposed the first word-lattice based recurrent neural network (RNN) encoders, which extended Gated Recurrent Units (GRUs) (Cho et al., 2014) to take in multiple sequence segmentation representations. Sperber et al. (2017) incorporated posterior scores into Tree-LSTM to build a lattice encoder for speech translation. All these existing methods serve RNN-based NMT models, where lattices can be formalized as directed graphs and the inherent directed structure of the RNN facilitates the construction of the lattice. Meanwhile, the self-attention mechanism is good at learning dependencies between characters in parallel, and can partially compare and learn information from multiple segmentations (Cherry et al., 2018). However, self-attention lacks the directed, sequential structure of RNNs, so it is challenging to apply the lattice structure to the Transformer directly.
In this work, we explore an efficient way of integrating lattice into Transformer.
Our method not only processes multiple sequences segmented in different ways to improve translation quality, but also preserves the parallel computation of the Transformer.

Transformer
Transformer stacks self-attention and point-wise, fully connected layers for both encoders and decoders. Decoder layers also have another sublayer which performs attention over the output of the encoder. Residual connections around each layer are employed, followed by layer normalization (Ba et al., 2016).
To make use of the order of the sequence, Vaswani et al. (2017) proposed positional encodings to indicate the absolute or relative position of tokens in the input sequence, which are calculated as:
$$p_{(j,2i)} = \sin\left(j / 10000^{2i/d}\right), \qquad p_{(j,2i+1)} = \cos\left(j / 10000^{2i/d}\right),$$
where $j$ is the position, $i$ is the dimension and $d$ is the model dimension. Then positional encodings $p_{1:M} = \{p_1, ..., p_M\}$ are added to the embedding of each token $t_{1:M} = \{t_1, ..., t_M\}$ and are propagated to higher layers via residual connections.
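The sinusoidal encoding above can be sketched in a few lines of plain Python. The function name `positional_encoding` and the choice of an even model dimension are assumptions made for illustration only.

```python
import math

def positional_encoding(j, d):
    """Return the d-dimensional sinusoidal encoding of position j (d must be even)."""
    pe = [0.0] * d
    for k in range(0, d, 2):
        # k plays the role of 2i, so the exponent 2i/d becomes k/d.
        angle = j / (10000 ** (k / d))
        pe[k] = math.sin(angle)      # even dimension: sine
        pe[k + 1] = math.cos(angle)  # odd dimension: cosine
    return pe

p0 = positional_encoding(0, 8)  # position 0: sines are 0, cosines are 1
p3 = positional_encoding(3, 8)
```

Each position receives a distinct vector, and the vector is simply added to the token embedding at that position.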

Self-Attention
Transformer employs $H$ attention heads to perform self-attention over a sequence individually and finally applies concatenation and a linear transformation to the results from each head, which is called multi-head attention (Vaswani et al., 2017). Every single head in multi-head attention is calculated in a scaled dot-product form:
$$u_{ij} = \frac{(t_i W^Q)(t_j W^K)^T}{\sqrt{d}}, \tag{1}$$
where $d$ is the model dimension, $t_{1:M}$ is the input sequence and $u_{ij}$ are normalized by a softmax function:
$$\alpha_{ij} = \frac{\exp(u_{ij})}{\sum_{k=1}^{M} \exp(u_{ik})}, \tag{2}$$
and $\alpha_{ij}$ are used to calculate the final output hidden representations:
$$o_i = \sum_{j=1}^{M} \alpha_{ij} (t_j W^V), \tag{3}$$
where $o_{1:M}$ are the outputs and $W^Q$, $W^K$, and $W^V$ are learnable projection matrices for query, key, and value in a single head, respectively.
Table 1: Relations possibly satisfied by any two different edges $e_{i:j}$ and $e_{p:q}$ in the lattice. Note that the two equal signs cannot hold at the same time in the condition inequalities for inc and ind.
Relation | Conditions | Explanation
lad | i < j = p < q | $e_{i:j}$ is left adjacent to $e_{p:q}$.
rad | p < q = i < j | $e_{i:j}$ is right adjacent to $e_{p:q}$.
inc | i ≤ p < q ≤ j | $e_{i:j}$ includes $e_{p:q}$.
ind | p ≤ i < j ≤ q | $e_{i:j}$ is included in $e_{p:q}$.
its | i < p < j < q or p < i < q < j | $e_{i:j}$ is intersected with $e_{p:q}$.
pre | i < j < p < q | $e_{i:j}$ is a preceding edge to $e_{p:q}$.
suc | p < q < i < j | $e_{i:j}$ is a succeeding edge to $e_{p:q}$.
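As an illustration of the single-head computation described here (scaled dot product, softmax normalization, weighted sum of values), the following is a minimal pure-Python sketch. The helper names `matmul`, `softmax` and `self_attention_head` are hypothetical, and scaling by the width of the input rows is an assumption that follows the text's use of the model dimension $d$.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_head(t, w_q, w_k, w_v):
    """One attention head over input rows t; returns (outputs, attention weights)."""
    d = len(t[0])  # assumption: scale by the input (model) dimension
    q, k, v = matmul(t, w_q), matmul(t, w_k), matmul(t, w_v)
    outputs, weights = [], []
    for q_i in q:
        # Scaled dot product between this query and every key.
        u_i = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        alpha_i = softmax(u_i)
        # Weighted sum of the value rows.
        o_i = [sum(a * v_j[c] for a, v_j in zip(alpha_i, v)) for c in range(len(v[0]))]
        outputs.append(o_i)
        weights.append(alpha_i)
    return outputs, weights

t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
outputs, weights = self_attention_head(t, identity, identity, identity)
# With a zero query projection every score is 0, so attention is uniform.
uniform_out, uniform_w = self_attention_head(t, zero, identity, identity)
```

Multi-head attention would run several such heads with separate projections and concatenate their outputs before a final linear layer.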

Lattices
Lattices represent multiple segmentations of a sequence in a single directed graph, compactly merging the subsequences shared by the candidate segmentations.
As shown in Figure 1, we follow Su et al. (2017) in applying different segmenters to segment an element sequence $c_{1:N} = \{c_1, c_2, ..., c_N\}$ into different word or subword sequences to construct a lattice $G = \langle V, E \rangle$, a directed, connected, and acyclic graph, where $V$ is the node set and $E$ is the edge set. Node $v_i \in V$ denotes the gap between $c_i$ and $c_{i+1}$, and edge $e_{i:j} \in E$, departing from $v_i$ and arriving at $v_j$ ($i < j$), indicates a possible word or subword unit covering the subsequence $c_{i+1:j}$.
All the edges in the lattice $G$ are the actual input tokens for NMT. For two different edges $e_{i:j}$ and $e_{p:q}$, all possible relations can be enumerated as in Table 1.
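The construction can be sketched as follows. The three segmentations are hypothetical stand-ins for the outputs of different segmenters on the example sentence of Figure 1, not the segmentations actually used in the figure, and `build_lattice` is an illustrative name.

```python
def build_lattice(segmentations):
    """Merge several segmentations of one element sequence into lattice edges.

    Each segmentation is a list of units; each unit is a tuple of elements.
    Edge (i, j) covers elements c_{i+1..j}; node v_i is the gap after c_i.
    """
    elements = [c for unit in segmentations[0] for c in unit]
    edges = {}
    for seg in segmentations:
        if [c for unit in seg for c in unit] != elements:
            raise ValueError("all segmentations must cover the same elements")
        i = 0
        for unit in seg:
            j = i + len(unit)
            edges[(i, j)] = unit  # identical units from different segmenters merge
            i = j
    return elements, sorted(edges.items())

# Hypothetical segmentations of "mao-yi-fa-zhan-ju-fu-zong-cai".
seg_a = [("mao", "yi"), ("fa", "zhan", "ju"), ("fu", "zong", "cai")]
seg_b = [("mao", "yi"), ("fa", "zhan"), ("ju",), ("fu",), ("zong", "cai")]
seg_c = [("mao", "yi", "fa", "zhan", "ju"), ("fu", "zong", "cai")]

elements, lattice = build_lattice([seg_a, seg_b, seg_c])
edge_spans = [span for span, _ in lattice]
```

Because edges are keyed by their start and end nodes, a unit proposed by several segmenters appears only once, which is what makes the lattice compact.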
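The relations of Table 1 can be decided from the four node indices alone. The sketch below is one possible encoding of those conditions; the function name `edge_relation` is hypothetical, and the extra "self" label for identical edges anticipates the additional embedding used in lattice-aware self-attention.

```python
def edge_relation(e1, e2):
    """Relation of edge e1 = (i, j) to edge e2 = (p, q), using the labels of Table 1."""
    (i, j), (p, q) = e1, e2
    if (i, j) == (p, q):
        return "self"  # not in Table 1, which covers two different edges
    if j == p:
        return "lad"   # e1 ends where e2 starts
    if q == i:
        return "rad"   # e2 ends where e1 starts
    if i <= p and q <= j:
        return "inc"   # e1 includes e2
    if p <= i and j <= q:
        return "ind"   # e1 is included in e2
    if j < p:
        return "pre"   # e1 lies entirely before e2, with a gap
    if q < i:
        return "suc"   # e1 lies entirely after e2, with a gap
    return "its"       # remaining case: partial overlap

examples = {
    "lad": edge_relation((0, 2), (2, 5)),
    "rad": edge_relation((2, 5), (0, 2)),
    "inc": edge_relation((0, 5), (2, 4)),
    "ind": edge_relation((2, 4), (0, 5)),
    "its": edge_relation((0, 3), (2, 5)),
    "pre": edge_relation((0, 2), (5, 8)),
    "suc": edge_relation((5, 8), (0, 2)),
}
```

The checks are ordered so that the final fall-through can only be reached by two edges that overlap without either containing the other.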

Lattice-Based Encoders
We place all edges E in the lattice graph into an input sequence t 1:M = {t 1 , t 2 , ..., t M } for Transformer; then we modify the positional encoding to indicate the positional information of input tokens, namely all edges in the lattice graph.
In addition, we propose a lattice-aware self-attention to directly represent the positional relationships among tokens. The overall architecture is shown in Figure 2.
Lattice Positional Encoding (LPE) The original positional encoding indicates the order of the sequence in ascending form $\{p_1, p_2, ..., p_M\}$. We hypothesize that increasing positional encodings can indicate the order of a sequential sentence. As shown in Figure 3, we scan a source sequence element by element, $c_{1:N} = \{c_1, c_2, ..., c_N\}$ (for example, $c_i$ is a character in Figure 3), and record their positions $p_{1:N} = \{p_1, p_2, ..., p_N\}$. Then we use the positional encoding of the first element in a lattice edge to represent the current token's position, which ensures that the edges along every path departing from $v_0$ and arriving at $v_N$ in the lattice have increasing positional encodings. This property is easy to prove: the start and end points $v_i$, $v_j$ of each edge $e_{i:j}$ strictly satisfy $i < j$, and the next edge $e_{j:k}$ starts from $v_j$ and thus gets a larger positional encoding. Formally, for any input token $t_k$, namely edge $e_{i:j}$ covering elements $c_{i+1:j}$, positional encoding $p_{i+1}$ is used to represent its position and is added to its embedding.
Lattice-aware Self-Attention (LSA) We also directly modify self-attention to make it aware of the relations between any two different edges. We modified Equations (1) and (3) in the same way as Shaw et al. (2018) to indicate edge relations:
$$u_{ij} = \frac{(t_i W^Q)(t_j W^K + r^K_{ij})^T}{\sqrt{d}}, \tag{4}$$
$$o_i = \sum_{j=1}^{M} \alpha_{ij} (t_j W^V + r^V_{ij}), \tag{5}$$
where $r^K_{ij}$ and $r^V_{ij}$ are relation embeddings which are added to the keys and values to indicate the relation between input tokens $t_i$ and $t_j$, namely edges $e_{p:q}$ and $e_{k:l}$ in the lattice graph, respectively.
To facilitate parallel computation, we add an additional embedding (self) for the case where a token attends to itself in dot-product attention, so we train eight (seven in Table 1) different relation embeddings $a^K_{1:8}$ and $a^V_{1:8}$ as look-up tables for keys and values, respectively. $r^K_{ij}$ and $r^V_{ij}$ are looked up from $a^K_{1:8}$ and $a^V_{1:8}$ based on the relation between $t_i$ and $t_j$. Figure 3 shows an example of embeddings in lattice-aware self-attention based on the timesteps of tokens fa-zhan-ju and fu. Since self-attention is computed in parallel, we generate a matrix with all lattice embeddings in it for each sentence, which can be easily incorporated into standard self-attention by matrix multiplication. We use different relation embeddings for different Transformer layers but share the same ones between different heads in a single layer.
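The monotonicity property can be checked on a small lattice. The sketch below assigns each edge the position of its first element and enumerates every path from the first node to the last; the edge list is a hypothetical lattice over eight elements, and the function names are illustrative.

```python
def lattice_positions(edges):
    """Position index for each edge (i, j): the position i + 1 of its first element."""
    return {(i, j): i + 1 for (i, j) in edges}

def paths(edges, node, end):
    """Yield every path (list of edges) from `node` to `end` in the lattice."""
    if node == end:
        yield []
        return
    for (i, j) in edges:
        if i == node:
            for rest in paths(edges, j, end):
                yield [(i, j)] + rest

# Hypothetical lattice over 8 elements (nodes v_0 .. v_8).
edges = [(0, 2), (0, 5), (2, 4), (2, 5), (4, 5), (5, 6), (5, 8), (6, 8)]
positions = lattice_positions(edges)
all_paths = list(paths(edges, 0, 8))

# Along every complete path, consecutive edges get strictly larger positions.
increasing = all(
    all(positions[a] < positions[b] for a, b in zip(path, path[1:]))
    for path in all_paths
)
```

Edges that start at the same node, such as (2, 4) and (2, 5), share one position index; they never occur on the same path, so the ordering within any single path is unaffected.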
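A minimal sketch of lattice-aware self-attention for one head is given below, assuming the queries, keys and values have already been projected. The look-up tables `a_k` and `a_v`, the relation ordering in `RELATIONS`, and all function names are assumptions for illustration; a practical implementation would batch this with matrix operations rather than Python loops.

```python
import math

# Seven relations from Table 1 plus the extra "self" relation.
RELATIONS = ["self", "lad", "rad", "inc", "ind", "its", "pre", "suc"]

def relation_id(e1, e2):
    """Index into RELATIONS for the relation of edge e1 = (i, j) to e2 = (p, q)."""
    (i, j), (p, q) = e1, e2
    if (i, j) == (p, q):
        name = "self"
    elif j == p:
        name = "lad"
    elif q == i:
        name = "rad"
    elif i <= p and q <= j:
        name = "inc"
    elif p <= i and j <= q:
        name = "ind"
    elif j < p:
        name = "pre"
    elif q < i:
        name = "suc"
    else:
        name = "its"
    return RELATIONS.index(name)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def lattice_aware_attention(q, k, v, edges, a_k, a_v):
    """q, k, v hold one projected row per edge; a_k, a_v hold one row per relation."""
    d = len(q[0])  # assumption: scale by the width of the projected rows
    n = len(edges)
    outputs = []
    for x in range(n):
        rel = [relation_id(edges[x], edges[y]) for y in range(n)]
        # The relation embedding is added to each key before the dot product.
        u = [sum(qc * (kc + rc) for qc, kc, rc in zip(q[x], k[y], a_k[rel[y]]))
             / math.sqrt(d) for y in range(n)]
        alpha = softmax(u)
        # The relation embedding is added to each value before the weighted sum.
        outputs.append([sum(alpha[y] * (v[y][c] + a_v[rel[y]][c]) for y in range(n))
                        for c in range(d)])
    return outputs

edges = [(0, 2), (2, 5), (0, 5)]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for projected q, k and v
zeros = [[0.0, 0.0] for _ in RELATIONS]
ones = [[1.0, 1.0] for _ in RELATIONS]
plain = lattice_aware_attention(x, x, x, edges, zeros, zeros)
shifted = lattice_aware_attention(x, x, x, edges, zeros, ones)
```

With all relation embeddings set to zero the function reduces to standard scaled dot-product attention. Adding the same vector to every value shifts each output by exactly that vector, because the attention weights sum to one.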

Setup
We conducted experiments on the NIST Chinese-English (Zh-En) and IWSLT 2016 English-German (En-De) datasets. The Zh-En corpus consists of 1.25M sentence pairs and the En-De corpus consists of 191K sentence pairs. For the Zh-En task, we chose the NIST 2005 dataset as the validation set and the NIST 2002, 2003, 2004, 2006, and 2008 datasets as test sets. For the En-De task, tst2012 was used as the validation set and tst2013 and tst2014 were used as test sets. For both tasks, sentence pairs with either side longer than 50 were dropped. We used the case-sensitive 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric and the sign-test (Collins et al., 2005) for statistical significance testing.
For the Zh-En task, we followed Su et al. (2017) in using the toolkit (https://nlp.stanford.edu/software/segmenter.html#Download) to train segmenters on the PKU, MSR (Emerson, 2005), and CTB corpora (Xue et al., 2005); we then generated word lattices from the differently segmented training data. Both source and target vocabularies are limited to 30K. For the En-De task, we adopted 8K, 16K and 32K BPE merge operations (Sennrich et al., 2015) to obtain differently segmented sentences for building subword lattices. 16K BPE merge operations are employed on the target side.
We set the batch size to 1024 tokens and accumulated gradients over 16 batches before each parameter update. During training, we set all dropout to 0.3 and chose the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$ for parameter tuning. During decoding, we used the beam search algorithm and set the beam size to 20. All other configurations were the same as in Vaswani et al. (2017). We implemented our model based on OpenNMT (Klein et al., 2017) and trained and evaluated all models on a single NVIDIA GeForce GTX 1080 Ti GPU.
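The batch settings imply an effective batch of 1024 × 16 tokens per parameter update. The toy sketch below illustrates gradient accumulation on a single scalar weight; it uses a plain averaged gradient step rather than Adam, and `accumulate_and_step` is a hypothetical helper, not part of OpenNMT.

```python
def accumulate_and_step(micro_batch_grads, lr, weight):
    """Average gradients over the micro-batches, then apply one update step."""
    total = 0.0
    for g in micro_batch_grads:
        total += g  # accumulate; no parameter update happens here
    return weight - lr * total / len(micro_batch_grads)

tokens_per_batch = 1024
accumulation_steps = 16
effective_tokens = tokens_per_batch * accumulation_steps  # tokens per update

# Toy example: 16 micro-batches that each report a gradient of 1.0.
new_weight = accumulate_and_step([1.0] * accumulation_steps, lr=0.1, weight=1.0)
```

Accumulation trades wall-clock time for memory: the update behaves like one large batch while only one small batch needs to fit on the GPU at a time.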

Overall Performance
From Table 2, we see that our LPE and LSA models both outperform the Transformer baseline model, by 0.58 and 0.42 BLEU respectively. When we combine LPE and LSA, we get a gain of 0.91 BLEU points. Table 3 shows that our method also works well at the subword level.
The base Transformer system has about 90M parameters, and our LPE and LSA models introduce 0 and 6k additional parameters, respectively, which shows that our lattice approach improves the Transformer with little parameter overhead.
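During training, the base Transformer performs about 0.714 steps per second while the LPE + LSA model processes around 0.328. As lattice-based methods usually slow down training seriously, our lattice design and implementation over the Transformer shows only a moderate efficiency reduction.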
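The reported overhead can be sanity-checked with simple arithmetic. The layer count and per-head dimension below are assumptions (the usual base Transformer configuration of 6 encoder layers and 512 / 8 = 64 dimensions per head); the text does not state them. The throughput figures are the reported 0.714 and 0.328 training steps per second.

```python
num_relations = 8     # seven relations from Table 1 plus "self"
tables_per_layer = 2  # one look-up table for keys, one for values
head_dim = 64         # assumption: 512 model dimensions / 8 heads
num_layers = 6        # assumption: base Transformer encoder depth

# Embeddings are shared across heads within a layer but differ per layer.
lsa_params = num_relations * tables_per_layer * head_dim * num_layers

baseline_steps_per_sec = 0.714
lattice_steps_per_sec = 0.328
relative_speed = lattice_steps_per_sec / baseline_steps_per_sec
```

Under these assumptions the count comes to 6144, which is consistent with the roughly 6k extra parameters reported for LSA, and the combined LPE + LSA model trains at a little under half the speed of the baseline.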

Analysis
Effect of Lattice-Based Encoders To show the effectiveness of our method, we arranged all edges of the lattice into a single sequence, ordered by their first character, and then applied normal positional encodings (PE) to the lattice inputs on our base Transformer model. As shown in Table 4, our LPE and LSA methods outperform normal positional encodings by 0.39 and 0.23 BLEU respectively, which shows that our methods are effective.
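The difference between this baseline and LPE can be seen on a small hypothetical lattice. In the sketch below, breaking ties between edges that share a first character by their end node is an assumption, as are the function names.

```python
def sequential_positions(edges):
    """Normal PE baseline: sort edges by first element, then number them 1..M."""
    ordered = sorted(edges)  # ties on the start node broken by end node (assumption)
    return {edge: pos for pos, edge in enumerate(ordered, start=1)}

def lattice_positions(edges):
    """LPE: every edge (i, j) takes the position i + 1 of its first element."""
    return {(i, j): i + 1 for (i, j) in edges}

# Hypothetical lattice over 8 elements.
edges = [(0, 2), (0, 5), (2, 4), (2, 5), (4, 5), (5, 6), (5, 8), (6, 8)]
normal = sequential_positions(edges)
lattice = lattice_positions(edges)
```

Under the baseline, the alternative edges (0, 2) and (0, 5) receive positions 1 and 2 even though both start the sentence, and the final edge is pushed to position 8. Under LPE both alternatives share position 1 and positions never exceed the number of elements.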
Complementarity of LPE and LSA Our LPE method arranges the edges in every path in increasing positional encoding order, which seems to capture long-range order but ignore local disorder, whereas our LSA method treats all preceding and succeeding edges equally, which seems to address local disorder better but ignore long-range order.
To show the complementarity of these two methods, we again arranged all lattice edges into a single sequence ordered by their first character and used normal positional encodings together with our LSA method; we obtained a BLEU score of 40.90, which is 0.13 higher than the single LSA model. From this, we can see that long-range position information is indeed beneficial to our LSA model.
All analysis experiments were conducted on the NIST dataset.

Related Work
Our work is related to source-side representations for NMT. Generally, the NMT model uses the word as the basic unit for modeling source sentences. In order to obtain better source-side representations and avoid OOV problems, recent research has modeled source sentences at the character level (Ling et al., 2015; Costa-Jussa and Fonollosa, 2016; Yang et al., 2016; Lee et al., 2016), subword level (Sennrich et al., 2015; Kudo, 2018; Wu and Zhao, 2018) and mixed character-word level (Luong and Manning, 2016). All these methods show better translation performance than the word-level model. As the models mentioned above only use the 1-best segmentation as input, the lattice, which can pack many different segmentations in a compact form, has been widely used in statistical machine translation (SMT) (Xu et al., 2005; Dyer et al., 2008) and RNN-based NMT (Su et al., 2017; Sperber et al., 2017). To enhance the representations of the input, lattices have also been applied in many other NLP tasks such as named entity recognition (Zhang and Yang, 2018), Chinese word segmentation (Yang et al., 2019) and part-of-speech tagging (Jiang et al., 2008; Wang et al., 2013).

Conclusions
In this paper, we have proposed two methods to incorporate lattice representations into Transformer.
Experimental results on two datasets, at the word level and the subword level respectively, validate the effectiveness of the proposed approaches.
Different from Veličković et al. (2017), our work also provides an attempt to encode a simple labeled graph into the Transformer and can be used in any task that needs a Transformer encoder to learn sequence representations.

Figure 1: Incorporating three different segmentations into a lattice graph. The original sentence is "mao-yi-fa-zhan-ju-fu-zong-cai". In Chinese it is "贸易发展局 副总裁". In English it means "The vice president of Trade Development Council".

Figure 2: The architecture of the lattice-based Transformer encoder. Lattice positional encoding is added to the embeddings of lattice sequence inputs. Different colors in lattice-aware self-attention indicate different relation embeddings.

Figure 3: Lattice positional encoding $p_{i+1}$ (in green) for edge $e_{i:j}$ in the lattice graph and the relation embeddings $r$ in lattice-aware self-attention based on the timesteps of tokens fa-zhan-ju (in red) and fu (in purple).

Table 2: Translation performance on NIST Zh-En dataset. RNN and Lattice-RNN results are from (Su et al., 2017). We highlight the highest BLEU score in bold for each set. ↑ indicates statistically significant difference (p < 0.01) from best baseline.

Table 4: Translation performance (BLEU score) with normal positional encodings and normal positional encodings with LSA model on NIST Zh-En dataset.