Combining Character and Word Information in Neural Machine Translation Using a Multi-Level Attention

Natural language sentences, being hierarchical, can be represented at different levels of granularity, such as words, subwords, or characters. But most neural machine translation systems require the sentence to be represented as a sequence at a single level of granularity. It can be difficult to determine which granularity is better for a particular translation task. In this paper, we improve the model by incorporating multiple levels of granularity. Specifically, we propose (1) an encoder with character attention, which augments the (sub)word-level representation with character-level information, and (2) a decoder with multiple attentions that allow the representations from different levels of granularity to control the translation cooperatively. Experiments on three translation tasks demonstrate that our proposed models outperform the standard word-based model, the subword-based model, and a strong character-based model.


Introduction
Neural machine translation (NMT) models (Britz et al., 2017) learn to map from source language sentences to target language sentences via continuous-space intermediate representations. Since the word is usually regarded as the basic unit of language communication (Jackendoff, 1992), early NMT systems built these representations starting from the word level (Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Weng et al., 2017). Later systems tried using smaller units such as subwords to address the problem of out-of-vocabulary (OOV) words (Sennrich et al., 2016; Wu et al., 2016).
Although they obtain reasonable results, these word or sub-word methods still have some potential weaknesses. First, the learned representations of (sub)words are based purely on their contexts, but the potentially rich information inside the unit itself is seldom explored. Taking the Chinese word 被打伤 (bei-da-shang) as an example, the three characters in this word are a passive voice marker, "hit" and "wound", respectively. The meaning of the whole word, "to be wounded", is fairly compositional. But this compositionality is ignored if the whole word is treated as a single unit.
Secondly, obtaining the word or sub-word boundaries can be non-trivial. For languages like Chinese and Japanese, a word segmentation step is needed, which must usually be trained on labeled data. For languages like English and German, word boundaries are easy to detect, but subword boundaries need to be learned by methods like BPE. In both cases, the segmentation model is trained only on monolingual data, which may result in units that are not suitable for translation.
On the other hand, there have been multiple efforts to build models operating purely at the character level (Ling et al., 2015a; Yang et al., 2016; Lee et al., 2017). But splitting this finely can increase potential ambiguities. For example, the Chinese word 红茶 (hong-cha) means "black tea," but the two characters mean "red" and "tea," respectively. This shows that modeling the character sequence alone may not fully utilize the information at the word or sub-word level, which may also lead to an inaccurate representation. A further problem is that character sequences are longer, making them more costly to process with a recurrent neural network (RNN) model.
While both word-level and character-level information can be helpful for generating better representations, current research that tries to exploit both either composes the word-level representation from character embeddings using word boundary information (Ling et al., 2015b; Costa-jussà and Fonollosa, 2016) or replaces a word's representation with that of its constituent characters when encountering out-of-vocabulary words (Luong and Manning, 2016; Wu et al., 2016). In this paper, we propose a novel encoder-decoder model that makes use of both character and word information. More specifically, we augment the standard encoder to attend to individual characters to generate better source word representations (§3.1). We also augment the decoder with a second attention that attends to the source-side characters to generate better translations (§3.2).
To demonstrate the effectiveness of the proposed model, we carry out experiments on three translation tasks: Chinese-English, English-Chinese and English-German. Our experiments show that: (1) the encoder with character attention achieves significant improvements over the standard word-based attention-based NMT system and a strong character-based NMT system; (2) incorporating source character information into the decoder by our multi-scale attention mechanism yields a further improvement, and (3) our modifications also improve a subword-based NMT model. To the best of our knowledge, this is the first work that uses the source-side character information for all the (sub)words in the sentence to enhance a (sub)word-based NMT model in both the encoder and decoder.

Neural Machine Translation
Most NMT systems follow the encoder-decoder framework with attention mechanism proposed by Bahdanau et al. (2015). Given a source sentence $x = x_1 \cdots x_l \cdots x_L$ and a target sentence $y = y_1 \cdots y_j \cdots y_J$, we aim to directly model the translation probability
$$P(y \mid x; \theta) = \prod_{j=1}^{J} P(y_j \mid y_{<j}, x; \theta),$$
where $\theta$ is a set of parameters and $y_{<j}$ is the sequence of previously generated target words. Here, we briefly describe the underlying framework of the encoder-decoder NMT system.

Encoder
Following Bahdanau et al. (2015), we use a bidirectional RNN with gated recurrent units (GRUs) (Cho et al., 2014) to encode the source sentence:
$$\overrightarrow{h}_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \overrightarrow{\theta}), \qquad \overleftarrow{h}_l = \mathrm{GRU}(\overleftarrow{h}_{l+1}, s_l; \overleftarrow{\theta}),$$
where $s_l$ is the $l$-th source word's embedding, $\mathrm{GRU}$ is a gated recurrent unit, and $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ are the parameters of the forward and backward GRU, respectively; see Cho et al. (2014) for a definition.
The annotation of each source word $x_l$ is obtained by concatenating the forward and backward hidden states:
$$\overleftrightarrow{h}_l = \begin{bmatrix} \overrightarrow{h}_l \\ \overleftarrow{h}_l \end{bmatrix}.$$
The whole sequence of these annotations is used by the decoder.
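As a toy illustration of this bidirectional encoding, the sketch below uses scalar hidden states and a minimal one-dimensional GRU; the weight names (`wz`, `uz`, ...) and function names are purely illustrative and not part of the paper's model:

```python
import math

def gru_step(h_prev, x, w):
    # Toy scalar GRU: update gate z, reset gate r, candidate state.
    z = 1.0 / (1.0 + math.exp(-(w["wz"] * x + w["uz"] * h_prev)))
    r = 1.0 / (1.0 + math.exp(-(w["wr"] * x + w["ur"] * h_prev)))
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand

def bidirectional_encode(embeddings, w_fwd, w_bwd):
    # Run a forward GRU over s_1..s_L and a backward GRU over s_L..s_1,
    # then pair the two hidden states position by position (concatenation).
    fwd, h = [], 0.0
    for s in embeddings:
        h = gru_step(h, s, w_fwd)
        fwd.append(h)
    bwd, h = [], 0.0
    for s in reversed(embeddings):
        h = gru_step(h, s, w_bwd)
        bwd.append(h)
    bwd.reverse()
    return list(zip(fwd, bwd))  # one annotation per source word
```

Each returned pair plays the role of the concatenated annotation consumed by the decoder's attention.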

Decoder
The decoder is a forward RNN with GRUs predicting the translation $y$ word by word. The probability of generating the $j$-th word $y_j$ is
$$P(y_j \mid y_{<j}, x; \theta) = \mathrm{softmax}(f(t_{j-1}, d_j, c_j)),$$
where $t_{j-1}$ is the word embedding of the $(j-1)$-th target word, $d_j$ is the decoder's hidden state at time $j$, and $c_j$ is the context vector at time $j$. The state $d_j$ is computed as
$$d_j = \mathrm{GRU}(d_{j-1}, t_{j-1}, c_j).$$
The attention mechanism computes the context vector $c_j$ as a weighted sum of the source annotations,
$$c_j = \sum_{l=1}^{L} \alpha_{jl} \overleftrightarrow{h}_l,$$
where the attention weight $\alpha_{jl}$ is
$$\alpha_{jl} = \frac{\exp(e_{jl})}{\sum_{l'=1}^{L} \exp(e_{jl'})}, \qquad e_{jl} = v_a^\top \tanh(W_a d_{j-1} + U_a \overleftrightarrow{h}_l),$$
and where $v_a$, $W_a$ and $U_a$ are the weight matrices of the attention model, and $e_{jl}$ scores how well $d_{j-1}$ and $\overleftrightarrow{h}_l$ match. With this strategy, the decoder can attend to the source annotations that are most relevant at a given time.
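The score, softmax, and weighted-sum steps of this attention can be sketched with scalars standing in for the vectors and matrices; `v_a`, `w_a`, `u_a` are toy scalar parameters here, not the paper's actual weights:

```python
import math

def attention(d_prev, annotations, v_a, w_a, u_a):
    # e_l = v_a * tanh(W_a d_{j-1} + U_a h_l), scalar toy version.
    scores = [v_a * math.tanh(w_a * d_prev + u_a * h) for h in annotations]
    # Softmax over source positions gives the attention weights alpha.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]
    # Context vector: weighted sum of the annotations.
    context = sum(a * h for a, h in zip(alpha, annotations))
    return context, alpha
```

With identical annotations the weights come out uniform, as expected from the softmax.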

Character Enhanced Neural Machine Translation
In this section, we present models which make use of both character-level and word-level information in the encoder-decoder framework.

Encoder with Character Attention
The encoder maps the source sentence to a sequence of representations, which is then used by the attention mechanism. The standard encoder operates purely on (sub)words or characters. However, we want to encode both, since both levels can be linguistically significant (Xiong et al., 2017).
To incorporate multiple levels of granularity, we extend the encoder with two character-level attentions. For each source word, the characters of the whole sentence can be divided into two parts, those inside the word and those outside the word. The inside characters contain information about the internal structure of the word. The outside characters may provide information about patterns that cross word boundaries. In order to distinguish the influence of the two, we use two separate attentions, one for inside characters and one for outside characters.
Note that we compute attention directly from the character embedding sequence instead of using an additional RNN layer. This helps to avoid the vanishing gradient problem that would arise from increasing the sequence length, and also keeps the computation cost low. Figure 1 illustrates the forward encoder with character attentions. We write the character embeddings as $o = o_1 \cdots o_k \cdots o_K$. Let $p_l$ and $q_l$ be the starting and ending character positions, respectively, of word $x_l$. Then $o_{p_l} \cdots o_{q_l}$ are the inside characters of word $x_l$; $o_1 \cdots o_{p_l - 1}$ and $o_{q_l + 1} \cdots o_K$ are the outside characters of word $x_l$.
The encoder is an RNN that alternates between reading (sub)word embeddings and character-level information. At each time step, we first read the word embedding:
$$\tilde{h}_l = \mathrm{GRU}(\overrightarrow{h}_{l-1}, s_l; \theta_1).$$
Then we use the attention mechanism to compute a character context vector over the inside characters:
$$c_l^I = \sum_{k=p_l}^{q_l} \alpha_{lk} o_k, \qquad \alpha_{lk} = \frac{\exp(e_{lk})}{\sum_{k'=p_l}^{q_l} \exp(e_{lk'})}, \qquad e_{lk} = v_c^\top \tanh(W_c \tilde{h}_l + U_c o_k).$$
The outside character context vector $c_l^O$ is calculated in a similar way over the outside characters, using a different set of parameters. The inside and outside character context vectors are combined by a feed-forward layer and fed into the encoder RNN, forming the character-enhanced word representation $\overrightarrow{h}_l$:
$$c_l = \tanh\left(W \begin{bmatrix} c_l^I \\ c_l^O \end{bmatrix} + b\right), \qquad \overrightarrow{h}_l = \mathrm{GRU}(\tilde{h}_l, c_l; \theta_2).$$
Note that this second GRU does not share parameters with the GRU that reads the word embedding. The backward hidden states are calculated in a similar manner.
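A minimal sketch of the inside/outside split and the two character attentions, using scalar embeddings; the parameter dictionary and helper names are hypothetical:

```python
import math

def attend_chars(query, char_embs, w):
    # Additive attention over a set of character embeddings; returns the context.
    if not char_embs:
        return 0.0
    scores = [w["v"] * math.tanh(w["W"] * query + w["U"] * o) for o in char_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum((e / z) * o for e, o in zip(exps, char_embs))

def char_contexts(query, chars, p, q, w_inside, w_outside):
    # Characters at positions p..q (inclusive) are inside the word;
    # everything else in the sentence is outside. Two separate
    # parameter sets keep the two attentions distinct.
    inside = chars[p:q + 1]
    outside = chars[:p] + chars[q + 1:]
    return attend_chars(query, inside, w_inside), attend_chars(query, outside, w_outside)
```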

Decoder with Multi-Scale Attention
In order to fully exploit the character-level information, we also make extensions to the decoder, so that the character-level information can be taken into account while generating the translation.
We propose a multi-scale attention mechanism to gather information relevant to the current decoding step from both word-level and character-level representations. The mechanism is built from the high-level to the low-level representation, so that the high-level representation is enhanced with fine-grained internal structure and context. Figure 2 illustrates the decoder with our multi-scale attention mechanism, which attends first at the word level and then at the character level.
First, we obtain the word-level information. The word-level context vector $c_j^w$ is calculated following the standard attention model, using the previous hidden state as the query, and the hidden state is updated:
$$\tilde{d}_j = \mathrm{GRU}(d_{j-1}, t_{j-1}, c_j^w).$$
Then we attend to the character-level representation, which provides more information about each word's internal structure. The character-level context vector $c_j^c$ is calculated based on the updated hidden state above:
$$c_j^c = \sum_{k=1}^{K} \beta_{jk} o_k, \qquad \beta_{jk} = \frac{\exp(e_{jk})}{\sum_{k'=1}^{K} \exp(e_{jk'})}, \qquad e_{jk} = v_b^\top \tanh(W_b \tilde{d}_j + U_b o_k).$$
Finally, the word-level context vector $c_j^w$ and the character-level context vector $c_j^c$ are concatenated:
$$c_j = \begin{bmatrix} c_j^w \\ c_j^c \end{bmatrix}.$$
The final context vector $c_j$ is used to help predict the next target word.
Here $d_j$ is the final hidden state, computed from the updated hidden state and the character-level context vector $c_j^c$. With this mechanism, both the (sub)word-level and character-level representations can be used to predict the next translation, which helps to ensure a more robust and reasonable choice. It may also help to alleviate the under-translation problem, because the character-level information can complement the word-level information.
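One decoder step with the two-stage attention can be sketched as follows (scalar toy version; the `gru` state update is passed in as a placeholder function, and the parameter dictionaries are hypothetical):

```python
import math

def attend(query, keys, w):
    # Generic additive attention over a list of scalar keys.
    scores = [w["v"] * math.tanh(w["W"] * query + w["U"] * k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum((e / z) * k for e, k in zip(exps, keys))

def multi_scale_step(d_prev, word_annotations, char_embs, w_word, w_char, gru):
    # Stage 1: word-level attention, queried with the previous state.
    c_word = attend(d_prev, word_annotations, w_word)
    d_updated = gru(d_prev, c_word)  # intermediate state update
    # Stage 2: character-level attention, queried with the updated state.
    c_char = attend(d_updated, char_embs, w_char)
    # The final context is the concatenation [c^w; c^c] (a pair here).
    return d_updated, (c_word, c_char)
```

The key design point is that the character attention is conditioned on a state that has already absorbed the word-level context, so it refines rather than duplicates the word-level choice.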

Experiments
We conduct experiments on three translation tasks: Chinese-English (Zh-En), English-Chinese (En-Zh) and English-German (En-De). We write Zh↔En to refer to the Zh-En and En-Zh tasks together.

Datasets
For Zh↔En, the parallel training data consists of 1.6M sentence pairs extracted from LDC corpora, with 46.6M Chinese words and 52.5M English words, respectively. We use the NIST MT02 evaluation data as development data, and MT03, MT04, MT05, and MT06 as test data. The Chinese side of the corpora is word-segmented using ICTCLAS. The English side of the corpora is lowercased and tokenized.
For En-De, we conduct our experiments on the WMT17 corpus. We use the pre-processed parallel training data for the shared news translation task provided by the task organizers. The dataset consists of 5.6M sentence pairs. We use newstest2016 as the development set and evaluate the models on newstest2017.

Baselines
We compare our proposed models with several types of NMT systems:

• NMT: the standard attentional NMT model with words as its input (Bahdanau et al., 2015).
• RNN-Char: the standard attentional NMT model with characters as its input.
• Hybrid: the mixed word/character model proposed by Wu et al. (2016).

• CNN-Char: the character-based model of Costa-jussà and Fonollosa (2016), which builds word representations from characters with a convolutional network.

• BPE: a subword-level NMT model, which processes the source-side sentence by Byte Pair Encoding (BPE) (Sennrich et al., 2016).
We used the dl4mt implementation of the attentional model, reimplementing the above models on top of it.

Details
Training For Zh↔En, we filter out the sentence pairs whose source or target side contain more than 50 words. We use a shortlist of the 30,000 most frequent words in each language to train our models, covering approximately 98.2% and 99.5% of the Chinese and English tokens, respectively. The word embedding dimension is 512. The hidden layer sizes of both forward and backward sequential encoder are 1024. For fair comparison, we also set the character embedding size to 512, except for the CNN-Char system. For CNN-Char, we follow the standard setting of the original paper (Costa-jussà and Fonollosa, 2016). For En-De, we build the baseline system using joint BPE segmentation (Sennrich et al., 2017). The number of joint BPE operations is 90,000. We use the total BPE vocabulary for each side.
We use Adadelta (Zeiler, 2012) for optimization with a mini-batch size of 32 for Zh↔En and 50 for En-De.
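The 30,000-word shortlist and its token coverage (the 98.2% and 99.5% figures above) can be computed with a sketch like this (function name hypothetical):

```python
from collections import Counter

def shortlist_coverage(corpus_tokens, vocab_size):
    # Keep the vocab_size most frequent word types and measure the
    # fraction of running tokens they cover; the rest map to UNK.
    counts = Counter(corpus_tokens)
    shortlist = {w for w, _ in counts.most_common(vocab_size)}
    covered = sum(c for w, c in counts.items() if w in shortlist)
    return covered / len(corpus_tokens)
```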
Decoding and evaluation We use beam search with length normalization to approximately find the most likely translation. We set the beam width to 5 for Zh↔En and 12 for En-De. The translations are evaluated by BLEU (Papineni et al., 2002). We use the multi-bleu script for Zh↔En, and the multi-bleu-detok script for En-De.
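Length normalization divides a hypothesis's summed log-probability by (a power of) its length so that beam search does not systematically favor short outputs; a minimal sketch of the scoring, not the full search:

```python
def length_normalized_score(token_log_probs, alpha=1.0):
    # Average-style score: total log-probability over length^alpha.
    return sum(token_log_probs) / (len(token_log_probs) ** alpha)

def pick_best(hypotheses, alpha=1.0):
    # hypotheses: list of (tokens, per-token log-probabilities).
    return max(hypotheses, key=lambda h: length_normalized_score(h[1], alpha))
```

Without the normalization, the one-token hypothesis below would win on raw log-probability (-1.0 vs. -1.2); with it, the longer hypothesis is preferred.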

Results: Encoder with character attention
This set of experiments evaluates the effectiveness of our proposed character-enhanced encoder. In Table 1, we first compare the encoder with character attention (Char-att) with the baseline word-based model. The result shows that our extension of the encoder obtains significantly better performance (+1.58 BLEU).
Then, in order to investigate whether the improvement comes from the extra parameters in the character layer, we compare our model to a word-embedding enhanced encoder. When this encoder encodes a word, it attends to the word's embedding and the other word embeddings in the sentence, instead of attending to the word's inside and outside character embeddings. The results show that the word-embedding enhanced encoder (Word-att) only gets a 0.5 BLEU improvement over the baseline, while our model is significantly better (+1.58 BLEU). This shows that the benefit comes from the added character-level information, which helps the word-based encoder to learn a better source-side representation.
Finally, we compare our character-enhanced model with several types of systems, including the strong character-based model proposed by Costa-jussà and Fonollosa (2016) and the mixed word/character model proposed by Wu et al. (2016). In Table 2, row 2 confirms the finding of Yang et al. (2016) that the traditional RNN model performs less well when the input is a sequence of characters. Row 4 indicates that Wu et al. (2016)'s scheme for combining words and characters is effective for machine translation. Our model (row 5) outperforms the other models on the Zh-En task, but only outperforms the word-based model on En-Zh. The results may suggest that the CNN and RNN methods are also strong in building the source representation.

Table 3: Comparison of our models on top of the BPE-based NMT model and the original BPE-based model on the Chinese-English and English-Chinese translation tasks. Our models improve over the BPE baselines.

Results: Multi-scale attention
Row 6 in Table 2 verifies that our multi-scale attention mechanism obtains better results than the baseline systems. Row 7 in Table 2 shows that our proposed multi-scale attention mechanism further improves the performance of our encoder with character attention, yielding a significant improvement over the standard word-based model on both the Zh-En task (+2.02 vs. row 1) and the En-Zh task (+2.58 vs. row 1).
Compared to the CNN-Char model, our model still gets +1.97 and +1.46 BLEU improvements on Zh-En and En-Zh, respectively. Compared to the mixed word/character model proposed by Wu et al. (2016), we find that our best model gives a better result, demonstrating the benefits of exploiting character-level information during decoding.

Results: Subword-based models
Currently, subword-level NMT models are widely used to achieve open-vocabulary translation. Sennrich et al. (2016) introduced a subword-level NMT model using subword segmentation based on the byte pair encoding (BPE) algorithm. In this section, we investigate the effectiveness of our character-enhanced model on top of the BPE model. Table 3 shows the results on the Zh-En and En-Zh translation tasks. Row 8 confirms that BPE slightly improves the performance of the word-based model. But both our character-enhanced encoder and the multi-scale attention yield better results. Our best model leads to improvements of up to 1.58 BLEU and 1.68 BLEU on the Zh-En and En-Zh tasks, respectively.
We also conduct experiments on the En-De translation task (as shown in Table 4). The result is consistent with the Zh-En and En-Zh translation tasks. Our best model obtains a 1.43 BLEU improvement over the BPE model.

Analysis
We have argued that character information is important not only for OOV words but also for frequent words. To test this claim, we divided the MT03 test set into two parts according to whether the sentence contains OOV words, and evaluated several systems on the two parts. Table 6 lists the results. Although the hybrid model achieves a better result on the sentences which contain OOV words, it actually gives a worse result on the sentences without OOV words. By contrast, our model yields the best results on both parts of the data. This shows that frequent words also benefit from fine-grained character-level information. Table 5 shows three translation examples. Table 5(a) shows the translation of the OOV word 通信业 (tong-xin-ye, telecommunication industry). The baseline NMT system cannot translate the whole word because it is not in the word vocabulary. The hybrid model translates the word to "communication," which is a valid translation of only the first two characters 通信. This mistranslation also appears to affect other parts of the sentence adversely. Our model translates the OOV word correctly.
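The test-set split used here can be reproduced with a sketch like the following (names hypothetical):

```python
def split_by_oov(sentences, vocab):
    # Put a sentence in the OOV part if any of its tokens is
    # outside the word vocabulary, else in the non-OOV part.
    with_oov, without_oov = [], []
    for sent in sentences:
        if any(tok not in vocab for tok in sent.split()):
            with_oov.append(sent)
        else:
            without_oov.append(sent)
    return with_oov, without_oov
```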
Table 5(b) shows two translation samples involving frequent words. For the compound word 被占领土 (beizhanlingtu, occupied territory), the baseline NMT system only partly translates the word as "occupation" and ignores the main part 领土 (lingtu, territory). The CNN-Char model, which builds up the word-level representation from characters, also fails to capture 领土 (lingtu). However, our model correctly translates the word as "occupied territories." (The phrase "by Israel" in the reference was inserted by the translator.) The words 东西方 (dongxifang, east and west) and 冷战 (lengzhan, cold war) are deleted by the baseline model, and even the CNN-Char model translates 东西方 (dongxifang) incorrectly. By contrast, our model can make use of both words and characters to translate 东西方 (dongxifang) reasonably well as "eastern and western."

Related Work
Many recent studies have focused on using character-level information in neural machine translation systems. These efforts could be roughly divided into the following two categories.
The first line of research attempted to build neural machine translation models purely on characters, without explicit segmentation. Lee et al. (2017) proposed to learn the segmentation directly from characters by using convolution and pooling layers. Yang et al. (2016) composed the high-level representation from the character embeddings and their surrounding character-level context with a bidirectional and concatenated row convolution network. Different from their models, our model aims to use characters to enhance word representations instead of depending on characters alone; our model is also much simpler.
The other line of research attempted to combine character-level information with word-level information in neural machine translation models, which is more similar to our work. Ling et al. (2015a) employed a bidirectional LSTM to compose character embeddings into word-level information with the help of word boundaries. Costa-jussà and Fonollosa (2016) replaced the word-lookup table with a convolutional network followed by a highway network (Srivastava et al., 2015), which learns the word-level representation from its constituent characters. Zhao and Zhang (2016) designed a decimator for their encoder, which uses an RNN to compute a word representation from the characters of the word. These approaches only consider word boundary information and ignore the word-level meaning itself. In contrast, our model can make use of both character-level and word-level information. Luong and Manning (2016) proposed a hybrid scheme that consults character-level information whenever the model encounters an OOV word. Wu et al. (2016) converted the OOV words in the word-based model into sequences of their constituent characters. These methods only focus on dealing with OOV words by adding character-level information. In our work, we add character information to all words.

Conclusion
In this paper, we have investigated the potential of using character-level information in word-based and subword-based NMT models by proposing a novel character-aware encoder-decoder framework. First, we extended the encoder with a character attention mechanism for learning better source-side representations. Then, we incorporated information about source-side characters into the decoder with a multi-scale attention, so that the character-level information can cooperate with the word-level information to better control the translation. The experiments have demonstrated the effectiveness of our models. Our analysis showed that both OOV words and frequent words benefit from the character-level information.
Our current work only uses character-level information on the source side. For future work, it will be interesting to make use of finer-grained information on the target side as well.