Content Word Aware Neural Machine Translation

Neural machine translation (NMT) encodes the source sentence in a uniform way to generate the target sentence word by word. However, NMT does not consider the importance of each word to the sentence meaning; for example, some words (i.e., content words) express more important meaning than others (i.e., function words). To address this limitation, we first use word frequency information to distinguish content words from function words in a sentence, and then design a content word-aware NMT model to improve translation performance. Empirical results on the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English translation tasks show that the proposed methods significantly improve the performance of Transformer-based NMT.


Introduction
Neural machine translation (NMT) models (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) typically use global neural networks to encode all words when learning the sentence representation and the context vector, and compute the probability of each generated target word in a uniform manner. As a result, each generated target word makes the same contribution to the optimization of the NMT model, regardless of its importance. In fact, there is no mechanism to guarantee that NMT captures information related to word importance when predicting translations.
Intuitively, content words express more important meanings than function words, which suggests their comparative significance. To verify this, we randomly masked content or function words with UNK in the source sentences. Figure 1 shows that on the WMT14 English-to-German task, the BLEU scores on the test set decreased much more substantially when some content words were randomly replaced with UNK, which is in line with the findings of He et al. (2019).
To address this limitation, we propose a content word-aware NMT model that exploits a sequence of content words obtained by a simple content word recognition method. Inspired by the work of Setiawan et al. (2007, 2009) and Zhang and Zhao (2013), we first divide the words in a sentence into content words and function words according to term frequency-inverse document frequency (TF-IDF) constraints. Two methods are designed to utilize the sequences of content words on the source and target sides: 1) we encode the content words of the source sentence as a new source representation, and learn an additional content word context vector based on it to improve translation performance; 2) a specific loss for the content words of the target sentence is introduced to complement the original training objective, yielding a content word-aware NMT model. Empirical results on the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English tasks show the effectiveness of the proposed methods.
Background: Transformer-based NMT
In Transformer-based NMT (Vaswani et al., 2017), the encoder is composed of a stack of L identical layers, each of which contains two sub-layers. The first sub-layer is a self-attention module (ATT), and the second is a position-wise fully connected feed-forward network (FFN). A residual connection (He et al., 2016) is applied around each sub-layer, followed by layer normalization (LN) (Ba et al., 2016). Formally, the l-th layer of the stack is computed as:

C^l = LN(ATT(Q_e^{l-1}, K_e^{l-1}, V_e^{l-1}) + H^{l-1}),
H^l = LN(FFN(C^l) + C^l),   (1)

where {Q_e^{l-1}, K_e^{l-1}, V_e^{l-1}} are the query, key, and value vectors transformed from the (l-1)-th layer representation H^{l-1}. For example, {Q^0, K^0, V^0} are packed from H^0, which is learned by the positional encoding mechanism (Gehring et al., 2017).
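As a concrete illustration, the ATT module in Eq. (1) can be sketched as plain single-head scaled dot-product attention. The function names and the list-of-lists vector representation below are our own illustrative choices, not the paper's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V are lists of d_k-dimensional vectors (lists of floats)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(wj * v[d] for wj, v in zip(w, V))
                    for d in range(len(V[0]))])
    return out
```

With identical keys, the attention weights are uniform and each output is simply the average of the value vectors, which is a quick sanity check for the implementation.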
Similarly, the decoder is composed of a stack of L identical layers. Compared with the encoder stack, each layer contains an additional attention sub-layer that computes alignment weights over the output of the encoder stack H^L:

D^l_i = LN(ATT(Q_d^{l-1}, K_d^{l-1}, V_d^{l-1}) + S^{l-1}_i),
C^l_i = LN(ATT(D^l_i, K_e^L, V_e^L) + D^l_i),
S^l_i = LN(FFN(C^l_i) + C^l_i),   (2)

where Q_d^{l-1}, K_d^{l-1}, and V_d^{l-1} are the query, key, and value vectors, respectively, transformed from the (l-1)-th layer S^{l-1} at time-step i, and {K_e^L, V_e^L} are transformed from the L-th layer of the encoder. The top layer of the decoder S^L_i is used to generate the next target word y_i through a linear, potentially multi-layered function:

P(y_i | y_{<i}, X) = softmax(W_o · tanh(W_w · S^L_i)),   (3)

where W_o and W_w are projection matrices. To obtain the translation model, the training objective maximizes the conditional translation probabilities over the training data set {[X, Y]}:

L(θ) = Σ_{(X,Y)} Σ_i log P(y_i | y_{<i}, X; θ).   (4)

Content Word Recognition
We explore the effect of content words in a sentence on NMT. Specifically, we propose a content word recognition method based on TF-IDF (Chen et al., 2019). An input sentence of length J_m is treated as a document D_m, and the TF-IDF score TI_j for each word d_j in D_m is computed as:

TI_j = (k_{j,m} / Σ_k k_{k,m}) · log(|M| / |{m : d_j ∈ D_m}|),   (5)

where k_{j,m} is the number of occurrences of the j-th word in D_m; |M| is the total number of sentences in the monolingual data; and |{m : d_j ∈ D_m}| is the number of sentences in the monolingual data that contain the word d_j. We then select a fixed percentage N (30% in our experiments) of the words with the highest TF-IDF scores in the sentence as content words. Note that we rely on word frequency statistics rather than linguistic criteria; this approximation eliminates the need for additional language-specific resources.
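A minimal sketch of this recognition step, treating each tokenized sentence in a monolingual corpus as one document. The function name, the unsmoothed IDF, and alphabetical tie-breaking are our own illustrative assumptions:

```python
import math
from collections import Counter

def content_words(sentence, corpus, ratio=0.3):
    """Return the top `ratio` fraction of words in `sentence` (original
    order preserved) ranked by TF-IDF, computed against `corpus`,
    a list of tokenized sentences each treated as one document."""
    tf = Counter(sentence)                            # k_{j,m}
    n_docs = len(corpus)                              # |M|
    scores = {}
    for w in set(sentence):
        df = sum(1 for doc in corpus if w in doc)     # |{m : d_j in D_m}|
        scores[w] = (tf[w] / len(sentence)) * math.log(n_docs / df)
    k = max(1, round(ratio * len(sentence)))
    # break score ties alphabetically so the output is deterministic
    top = set(sorted(set(sentence), key=lambda w: (-scores[w], w))[:k])
    return [w for w in sentence if w in top]
```

A word such as "the" that occurs in every document receives an IDF of zero and is therefore never selected, while rarer words in the sentence rank highest.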

Content Word Aware NMT
In this section, we propose two ways to make use of content word information, designing three content word-aware NMT models.

Source Content Word-Aware Context
The proposed content word recognition method is first added as an additional module to the encoder to obtain the sequence of source content words X̄ from the input source sentence. X̄ is mapped and fed into the shared encoder in Eq. (1) to learn an additional source representation of the content words, H̄^L. A multi-head attention module is then introduced into the decoder to learn a context vector C̄^l_i based on the content words at time-step i, and C̄^l_i is used to enhance the output S^l_i:

C̄^l_i = ATT(S^l_i, K̄^L_e, V̄^L_e),
S̄^l_i = LN(C̄^l_i + S^l_i),   (6)

where K̄^L_e and V̄^L_e of the content words are transformed from the L-th layer of the shared encoder. Finally, the top layer of the decoder S̄^L_i, enhanced by the content word context vector C̄^L_i, is used as input to Eq. (3) to compute the probability of the next target word y_i at time-step i. Note that both the original source representation H^L and the proposed content word-based representation H̄^L are learned by a shared encoder equipped with our content word recognition module.

Target Content Word-Aware Loss
Like the source sentence, the target sentence also contains content words. We thus first identify a sequence of content words b from the target reference translation y using the proposed content word recognition method (see Section 3). We then introduce an additional loss term over the content words, which encourages the translation model to attend to their translation. Formally, the training objective is revised as:

J(θ) = Σ_{(X,Y)} ( Σ_i log P(y_i | y_{<i}, X; θ) + λ Σ_{y_i ∈ b} log P(y_i | y_{<i}, X; θ) ),   (7)

where λ is a hyper-parameter, empirically set to 0.4 in this paper. Note that the introduced content word-aware loss adds no new parameters and influences only the computation of the loss during the training of the standard NMT model.
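The revised objective amounts to up-weighting the log-likelihood of content word tokens by a factor of (1 + λ). A per-sentence sketch, where the function name and the token-level probability inputs are illustrative assumptions rather than the paper's code:

```python
import math

def content_aware_nll(token_probs, tokens, content_words, lam=0.4):
    """Negative log-likelihood with an extra weight `lam` on tokens in the
    target content word set; minimizing this maximizes the revised
    training objective. `token_probs[i]` is the model probability
    assigned to reference token `tokens[i]`."""
    loss = 0.0
    for p, tok in zip(token_probs, tokens):
        loss -= math.log(p)                 # standard NLL term
        if tok in content_words:
            loss -= lam * math.log(p)       # content word-aware term
    return loss
```

Setting `lam=0` recovers the standard NLL, so the extra term can be toggled without changing the model itself, consistent with the loss adding no new parameters.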

Proposed Translation Models
Based on the above two strategies, we design three NMT models: 1) SCWAContext: the source content words are used to learn an additional context vector to improve the prediction of target words (see Figure 2(a)); 2) TCWALoss: the target content words are used to compute an additional loss to guide the training of the translation model (see Figure 2(b)); 3) BCWAContLoss: a combination of SCWAContext and TCWALoss that captures the content words of both the source and the target sentences to further improve translation performance.

Setup
The proposed methods were evaluated on the WMT14 English-to-German (EN-DE), WMT14 English-to-French (EN-FR), and WMT17 Chinese-to-English (ZH-EN) tasks. The EN-DE corpus consists of 4M sentence pairs, the ZH-EN corpus of 22M sentence pairs, and the EN-FR corpus of 36M sentence pairs. We used the case-sensitive 4-gram BLEU score as the evaluation metric. Results on the newstest2014 test sets are reported for the EN-DE and EN-FR tasks, and on the newstest2017 test set for the ZH-EN task. The byte pair encoding algorithm (Sennrich et al., 2016) was applied to all sentences to limit the vocabulary size to 40K. The other configurations were identical to those in (Vaswani et al., 2017). The proposed models were implemented using the fairseq toolkit (Ott et al., 2019).

Table 1: Results of the EN-DE, EN-FR, and ZH-EN tasks. "#Speed" and "#Param" denote the training speed (tokens/second) and the size of model parameters, respectively. "+" after a score indicates that the proposed method was significantly better than the Transformer at significance of p < 0.01 (Collins et al., 2005).

Table 1 shows the results of the proposed methods over our implemented Trans.base/big models, which have BLEU scores similar to those of the original Transformer on the EN-DE and EN-FR tasks. We then make the following observations: 1) All three proposed content word-aware NMT models outperformed the baseline Transformer. This indicates that using information on word importance to enhance the translation of content words is helpful for the NMT model.

Main Results
2) +SCWAContext performed better than +TCWALoss, indicating that the NMT model is more sensitive to information on source content words than on target content words. +BCWAContLoss outperformed both +SCWAContext and +TCWALoss, and was also superior to the existing +Context-Aware SANs, +CSANs, and +BIARN methods. This suggests that the content word sequences of the source and the target can be used together to further improve translation performance.
3) The number of parameters of the proposed models increased only slightly. In addition, Trans.base+BCWAContLoss delivered performance comparable to that of Trans.big, which contains many more parameters. This indicates that the improvement in performance is not merely due to a greater number of parameters. The training speeds of our models were slightly lower than that of Trans.base.

Evaluating Translation of Content Words
We apply the proposed content word recognition method to both the generated translations and the reference translations of the test set, thus extracting two short sequences, each containing 30% of the words as content words. We then compute the unigram accuracy of content words between the two extracted sequences.

Effect of the Hyper-Parameter λ
Figure 4 shows the results of the +TCWALoss model on the EN-DE and ZH-EN test sets with different values of the hyper-parameter λ. When λ increased from 0 to 0.4, the BLEU scores of the +TCWALoss model improved by up to +0.8 points over the Trans.base model. This means that the proposed content word-aware loss is useful for training the NMT model. Larger values of λ, however, reduced the BLEU scores, suggesting that an excessive bias toward content word translation may weaken the translation of function words. We therefore set the hyper-parameter λ to 0.4 to control the loss of target content words in our experiments (Table 1).
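The unigram content word accuracy between the two extracted sequences can be computed as a clipped match rate; this is our own precision-style sketch of the metric, since the paper does not spell out its exact definition:

```python
from collections import Counter

def unigram_accuracy(hyp_content, ref_content):
    """Fraction of content words in the hypothesis sequence that also
    occur in the reference sequence, with per-word counts clipped to the
    reference counts (as in unigram BLEU precision)."""
    hyp, ref = Counter(hyp_content), Counter(ref_content)
    matched = sum(min(c, ref[w]) for w, c in hyp.items())
    return matched / max(1, sum(hyp.values()))
```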

Content Word Recognition based on Function Word Frequency
Instead of directly identifying content words, we can identify function words as the T most frequent words in the corpus. After removing these function words from a sentence x = {x_1, ..., x_J}, all remaining words are treated as a sequence of content words X̄ (maintaining the original order), following Setiawan et al. (2007, 2009) and Zhang and Zhao (2013). Figure 5 shows the results of Trans.base+SCWAContLoss on the EN-DE and ZH-EN test sets with different numbers of top-T function words. Trans.base+SCWAContLoss obtained the highest BLEU scores over Trans.base on both test sets when T = 256.
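This frequency-based variant can be sketched as follows; the function name and the tokenized-corpus format are our own illustrative choices:

```python
from collections import Counter

def content_words_by_freq(sentence, corpus, top_t=256):
    """Treat the `top_t` most frequent corpus words as function words and
    return the remaining words of `sentence` as content words, keeping
    the original word order."""
    freq = Counter(w for doc in corpus for w in doc)
    function_words = {w for w, _ in freq.most_common(top_t)}
    return [w for w in sentence if w not in function_words]
```

Unlike the TF-IDF method, this variant yields a variable number of content words per sentence, since it removes a fixed global function word list rather than selecting a fixed percentage of each sentence.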

Conclusion and Future Work
This paper explored the importance of words for NMT. We divided the words of a sentence into content and function words using word frequency-related information. Our proposed NMT models, which are easy to implement and incur little additional time and space cost, are applied during training and inference and can improve the representation and translation of content words. In future work, we will investigate the impact of fine-grained word categories (such as nouns, verbs, and adjectives) on translation performance and design specific methods for these categories.