Improving Low-Resource NMT through Relevance Based Linguistic Features Incorporation

In this study, linguistic knowledge at different levels is incorporated into the neural machine translation (NMT) framework to improve translation quality for language pairs with extremely limited data. Integrating manually designed or automatically extracted features into the NMT framework is known to be beneficial. However, this study emphasizes that the relevance of the features is crucial to the performance. Specifically, we propose two methods, 1) self relevance and 2) word-based relevance, to improve the representation of features for NMT. Experiments are conducted on translation tasks from English to eight Asian languages, with no more than twenty thousand sentences for training. The proposed methods improve translation quality for all tasks by up to 3.09 BLEU points. Discussions with visualization explain the proposed methods: we show that the relevance methods assign weights to features, thereby enhancing their impact on low-resource machine translation.


Introduction
Neural machine translation (NMT) (Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017; Kitaev et al., 2020) is known to give state-of-the-art translation quality for language pairs with abundant parallel corpora. In resource-poor scenarios, additional translation knowledge is acquired either through transfer learning in the form of pre-trained model parameters or by supplying external monolingual corpora. However, exploiting linguistic information effectively in low-resource conditions is still an under-researched field. Annotating the source side with various syntactic features, e.g. part-of-speech (POS), lemma and dependency labels, can help accurate translation when a surface word is polysemous, i.e. has multiple senses depending on context, or when the same word form is shared by different root words due to the inflectional nature of the language. Hence, for morphologically rich languages, the use of language-specific knowledge should ideally improve translation quality.
To the best of our knowledge, there is not an adequate amount of research on effectively incorporating arbitrary syntactic information into NMT. One possible reason could be that, in a high-resource scenario, the network learns from the large amount of training data to handle the problems caused by polysemy and morphological variants. In this direction, notable works include Sennrich and Haddow (2016) and Hoang et al. (2016). Sennrich and Haddow (2016) incorporated several features at the source side by employing a separate embedding matrix for each component of a source token, including the word and its associated features. Finally, all embeddings are concatenated to enrich the representation. Inspired by this work, Hoang et al. (2016) developed a method to process feature sequences of the source sentence by separate recurrent neural networks (RNNs) and combined the output of all RNNs using a hybrid global-local attention strategy. A further line of work proposed a complex RNN architecture to model source-side linguistic knowledge. At the first level, this approach passes the morphological properties of each word sequentially through an RNN to build a composite representation of the features. These representations are further encoded by an RNN at the sentence level. Very recently, Pan et al. (2020) came up with a dual-source Transformer model to process words and features in isolation. The outputs of the word Transformer and the feature Transformer are fed to the decoder by two encoder-decoder attention sub-layers placed in series. None of the models discussed so far considers whether a particular feature of a given word is actually relevant for the translation task.
The above point motivates the present research. We hypothesize that merely including the features alongside the word through a generalized embedding layer, or forming a composite representation by taking all features together, does not exploit the features completely. There should be some mechanism that accounts for the relationship between a word and its supporting features as well. Driven by this idea, we come up with two simple strategies which measure the relevance of the word and the feature embeddings obtained from the output of the embedding layers. The first one is self relevance, which considers the relevance of a feature with respect to itself. We apply an attention function to the feature embedding, which in turn generates a mask determining the importance of that feature. Finally, the mask is applied to the feature to effect the attention. Our second approach considers the feature relevance with respect to the corresponding word, which is the most vital component of a source token. In this case, the attention function operates on a word-feature pair and returns the mask determining the word-based relevance of the input feature. For experimentation, we choose the Transformer network of Vaswani et al. (2017) and assess our proposed techniques on eight low-resource language pairs with diverse morphological variations taken from the Asian Language Treebank (Riza et al., 2016). Our hypothesis is empirically validated, showing the usefulness of the relevance checking mechanisms in the low-resource scenario. We achieve gains of up to 3.09 BLEU points over the standard baseline models of Sennrich and Haddow (2016) and Vaswani et al. (2017). In the next section, the related works are briefly described.

Related Works
Incorporating morphological information for NMT is a challenging area of research. A significant number of works involve dependency structure at the source side (Eriguchi et al., 2016; Shi et al., 2016; Bastings et al., 2017; Chen et al., 2017; Hashimoto and Tsuruoka, 2017; Li et al., 2017; Wu et al., 2018; Zhang et al., 2019). Eriguchi et al. (2016) proposed a syntax-aware encoding mechanism that encodes the source sentence while maintaining the hierarchy of its dependency tree. A Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) is used to encode the constituent phrases recursively in the bottom-up direction. In contrast, Shi et al. (2016) claimed that an RNN encoder can capture the inherent syntactic properties automatically from a source sentence as a by-product of training. They used a multi-layer LSTM and found that its different layers represent different types of syntax. This work suggests that using a linguistic prior at the source side may be redundant for translation. Following Eriguchi et al. (2016), Chen et al. (2017) presented a bidirectional tree encoder which builds the sentence representation considering both the top-down and bottom-up directions of the dependency tree. A different approach was taken by Bastings et al. (2017), who proposed a syntactic graph-convolutional network for encoding source sentences. Hashimoto and Tsuruoka (2017) came up with a model that learns the latent graph structure of a source sentence optimized by the translation objective. In the study by Li et al. (2017), the parse tree of the source sentence is linearized into a label sequence. Next, three different RNN-based encoding strategies, namely parallel, hierarchical and mixed, are tried to integrate the dependency information. Wu et al. (2018) used dependency information of both the source and the target languages. As their model needs multiple encoders and decoders, it is not well suited to low-resource conditions. Zhang et al. (2019) employed a supervised encoder-decoder dependency parser and used the outputs from its encoder as syntax-aware representations of words, which in turn are concatenated to the input embeddings of the translation model. Most recently, Bugliarello and Okazaki (2020) proposed dependency-aware self-attention in the Transformer that needs no extra parameters. For a pivot word, its self-attention scores with other words are weighted according to their distances from the dependency parent of the pivot word.
Apart from the above works, some studies use factors on the target side (Burlot et al., 2017; García-Martínez et al., 2016a; García-Martínez et al., 2016b). In general, their approach is to predict the roots and other morphological tags of the target words instead of producing the surface forms. Additionally, a morphological analyzer is applied for the reinflection task.

NMT Architecture
We employ the Transformer architecture for our experiments. It is a specific type of encoder-decoder neural model that can be applied to sequence modelling tasks. Unlike an RNN, the Transformer does not rely on recurrence and hence is much more parallelizable. As in the general encoding-decoding framework, the learned embeddings of the source sentence $s = (s_1, \ldots, s_m)$ are mapped by the encoder to continuous representations $z = (z_1, \ldots, z_m)$. Each representation $z_i$ contains contextual information about its surrounding tokens. To keep track of the order of the sequence, positional encodings are added to the input embeddings.
The encoder comprises a stack of identical layers, each having two sub-layers: multi-head self-attention and a position-wise fully connected feed-forward network. For each position in the source sentence, the self-attention block calculates a probability distribution over all positions. This distribution is then used to make a new representation of the reference position. Multi-head self-attention repeats the process a number of times, resulting in multiple representations in different subspaces. Finally, all of them are concatenated and the output is passed to the fully connected feed-forward network, which contains two linear layers with a ReLU activation function in between. As no recurrence is involved, the encoding process can be parallelized during both the training and inference phases.
The decoder is also made of a stack of identical layers having multi-head self-attention and feed-forward networks, with the addition of an extra computation called multi-head encoder-decoder attention over the output of the encoder stack. Here, the self-attention sub-layer is masked so that a particular position cannot attend to subsequent positions. This must be enforced to preserve the auto-regressive property. All sub-layers in the encoder and the decoder stacks have residual connections followed by layer normalization.
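The self-attention computation described above, including the decoder-side masking, can be illustrated with a minimal single-head sketch. It is not the OpenNMT implementation: for brevity the query, key and value projections are taken to be the identity, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention.

    x: list of m vectors, each a list of d floats.
    Assumption of this sketch: queries, keys and values are x itself
    (no learned projections). With causal=True, position i attends only
    to positions <= i, as in the masked decoder sub-layer.
    Returns (outputs, attention weights).
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        visible = x[: i + 1] if causal else x
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        probs = softmax(scores)
        # Masked (future) positions receive zero probability.
        probs = probs + [0.0] * (len(x) - len(probs))
        out = [sum(p * v[j] for p, v in zip(probs, x)) for j in range(d)]
        outputs.append(out)
        weights.append(probs)
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, w = self_attention(x, causal=True)
print(w[0])  # the first position can only attend to itself
```

Each row of the returned weights is the probability distribution over positions mentioned in the text; with `causal=True` the entries above the diagonal are zero.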

Generalized Source Embedding for Input Features
Sennrich and Haddow (2016) incorporated arbitrary morphological features through a generalized source embedding. Formally, let each token in the source language sentences be annotated with $K$ features. For the $k$-th feature, $V_k$ and $E_k$ denote the vocabulary and the feature embedding matrix respectively, with $E_k \in \mathbb{R}^{d_k \times |V_k|}$, where $d_k$ is the dimension of the feature embedding. Finally, the embeddings of all features are concatenated to form the generalized embedding of a source token. For the source token $s_i$, its embedding $e_i$ is formulated as follows.

$$e_i = \big\Vert_{k=1}^{K} e_{ik}$$

where $s_{ik}$ denotes the $k$-th feature of $s_i$, $e_{ik}$ denotes the vector embedding of the feature $s_{ik}$ (the column of $E_k$ indexed by $s_{ik}$), and $\Vert$ is the vector concatenation operation. Hence, the dimension of the resultant vector $e_i$ is $\sum_{k=1}^{K} d_k$. Finally, $e_i$ is given as input to the encoder.
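As a rough illustration of the generalized embedding, the sketch below looks up one vector per component of a source token and concatenates them. The toy vocabularies, dimensions and helper names are assumptions made for this example only; real embedding tables are learned during training.

```python
import random


def make_embedding(vocab, dim, rng):
    """Return a lookup table mapping each symbol to a random vector of size dim."""
    return {sym: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for sym in vocab}


def generalized_embedding(token, tables):
    """Concatenate the embeddings of all K components of one source token.

    token: tuple (s_i1, ..., s_iK); tables: one lookup table per component.
    """
    e_i = []
    for component, table in zip(token, tables):
        e_i.extend(table[component])
    return e_i


rng = random.Random(0)
# Hypothetical toy setup: word, lemma and POS tag with dimensions 8, 8 and 3.
vocabs = [["cats", "ran"], ["cat", "run"], ["NNS", "VBD"]]
dims = [8, 8, 3]
tables = [make_embedding(v, d, rng) for v, d in zip(vocabs, dims)]

e = generalized_embedding(("cats", "cat", "NNS"), tables)
print(len(e))  # 19, i.e. the sum of the component dimensions
```

The length of the result equals the sum of the individual dimensions, matching the dimension of $e_i$ stated above.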

The Proposed Relevance Checking Methods
Our hypothesis is that merely including the features through separate embedding matrices and then combining them by concatenation does not exploit the features completely. It would be beneficial to weight the feature vectors according to their relevance for translation. The challenge is how to estimate the relevance of each feature component in order to improve the translation quality. To address this issue, we propose two empirical methods, self relevance and word-based relevance, which are described below.

Self Relevance
Among the $K$ components of the source token $s_i = (s_{i1}, \ldots, s_{iK})$, let the first component $s_{i1}$ denote the corresponding word and the remaining components $s_{i2}$ to $s_{iK}$ denote various morphological properties of $s_{i1}$. The corresponding vector embeddings are $e_{i1}, \ldots, e_{iK}$. The self relevance of each embedding $e_{ik}$ is computed as follows.

$$\hat{e}_{ik} = \sigma(W_k e_{ik}) \odot e_{ik}$$

where $W_k \in \mathbb{R}^{d_k \times d_k}$ is the weight matrix for the $k$-th feature, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise multiplication operation between two vectors. The feature embedding $e_{ik}$ is given as input to a linear layer whose output dimension is the same as that of $e_{ik}$, followed by a sigmoid activation function. The output is a mask vector with values in the range between 0 and 1, which signifies the self relevance of $e_{ik}$. Next, the element-wise multiplication of the input embedding and the output mask produces the modified feature embedding $\hat{e}_{ik}$. Finally, all modified feature embeddings $\hat{e}_{i1}, \ldots, \hat{e}_{iK}$ are concatenated to make the resultant embedding $e_i$, which is given as input to the Transformer encoder. In this way the features can determine their own impact on the final word representation. The process is depicted in Figure 1.
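The gating step described above (a linear layer, a sigmoid mask, then element-wise multiplication) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the linear layer has no bias term, the weights are random rather than trained, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def self_relevance(e_ik, W_k):
    """Gate one embedding by a mask computed from the embedding itself.

    e_ik: list of d_k floats; W_k: d_k x d_k weight matrix (list of rows).
    Returns (gated embedding, mask) with mask = sigmoid(W_k e_ik).
    """
    mask = [sigmoid(sum(w * v for w, v in zip(row, e_ik))) for row in W_k]
    gated = [m * v for m, v in zip(mask, e_ik)]
    return gated, mask


def embed_token(embeddings, weights):
    """Apply self relevance to every component and concatenate the results."""
    e_i = []
    for e_ik, W_k in zip(embeddings, weights):
        gated, _ = self_relevance(e_ik, W_k)
        e_i.extend(gated)
    return e_i


rng = random.Random(0)
dims = [4, 3]  # hypothetical toy sizes for a word and one feature
embeddings = [[rng.uniform(-1, 1) for _ in range(d)] for d in dims]
weights = [
    [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for d in dims
]
e_i = embed_token(embeddings, weights)
print(len(e_i))  # 7, the sum of the component dimensions
```

Because every mask value lies strictly between 0 and 1, the gate can only shrink each dimension of an embedding, never flip its sign or amplify it.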

Word-based Relevance
This strategy evaluates the relevance of a morphological feature with respect to its corresponding word. For $k \in \{2, 3, \ldots, K\}$, the word-based relevance of the embedding $e_{ik}$ is measured with respect to $e_{i1}$. Formally,

$$\hat{e}_{ik} = \sigma\big(W_k (e_{i1} \Vert e_{ik})\big) \odot e_{ik}$$

where $W_k \in \mathbb{R}^{d_k \times (d_1 + d_k)}$ is the weight matrix. While the self relevance measures the importance of a feature with respect to itself, the word-based relevance gives priority to the word component. The final embedding $e_i$ of the source token $s_i$ is obtained by concatenating $\hat{e}_{i2}, \ldots, \hat{e}_{iK}$ with $e_{i1}$. We present the word-based relevance checking mechanism in Figure 2.

Following Sennrich and Haddow (2016), we add an extra feature to each subword on the source side in addition to the three linguistic features (lemma, POS and dependency labels). Every subword is annotated with one of four markers: B (beginning), I (inside), E (end), S (single). The annotation is done with the help of the script referenced in footnote 3. Table 1 depicts the structure of a sample source sentence.

Baselines: We compare our proposed self relevance and word-based relevance methods with the following baselines.
• Transformer-base: It is the base configuration proposed by Vaswani et al. (2017). The experiments are done at the subword level without using external linguistic knowledge.
• Concat: This is the technique proposed by Sennrich and Haddow (2016). The embeddings of a subword and its supporting features are concatenated. Sennrich and Haddow (2016) reported results on an RNN model, whereas we apply the same strategy to the Transformer.
• Add: Subword and feature embeddings are added to form the resultant embedding, i.e. $e_i = \sum_{k=1}^{K} e_{ik}$.
• Linear: The concatenated embeddings are passed through a linear transformation followed by a ReLU activation. Here, $e_i = \mathrm{ReLU}\big(W (\Vert_{k=1}^{K} e_{ik})\big)$, where $W$ is the weight matrix.

Table 2: Embedding dimensions of the subword (first column) and of its four features (remaining columns) for all experimental settings.
Model      Subword  Features
Base       512      -    -    -    -
Concat     250      250  6    15   15
Add        512      512  512  512  512
Linear     250      250  6    15   15
Self-rel   250      250  6    15   15
Word-rel   250      250  6    15   15

Hyperparameters: We use the OpenNMT PyTorch implementation (Klein et al., 2017) to build our models and mostly follow the Transformer-base hyperparameter setting mentioned there (footnote 4). There are 6 layers in each of the encoder and the decoder stacks. The number of attention heads is 8. The dimension of the fully connected feed-forward network is 2,048. The total number of training steps is set to 200,000, and validation is performed after every 10,000 steps. We use early stopping in training: if the validation accuracy does not improve for 5 consecutive validation checks, training stops. Following Sennrich and Haddow (2016), we keep the dimension of the final embedding that is fed to the Transformer comparable across the models with and without features, so that the number of model parameters does not influence the performance. In Table 2 we list the embedding dimensions of the subword and its features for all experimental settings. Inference is done with a beam size of 5. We carry out our experiments on a single 32 GB Tesla V100-SXM2 GPU.
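The Add and Linear baselines and the word-based relevance method differ only in how the component embeddings of a token are combined, which the following sketch makes concrete. It assumes that the word-based mask is computed from the concatenated word-feature pair and that the linear layers have no bias; the helper names are hypothetical and the weights are placeholders rather than trained parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def concat(embs):
    """Concat baseline: join all component embeddings into one vector."""
    return [v for e in embs for v in e]


def combine_add(embs):
    """Add baseline: all embeddings share one dimension and are summed."""
    return [sum(vals) for vals in zip(*embs)]


def combine_linear(embs, W):
    """Linear baseline: ReLU applied to a linear map of the concatenation."""
    return [max(0.0, z) for z in matvec(W, concat(embs))]


def word_based_relevance(e_word, e_feat, W_k):
    """Gate a feature embedding by a mask computed from the word-feature pair.

    Assumption of this sketch: W_k has shape d_k x (d_1 + d_k) and acts on
    the concatenation of the word and feature embeddings.
    """
    mask = [sigmoid(z) for z in matvec(W_k, e_word + e_feat)]
    return [m * v for m, v in zip(mask, e_feat)]


def combine_word_rel(embs, weights):
    """Keep the word embedding as is and append the gated feature embeddings."""
    e_word, feats = embs[0], embs[1:]
    gated = [word_based_relevance(e_word, f, W) for f, W in zip(feats, weights)]
    return concat([e_word] + gated)


# Toy example: a 2-dimensional word embedding and a 1-dimensional feature.
embs = [[1.0, 2.0], [4.0]]
print(combine_word_rel(embs, [[[0.0, 0.0, 0.0]]]))  # zero weights give a 0.5 mask
```

With all-zero weights the mask is 0.5 everywhere, so the feature value 4.0 is halved while the word embedding passes through unchanged.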

Results
BLEU Scores: We present the BLEU scores (Papineni et al., 2002) of our proposed methods and the baselines in Table 3. The scores are computed after undoing the BPE segmentation of the translations. For en-fi, en-id, en-ms, en-my and en-vi, the self relevance checking strategy yields the best results (26.26, 30.41, 34.71, 16.53 and 27.74 respectively). For en-bg, en-hi and en-khm, the word-based relevance method outperforms the others (6.25, 21.63 and 25.13 respectively). Compared to the base configuration, the maximum improvement is obtained for en-hi (18.54 → 21.63) and the minimum for en-fi (25.59 → 26.26). Compared to Sennrich and Haddow (2016), i.e. the concat combination, we get the maximum BLEU gain for en-fi (23.75 → 26.26) and the minimum for en-bg (5.56 → 6.25). We also performed a significance test and found that these improvements are statistically significant with p-value < 0.05. Overall, the two methods proposed in this work are the top two performers for seven of the eight language pairs, the exception being en-id. These results demonstrate the effectiveness of the self relevance and the word-based relevance checking strategies.
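The gains quoted above follow directly from the reported scores, as the short check below shows. Only numbers stated in the text are used; the dictionary layout is chosen for this illustration.

```python
# BLEU scores quoted in the text (base/concat baselines vs. best proposed method).
reported = {
    "en-hi vs. base": (18.54, 21.63),
    "en-fi vs. base": (25.59, 26.26),
    "en-fi vs. concat": (23.75, 26.26),
    "en-bg vs. concat": (5.56, 6.25),
}

# Gain in BLEU points, rounded to two decimals as in the reported scores.
gains = {name: round(best - baseline, 2) for name, (baseline, best) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} BLEU")
```

The largest of these differences, for en-hi against the base configuration, is the 3.09-point gain cited as the maximum improvement.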
Comparison of Model Parameters: Table 4 provides the number of model parameters in each configuration. The base model and the addition combination require the lowest and the highest number of parameters respectively. The remaining configurations (linear, concat, self relevance and word-based relevance) use a comparable number of parameters. Larger models in low-resource settings like ours often overfit, leading to low translation quality. Thus, merely increasing the number of parameters should not be the reason behind any improvement in translation quality. Despite the increase in the number of parameters, our methods show strong improvements in BLEU scores. Therefore, we can say that the performance gain is due to our proposed methods and not because of the additional parameters.

Best Validation Set Perplexity: We report the best validation set perplexity of the models in addition to BLEU scores. Table 5 shows the results. The perplexity of a translation model indicates how well the model predicts the reference translation given a source sentence. We therefore assess all models in terms of perplexity to test whether our proposed methods are really effective. Since lower perplexity indicates a better model, Table 5 shows that self relevance gives the best scores for en-fi, en-hi, en-khm and en-ms. For the rest of the language pairs, the word-based relevance outperforms the others. Overall, these two methods are the top two performers for all language pairs except en-fi. To ensure that the best validation set perplexity obtained is not just due to randomness, we plot the training time (number of batches processed) against the perplexity on the validation set for the initial 10,000 batches of training at intervals of 100 batches.
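For reference, perplexity can be computed from the per-token log-probabilities that a model assigns to the reference translation. The sketch below uses the standard definition (the exponential of the average negative log-likelihood per token); the exact normalization used by the toolkit is not stated in the text, so this is an assumption, and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the reference tokens.

    Defined here as exp of the average negative log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("at least one token is required")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that is always undecided between 4 equally likely tokens has a
# perplexity of about 4; a more confident model scores lower.
uncertain = perplexity([math.log(0.25)] * 10)
confident = perplexity([math.log(0.8)] * 10)
print(uncertain, confident)
```

A lower value means the model concentrates more probability on the reference tokens, which is why lower validation perplexity is read as a sign of a better model.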
The experiments are done taking the best models in terms of BLEU score obtained for every individual language pair (the self relevance model for en-fi, en-id, en-ms, en-my and en-vi, and the word-based relevance model for en-bg, en-hi and en-khm, as shown in Table 3), and we compare their performance with the base and the concat configurations. The plots are given in Tables 6 and 7. During the initial 2,000 to 2,500 batches, all models exhibit nearly equal perplexity for each language pair. After that, the differences become prominent, showing that the relevance-based models yield the lowest perplexity. This is clearly seen in the plots of en-bg, en-hi, en-ms, en-my and en-vi. For en-fi, en-id and en-khm, the baseline systems are sometimes better, but overall our proposed methods are superior.

Impact of Individual Linguistic Features: We further conduct experiments taking the linguistic features (lemma, POS and dependency labels) in isolation, i.e. only one feature is used at a time. The goal is to check the potential of each feature on its own. For each feature we run our proposed methods as well as the base and the concat baselines. The results are reported in Table 8. In comparison with Table 3, the combination of the three features yields the best results. Table 8 shows that the use of any feature in isolation cannot beat the base configuration for en-fi (BLEU 25.59). For the remaining language pairs, the best results are produced by one of the two proposed methods. Of the three features, lemma and dependency labels prove to be the most effective: using the dependency labels gives the maximum BLEU scores for en-bg, en-hi, en-khm and en-my, whereas the use of lemma gives the best results for en-id, en-ms and en-vi.

In the plots of Table 9, the x-axis denotes the dimensions of the vector (from the 1st to the 15th, as the POS embedding size is 15, as given in Table 2) and the y-axis denotes the sigmoid output.
In contrast to the model of Sennrich and Haddow (2016), where feature vectors are simply appended to the word vector, the proposed relevance methods adjust the features according to their importance in the translation task, which is empirically shown to be beneficial for translation. In the case of the word-based relevance method, the sigmoid output of a particular feature follows a specific pattern depending on the semantics of the associated word. For example, we take the part-of-speech NNP (proper noun) and show its sigmoid activation for two different categories of named entities. The first category is the days of the week, i.e. from Sunday to Saturday. The other category is the names of different countries. These proper nouns are chosen as they have not been divided into subwords by BPE in our experiments. The plots are shown in Table 10. In the left graph, the similarities among the plots are clearly seen, as all curves have the global minimum at the 7th dimension and local minima at the 1st, 9th and 11th dimensions. The maxima are found at the 4th and 15th dimensions. In the right graph, there are local minima at the 1st, 7th and 12th dimensions and the maxima are at the 2nd, 3rd and 15th dimensions. This explains why the word-based relevance checking is effective: the features are weighted similarly for semantically close words, leading to a more compact embedding representation of a source token.

Results Using RNN Models: We further explore the effect of feature relevance on an attention-based RNN model. For these experiments, we choose an LSTM network with an encoder composed of a stack of 2 bidirectional layers and a single-layer unidirectional decoder. We apply Luong attention (Luong et al., 2015) in our model. Four different configurations are tested: (i) no feature used, (ii) concat, (iii) self relevance and (iv) word-based relevance. We use the same feature set as used for the Transformer.
The BLEU scores obtained are presented in Table 11. In contrast to the results in Table 3, the relevance checking mechanisms do not perform as well when applied to RNN models. Of the 8 language pairs, the word-based relevance method produces the best results for en-hi, en-khm and en-ms. For the remaining language pairs, the concat configuration outperforms the others. The RNN achieves a higher BLEU score than the Transformer only for en-bg (6.85 in Table 11 vs. 6.25 in Table 3). For the rest of the experiments, the Transformer produces the best results. These results reveal an important finding: although our objective is to enrich the source word representation by masking feature embeddings with attention, the proposed relevance methods are not model-agnostic. Rather, their efficacy is influenced by the network architecture, and they favor the Transformer.

Conclusion
In this article we revisit the ways to incorporate linguistic features in NMT. We argue that it is important to check the relevance of the features instead of just plugging them into the model. To establish our claim, two novel methods are proposed and evaluated under extremely low-resource conditions on eight language pairs. We design word-dependent as well as word-agnostic relevance checking mechanisms and show that by controlling the effects of features we can obtain substantial improvements in translation quality. Our models yield significantly higher BLEU scores compared to the baselines with a modest increase in the number of model parameters. Additionally, we observe lower validation perplexity, which shows that applying feature relevance helps to reduce prediction uncertainty. The methods are further analyzed by visualization of the relevance weights. In the case of the word-based relevance, the feature embeddings are tuned similarly based on the semantics of the corresponding word. This indicates that the proposed models actually pay attention to morphological features, leading to an enriched word representation. Moreover, we assess the proposed relevance methods on an RNN model with attention. The results are not as satisfactory as those obtained for the Transformer model, which requires further investigation. One notable issue in the present work is that the source language has been annotated by a robust and highly accurate parser, which is unlikely to exist for a low-resource language. It is therefore necessary to check the effectiveness of the relevance methods when the annotation is noisy. A future extension of the present work will also focus on exploiting the features in the high-resource scenario. In a resource-rich setting, features are usually redundant as the model learns from a large variety of contexts. Hence, using them effectively in that case will be beneficial for NMT research.