Meaningless yet meaningful: Morphology grounded subword-level NMT

We explore the use of two independent segmentation methods, Byte Pair Encoding (BPE) and Morfessor, to produce the basic units for subword-level neural machine translation (NMT). We show that, for linguistically distant language-pairs, the Morfessor-based segmentation algorithm produces significantly better translation quality than BPE. However, for close language-pairs, BPE-based subword-NMT may translate better than Morfessor-based subword-NMT. We propose a combined approach of these two segmentation algorithms, Morfessor-BPE (M-BPE), which outperforms these two baseline systems in terms of BLEU score. Our results are supported by experiments on three language-pairs: English-Hindi, Bengali-Hindi and English-Bengali.


Introduction
Subword-level NMT is an NMT approach that can tackle the out-of-vocabulary (OOV) problem. In order to train an NMT (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) model for a language-pair, the vocabularies of the source and target languages must be of fixed size. But in reality, the vocabulary of a natural language is open. Some words in the test data may be absent from the system vocabulary. An NMT model cannot interpret the semantics of these OOV words. So, translation quality deteriorates as the number of unseen (rare) words increases (Sutskever et al., 2014).
OOV words are mainly of three types, described in Table 1. The first type of OOV words needs transliteration. But for translating the second type of OOV words, we need to look deeper. A word-based NMT system treats 'house' and 'houses' as two completely different words, which limits the coverage of the vocabulary. Morphological analyzers tackle this problem by segmenting 'houses' as 'house' and 's'. This way the system can cover many words and their inflections too. The third type of OOV words is dealt with by leveraging lexical similarity between language-pairs. Lexically similar languages share many words (cognates, loan words) with similar spelling, pronunciation and meaning. Subword-level approaches are effective for translating such shared words. A character n-gram of a word is called a subword. It may or may not be meaningful. On the other hand, a morpheme is the smallest grammatically meaningful unit of a language. If we segment 'houses' as 'hou'+'ses', then 'hou' and 'ses' will be meaningless subwords. But if we segment 'houses' as 'house'+'s', then 'house' and 's' will be subwords as well as morphemes.

Related work
A word can be segmented into BPE units (Sennrich et al., 2016), orthographic syllables (Kunchukuttan and Bhattacharyya, 2016) or characters (Chung et al., 2016; Costa-jussà and Fonollosa, 2016), or it can be encoded with Huffman encoding (Chitnis and DeNero, 2015). In our experiments we show that BPE subword segmentation is suitable for translation between linguistically close language-pairs, whereas Morfessor-based segmentation is suitable for translation between linguistically distant language-pairs. Our proposed subword segmentation approach utilizes the benefits of both BPE and Morfessor (Creutz and Lagus, 2006) and performs well for both linguistically close and distant language-pairs. BPE (Gage, 1994) is originally a data compression technique. The main idea behind BPE is-"Find the most frequent pair of consecutive two character codes in the text, and then substitute an unused code for the occurrences of the pair." (Shibata et al., 1999)

BPE as subword unit
BPE works as a subword segmentation method for both NMT (Sennrich et al., 2016) and SMT (Kunchukuttan and Bhattacharyya, 2017). In this method, two vocabularies are used: a training vocabulary and a symbol vocabulary. Words in the training vocabulary are character-sequences followed by an end-of-word symbol. At first, all characters are added to the symbol vocabulary. Then the most frequent symbol bigram is merged into a new symbol, which is added to the symbol vocabulary, and all occurrences of the bigram are replaced by the new symbol. This step is repeated a fixed number of times; that number (the number of merge operations) is a hyperparameter.
Starting from the character level, as the number of merge operations increases, first frequent character-sequences and then full words are added as single symbols. So, the number of merge operations controls a trade-off between the NMT model vocabulary size and the length of training sentences. The symbol '@@' is used here to mark segmentation points.
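The merge procedure described above can be illustrated with a small Python sketch. This is a toy illustration and not the subword-nmt implementation: the function names (learn_bpe, merge_pair, segment), the '</w>' end-of-word marker and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dictionary."""
    # Training vocabulary: each word is a character sequence plus an end-of-word symbol.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent bigram; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment `word` with learned merges, marking non-final pieces with '@@'."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    pieces = list(symbols)
    # Drop the end-of-word marker, whether it is standalone or merged into the last piece.
    if pieces[-1] == "</w>":
        pieces.pop()
    else:
        pieces[-1] = pieces[-1][: -len("</w>")]
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]


freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(segment("lowest", learn_bpe(freqs, 0)))   # character-level pieces
print(segment("lowest", learn_bpe(freqs, 10)))  # coarser pieces after 10 merges
```

With zero merges the output is purely character-level; with enough merges every training word becomes a single symbol, which matches the trade-off between vocabulary size and sentence length noted above.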

Hyperparameter selection of BPE
A high number of merge operations adds almost all words to the symbol vocabulary. This prevents the NMT system from translating at the subword level.
With BPE subword segmentation, the average sentence length increases because words are broken into subwords. The longer the sentences, the harder it becomes for NMT to learn well from them (Bahdanau et al., 2015). So, proper tuning of this hyperparameter is needed. A higher number of merge operations makes the elements more word-like. A lower number of merge operations makes the elements more character-like, in which case character-to-character mappings sometimes introduce transliterated words into the translation output.

Comparison of BPE segmentation with Morfessor
The goal of morphological analyzers such as Morfessor is to segment a word into its morphs, the surface forms of morphemes.
A comparison between BPE subword segmentation and Morfessor is given below.
• BPE is a greedy approach. Morfessor takes the most probable segmentation of words and deals with local optima by removing and adding word tokens. So, Morfessor produces more acceptable morphological segmentations than BPE.
• The main advantage of BPE is that it addresses the OOV problem in two ways: i) some segmentations are almost morphological segmentations, and ii) some segmentations are nearly character-level segmentations. As a result, OOV words are either transliterated or receive partially correct translations. Morfessor, in contrast, does not fall back to character-level segmentation when some morphs are absent from the dictionary. In such cases, it faces the OOV problem.
Our Morfessor-based segmentation algorithm takes all the valid words from the corpora and passes them through Morfessor. After getting their morphological segmentations, we substitute these for the words at their respective places in the data. As with BPE, '@@' is used here to mark segmentation points. This means that, while decoding from subwords, we need to join each subword ending in '@@' with the next subword.
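The decoding step, in which each subword carrying the '@@' marker is joined with the following subword, can be sketched as below. The function name join_subwords is hypothetical; the sketch assumes whitespace-separated tokens and a trailing marker on every non-final subword.

```python
def join_subwords(tokens, marker="@@"):
    """Rebuild words from subword tokens; a token ending in `marker` continues into the next."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            # Non-final piece: strip the marker and keep accumulating.
            current += tok[: -len(marker)]
        else:
            # Final piece of a word (or an unsegmented word).
            words.append(current + tok)
            current = ""
    if current:
        # A dangling marked piece at the end of the output is kept as its own word.
        words.append(current)
    return words


print(" ".join(join_subwords("the hou@@ se@@ s are big".split())))
```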

Our approach
The idea behind our proposed combined approach, M-BPE, comes from comparing BPE and Morfessor. The hypothesis of this approach is-"Words should be segmented into real morphs. After that, segmentation of morphs into subwords may be beneficial to handle open vocabulary." Words can be morphologically segmented using Morfessor. BPE is then helpful for OOV morphs of types 1 and 3 described in section 1. The workflow of this approach is described below.
Step 1: Use Morfessor on the set of all words from the dataset.
Step 2: Find and replace all occurrences of these words with their segmented form (the symbol '**' is used to keep the information about segmentation positions). For example, 'googling' will be segmented into two morphs, 'googl**' and 'ing'.
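Steps 1 and 2 can be sketched in Python as follows. The dictionary morph_table is a hypothetical stand-in for the output of Morfessor (word mapped to its list of morphs); no real Morfessor call is made, and the function names are assumptions for this example. Following the hypothesis stated above, the resulting morph-level text would presumably then be passed to BPE, which is not shown here.

```python
def mark_morphs(morphs, marker="**"):
    """Attach `marker` to every morph except the last one of a word."""
    return [m + marker for m in morphs[:-1]] + morphs[-1:]


def apply_morph_segmentation(sentence, morph_table, marker="**"):
    """Replace each word by its marked morphs; words missing from the table stay whole."""
    out = []
    for word in sentence.split():
        out.extend(mark_morphs(morph_table.get(word, [word]), marker))
    return " ".join(out)


# Hypothetical segmenter output for the example word used in Step 2.
morph_table = {"googling": ["googl", "ing"]}
print(apply_morph_segmentation("she was googling", morph_table))
```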

Hyperparameter selection of M-BPE
As the average number of elements per sentence increases, the performance of an NMT model degrades (Bahdanau et al., 2015). Using the same number of merge operations for both BPE and M-BPE produces a higher number of elements per sentence for M-BPE than for BPE, because the Morfessor part of M-BPE increases the number of elements of a sentence before the BPE part is applied to it. In order to get a fair comparison between BPE and M-BPE, we have adjusted their hyperparameters so that the average number of elements per sentence after segmentation becomes almost the same. To maintain that criterion, we have kept the number of merge operations of M-BPE higher than that of BPE.
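The matching criterion can be made concrete with a short sketch: compute the average number of elements per segmented sentence, then pick the candidate number of merge operations whose average is closest to a target. The function names and all numbers in the usage example are illustrative assumptions, not values from our experiments.

```python
def avg_elements(segmented_sentences):
    """Average number of whitespace-separated elements per sentence."""
    if not segmented_sentences:
        return 0.0
    total = sum(len(s.split()) for s in segmented_sentences)
    return total / len(segmented_sentences)


def closest_setting(target_avg, candidates):
    """Pick the merge count whose average length is nearest to `target_avg`.

    `candidates` maps a number of merge operations to the average number of
    elements per sentence measured after segmenting with that setting.
    """
    return min(candidates, key=lambda k: abs(candidates[k] - target_avg))


# Illustrative values only: the target would be measured on BPE-segmented data,
# the candidates on M-BPE-segmented data.
target = avg_elements(["a b c", "d e@@ f g h"])
print(closest_setting(target, {1000: 6.5, 5000: 4.2, 20000: 3.1}))
```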

Experimental setup
There are three subword segmentation systems in our experiment, namely BPE, Morfessor and M-BPE. We have used subword-nmt for BPE segmentation, and Flatcat and NLP Indic Library for producing morphological segmentations of English and Indian-language words.

Datasets
We have used data for the English-Hindi (En-Hi), English-Bengali (En-Bn) and Bengali-Hindi (Bn-Hi) language-pairs from the health and tourism domains of the multilingual parallel Indian Language Corpora Initiative (ILCI) corpus (Jha, 2010). We cleaned and tokenized the training corpus. English data was tokenized using the Stanford tokenizer (Klein and Manning, 2003) and then true-cased using truecase.perl, provided in the MOSES toolkit. Hindi and Bengali data were tokenized using NLP Indic Library (Kunchukuttan et al., 2014). The parallel sentences were then divided into three parts for training, testing and tuning/validation. For each language-pair, we have 44,777 sentence-pairs in the training data, 1,000 sentence-pairs in the tuning data and 2,000 sentence-pairs in the test data.

System details
After tokenization, the words of the source sentences are broken into subwords using a segmentation algorithm. The NMT system receives the sequence of subwords of a sentence as input and produces a subword-sequence in the target language as output. Then, the subwords are combined into words in order to obtain the actual sentence in the target language. We have used NEMATUS (Sennrich et al., 2017) as the attention-based encoder-decoder NMT system in our experiment.

Results and discussion
The example given below shows the difference among three segmentations:
[Segmentation example missing from the source.]
[Figure: BLEU score (Papineni et al., 2002) when we train NMT models using sentences with increasing average number of elements (by tuning hyperparameters). Here, two paths indicate two different approaches of segmentation: i) from word level to BPE level, ii) from word level to M-BPE level via Morfessor level.]
Table 2 compares the NMT output accuracies of word-level, Morfessor-level, BPE-level and M-BPE-level systems for the three language-pairs. The tuned numbers of merge operations of BPE and M-BPE for Bn-Hi are 3k and 6k respectively. For En-Hi, these are 10k and 90k respectively, and for En-Bn these are 7k and 15k respectively. Translation between lexically close language-pairs like Bn-Hi involves more character-to-character mappings than En-Hi. For that reason, the Bn-Hi language-pair needs a lower value of the hyperparameter than En-Hi.
Some findings from the results are listed below.
• For En-Hi and En-Bn language-pairs, Morfessor produces better quality translation than BPE.
• For the Bn-Hi language-pair, BPE is capable of producing better translation than Morfessor as the segmentation algorithm.
• M-BPE can maintain translation quality for all language-pairs.
In the case of the Bn-Hi language-pair, BPE improves on the baseline (word-level) translation quality. But in the case of En-Hi and En-Bn, it fails to show a considerable improvement. In the En-Hi and En-Bn language-pairs, the source and target languages differ considerably in syntax (word order, morphology) and share little lexical similarity. Bengali and Hindi are much closer to each other than the languages in the En-Hi and En-Bn pairs. This property of the Bn-Hi language-pair helps its translation model to figure out mappings between source and target n-grams. Grammatical rules of the languages may not be revealed because of morphologically wrong segmentations. But this hardly affects Bn-Hi translation quality because of the syntactic similarity of the two languages. Moreover, small subwords add transliterated words to the output, which is favorable for Bn-Hi translation.
In the case of En-Hi and En-Bn, translation models do not easily find mappings between arbitrary source and target subwords. Subwords are useful only if they are real morphs. For these language-pairs, correct segmentation of a word is necessary, not only for getting an accurate translation of the word, but also for capturing its grammar (word order and function words). The En-Hi and En-Bn language-pairs do not have much lexical similarity, so small meaningless subwords do not help in that case and can even degrade the translation quality.
M-BPE can segment words correctly. It can also produce small subwords by further segmenting morphs. So, by tuning its hyperparameter, we can make it suitable for all language-pairs, i.e. both linguistically close and linguistically distant ones.

Conclusion and future work
As a subword segmentation algorithm, M-BPE outperforms baseline BPE for both lexically close and lexically distant language-pairs. However, when compared with baseline Morfessor, the improvement due to M-BPE depends on lexical closeness. For the lexically close language-pair the improvement is significant. In that case, meaningless BPE subwords play a meaningful role in improving translation quality. Future investigation will focus on automatic tuning of the hyperparameter for M-BPE.