Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation

Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough and are of rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation and because of the high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation in low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs, more than 2 million of which were not available before. Training neural models on this corpus, we achieve an improvement of more than 9 BLEU points over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first ever large-scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.


Introduction
Recent advances in deep learning (Bahdanau et al., 2015; Wu et al., 2016; Vaswani et al., 2017) have aided in the development of neural machine translation (NMT) models to achieve state-of-the-art results in several language pairs. But a large number of high-quality sentence pairs must be fed into these models to train them effectively (Koehn and Knowles, 2017); in fact, the lack of such a corpus severely affects their performance. Although there have been efforts to improve machine translation in low-resource contexts, particularly using, for example, comparable corpora (Irvine and Callison-Burch, 2013), small parallel corpora (Gu et al., 2018) or zero-shot multilingual translation (Johnson et al., 2017), such languages are yet to achieve noteworthy results compared to high-resource ones. Unfortunately, Bengali, the seventh (fifth) most widely spoken language in the world by the number of (native) speakers, has still remained a low-resource language. As of now, only a few parallel corpora for the Bengali language are publicly available (Tiedemann, 2012), and those too suffer from poor sentence segmentation, resulting in poor alignments. They also contain much noise, which, in turn, hurts translation quality (Khayrallah and Koehn, 2018). No previous work on Bengali-English machine translation addresses any of these issues.
* These authors contributed equally to this work.
With the above backdrop, in this work, we develop a customized sentence segmenter for the Bengali language while keeping uniformity with the English side segmentation. We experimentally show that better sentence segmentation that maintains homogeneity on both sides results in better alignments. We further empirically show that the choice of sentence aligner plays a significant role in the quantity of parallel sentences extracted from document pairs. In particular, we study three aligners and show that combining their results, which we name 'Aligner Ensembling', increases recall. We introduce 'Batch Filtering', a fast and effective method for filtering out incorrect alignments. Using our new segmenter, aligner ensemble, and batch filter, we collect a total of 2.75 million high-quality parallel sentences from a wide variety of domains, more than 2 million of which were not previously available. Training NMT models on our corpus, we outperform previous approaches to Bengali-English machine translation by more than 9 BLEU (Papineni et al., 2002) points and also show competitive performance with automatic translators. We also prepare a new test corpus containing 1000 pairs made with extensive manual and automated quality checks. Furthermore, we perform an ablation study to validate the soundness of our design choices.
We release all our tools, datasets, and models for public use. To the best of our knowledge, this is the first ever large-scale study on machine translation for the Bengali-English pair. We believe that the insights brought to light through our work may give new life to Bengali-English MT, which has suffered so far from being low in resources. We also believe that our findings will help design more efficient methods for other low-resource languages.

Sentence Segmentation
Proper sentence segmentation is an essential prerequisite for sentence aligners to produce coherent alignments. However, segmenting a text into sentences is not a trivial task, since the end-of-sentence punctuation marks are ambiguous. For example, in English, the end-of-sentence period, abbreviations, ellipses, and decimal points all use the same symbol (.). Since either side of a document pair can contain Bengali, English, or foreign text, we need a sentence segmenter that produces consistent segmentation in a language-independent manner. Existing segmenters, e.g., Polyglot (Al-Rfou' et al., 2013), do not work particularly well for Bengali sentences with abbreviations, which are common in many domains. For instance, Polyglot inaccurately splits the input sentence in Figure 1 into three segments, whereas the English side can successfully detect the non-breaking tokens. Not only does this corrupt the first alignment, but it also causes the two broken pieces to be aligned with other sentences, creating a chain of incorrect alignments.
SegTok, a rule-based segmentation library, does an excellent job of segmenting English texts. SegTok uses regular expressions to handle many complex cases, e.g., technical texts, URLs, abbreviations. We extended SegTok's code to have the same functionality for Bengali texts by adding new rules (e.g., quotations, parentheses, bullet points) and abbreviations identified through analyzing both the Bengali and English sides of our corpus, enhancing SegTok's English segmentation correctness as well. Our segmenter can now address issues like the example mentioned and provide consistent outputs in a language-agnostic manner.
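The idea of a rule-based segmenter that refuses to break after known abbreviations can be illustrated with a minimal sketch. This is not SegTok's API or our actual rule set: the function name `segment`, the `NON_BREAKING` list, and the whitespace tokenization are assumptions made only for illustration.

```python
from typing import List

# Hypothetical non-breaking abbreviations (illustrative, not the real list).
NON_BREAKING = {"Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e.", "ডা.", "মো."}
# Sentence-final marks handled on both sides, including the Bengali danda.
TERMINATORS = (".", "!", "?", "।")


def segment(text: str) -> List[str]:
    """Split text into sentences, never breaking after a known abbreviation."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        # Break only when the token ends a sentence and is not an abbreviation.
        if token.endswith(TERMINATORS) and token not in NON_BREAKING:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences
```

Because the same rules run on both sides of a document pair, an abbreviation such as "Dr." and its Bengali counterpart are kept inside the sentence on both sides, which is the consistency property the aligners rely on.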
We compared the performance of our segmenter on different aligners against Polyglot. We found that although the number of aligned pairs decreased by 1.37%, the total number of words on both sides increased by 5.39%, making the resulting parallel corpus richer in content than before. This also bolsters our hypothesis that Polyglot creates unnecessary sentence fragmentation.

Aligner Descriptions
Most available resources for building parallel corpora come in the form of parallel documents which are exact or near-exact translations of one another. Sentence aligners are used to extract parallel sentences from them, which are then used as training examples for MT models. Abdul-Rauf et al. (2012) conducted a comparative evaluation of five aligners and showed that the choice of aligner had a considerable effect on the performance of the models trained on the resultant bitexts. They identified three aligners with superior performance: Hunalign (Varga et al., 2005), Gargantua (Braune and Fraser, 2010), and Bleualign (Sennrich and Volk, 2010).
However, their results showed performance only in terms of BLEU score, with no indication of any explicit comparison metric between the aligners (e.g., precision, recall). As such, to make an intrinsic evaluation, we sampled 50 documents from four of our sources (detailed in section 4.2) with their sentence counts on either side ranging from 20 to 150. We aligned sentences from these documents manually (i.e., the gold alignment) and removed duplicates, which resulted in 3,383 unique sentence pairs. We then aligned the documents again with the three aligners using our custom segmenter. Table 1 shows performance metrics of the aligners.
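The intrinsic evaluation above compares each aligner's output with the gold alignments. A minimal sketch of that comparison, assuming alignments are represented as sets of (source, target) sentence pairs (the function name and the representation are illustrative assumptions):

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Score an aligner's extracted pairs against manually aligned gold pairs."""
    correct = len(predicted & gold)  # pairs the aligner got exactly right
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Representing alignments as sets also removes duplicate pairs, in line with the deduplication applied to the gold alignments.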

Aligner Ensembling and Filtering
From the results in Table 1, it might seem that Hunalign is the ideal aligner choice. But upon closer inspection, we found that each aligner was able to correctly align some pairs that the other two had failed to align. Since we had started from a low-resource setup, it would be in our best interest to combine the data extracted by all aligners. As such, we 'ensembled' the results of the aligners as follows. For each combination of the aligners (4 combinations in total; see Table 2), we took the union of sentence pairs extracted by each constituent aligner of the said combination for each document. The performance of the aligner ensembles is shown in Table 2. We concatenated the first letters of the constituent aligners to name each ensemble (e.g., HGB refers to the combination of all three of them). Table 2 shows that BH achieved the best F1 score among all ensembles, even 0.89% above the best single aligner, Hunalign. Ensembling increased the recall of BH by 8.94% compared to Hunalign, but also hurt precision severely (by 7.05%), due to the accumulation of incorrect alignments made by each constituent aligner. To mitigate this effect, we used the LASER toolkit to filter out incorrect alignments. LASER, a cross-lingual sentence representation model, uses similarity scores between the embeddings of candidate sentences to perform as both aligner (Schwenk et al., 2019) and filter. We used LASER as a filter on top of the ensembles, varied the similarity margin (Artetxe and Schwenk, 2019) from 0.90 to 1.10 in increments of 0.01, and plotted the performance metrics in Figure 2. We also reported the performance of LASER as a standalone aligner (referred to as L in the figure; +L indicates the application of LASER as a filter). The dashed lines indicate ensemble performance without the filter.
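The ensembling step described above, i.e., the per-document union of the pairs extracted by each constituent aligner, can be sketched as follows. The function names and the dictionary layout are assumptions for illustration, and the sketch orders the letters of each label alphabetically, which may differ from the ordering used in the tables.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def ensemble(outputs: Dict[str, Set[Pair]], members: Iterable[str]) -> Set[Pair]:
    """Union of the sentence pairs extracted by each member aligner for one document."""
    merged: Set[Pair] = set()
    for name in members:
        merged |= outputs[name]
    return merged


def all_ensembles(outputs: Dict[str, Set[Pair]]) -> Dict[str, Set[Pair]]:
    """Build every ensemble of two or more aligners, labeled by first letters."""
    names = sorted(outputs)
    result: Dict[str, Set[Pair]] = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            label = "".join(name[0].upper() for name in combo)
            result[label] = ensemble(outputs, combo)
    return result
```

With three aligners this yields exactly four ensembles (three pairs and one triple), matching the number of combinations considered here.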
As Figure 2a indicates, ensembles achieve a significant gain in precision with the addition of the LASER filter. While recall (Figure 2b) does not decline significantly at first, it starts to plunge when the margin exceeds 1.00. We balanced between the two by considering the F1 score (Figure 2c). Table 3 shows the performance metrics of LASER and all filtered ensembles at the margins that maximize their respective F1 scores. Table 3 shows that despite being a good filter, LASER as an aligner does not perform as well as the filtered ensembles. The best F1 score is achieved by the BH ensemble with its margin set to 0.96. Its precision increased by 5.75% while trailing a mere 1.16% in recall behind its non-filtered counterpart. Compared to single Hunalign, its recall had a 7.78% gain, while lagging in precision by only 1.30%, with an overall F1 score increase of 3.38%. Thus, in all future experiments, we used BH+L(0.96) as our default aligner with the mentioned filter margin.
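The margin selection described above amounts to a one-dimensional sweep that keeps the margin with the highest F1. A minimal sketch, assuming each candidate pair already carries a similarity score (the `scored` mapping and the function names are illustrative assumptions; computing the scores themselves is left to the filter):

```python
from typing import Dict, Optional, Set, Tuple

Pair = Tuple[str, str]


def f1_score(predicted: Set[Pair], gold: Set[Pair]) -> float:
    """F1 of the predicted pairs against the gold alignments."""
    correct = len(predicted & gold)
    if not predicted or not gold or not correct:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_margin(scored: Dict[Pair, float], gold: Set[Pair],
                low: float = 0.90, high: float = 1.10,
                step: float = 0.01) -> Tuple[Optional[float], float]:
    """Sweep the margin from low to high and return (margin, F1) with the best F1."""
    best: Tuple[Optional[float], float] = (None, -1.0)
    steps = round((high - low) / step)
    for i in range(steps + 1):
        margin = round(low + i * step, 2)  # avoid floating-point drift
        kept = {pair for pair, score in scored.items() if score >= margin}
        f1 = f1_score(kept, gold)
        if f1 > best[1]:  # ties keep the smallest margin
            best = (margin, f1)
    return best
```

Raising the margin removes low-scoring pairs, so precision tends to rise while recall falls; the sweep simply picks the point where the two balance best.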

Training Data and Batch Filtering
We categorize our training data into two sections: (1) Sentence-aligned corpora and (2) Document-aligned corpora.

Sentence-aligned Corpora
We used the corpora mentioned below which are aligned by sentences:

Document-aligned Corpora
The corpora below have document-level links, from which we sentence-aligned them:
Globalvoices: Global Voices publishes and translates articles on trending issues and stories from press, social media, and blogs in more than 50 languages. Although OPUS provides a sentence-aligned corpus from Global Voices, we re-extracted sentences using our segmenter and filtered ensemble, resulting in a larger number of pairs compared to OPUS.
JW: Agić and Vulić (2019) introduced JW300, a parallel corpus of over 300 languages crawled from jw.org, which also includes Bengali-English. They used Polyglot (Al-Rfou' et al., 2013) for sentence segmentation and Yasa (Lamraoui and Langlais, 2013) for sentence alignment. We randomly sampled 100 sentences from their Bengali-English corpus and found only 23 alignments to be correct. So we crawled the website using their provided instructions and aligned using our segmenter and filtered ensemble. This yielded more than twice as much data as theirs.
Banglapedia: "Banglapedia: the National Encyclopedia of Bangladesh" is the first Bangladeshi encyclopedia. Its online version contains over 5,700 articles in both Bengali and English. We crawled the website to extract the article pairs and aligned sentences with our segmenter and filtered ensemble.
Bengali Translation of Books: We collected translations of more than 100 books available on the Internet with their genres ranging from classic literature to motivational speeches and aligned them using our segmenter and filtered ensemble.
Bangladesh Law Documents: The Legislative and Parliamentary Affairs Division of Bangladesh makes all laws available on their website. Some older laws are also available under the "Heidelberg Bangladesh Law Translation Project". Segmenting the laws was not feasible with the aligners in section 3.1 as most lines were bullet points terminating in semicolons, and treating ...
The global approach suffered from another issue: memory usage. The datasets were too large to fit into the GPU as a whole (footnote 15). Thus, we shifted the neighbor search to the CPU, but that again took more than a day to complete. Also, the percentage of filtered pairs was considerably higher than with the local neighborhood approach, raising the issue of data scarcity again. So, we sought the following middle ground between the global and local approaches: for each source, we merged all alignments into a single file, shuffled all pairs, split the file into batches of 1k pairs, and then applied LASER locally on each batch, reducing running time to less than two hours.
In Table 5, we show the percentage of filtered out pairs from the sources for each neighborhood choice. The global approach lost about twice the data compared to the other two. The 1k batch neighborhood achieved comparable performance with respect to the more fine-grained document-level neighborhood while improving running time more than tenfold. Upon further inspection, we found that more than 98.5% of the pairs from the document-level filter were present in the batched approach. So, in subsequent experiments, we used 'Batch Filtering' as standard. In addition to the document-aligned sources, we also used batch filtering on each sentence-aligned corpus in section 4.1 to remove noise from them. Table 4 summarizes our training corpus after the filtering.
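The batch filtering procedure (merge, shuffle, split into fixed-size batches, score each pair against neighbors from its own batch only) can be sketched in pure Python. This is an illustration under stated assumptions, not the LASER toolkit's API: `embed` is a hypothetical callback standing in for a sentence encoder, the score is the ratio margin of Artetxe and Schwenk (2019) restricted to the batch, and the defaults for `margin` and `k` are placeholders.

```python
import math
import random
from typing import Callable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]
Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def margin_scores(src: List[Vector], tgt: List[Vector], k: int = 4) -> List[float]:
    """Ratio-margin score of each pair (i, i), using only neighbors inside the batch."""
    n = len(src)
    k = min(k, n)
    sim = [[cosine(s, t) for t in tgt] for s in src]
    # Mean similarity of each sentence to its k nearest neighbors on the other side.
    src_nn = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    tgt_nn = [sum(sorted((sim[i][j] for i in range(n)), reverse=True)[:k]) / k
              for j in range(n)]
    scores = []
    for i in range(n):
        denom = (src_nn[i] + tgt_nn[i]) / 2
        scores.append(sim[i][i] / denom if denom > 0 else 0.0)
    return scores


def batch_filter(pairs: Iterable[Pair], embed: Callable[[str], Vector],
                 margin: float = 0.96, batch_size: int = 1000,
                 k: int = 4, seed: int = 0) -> List[Pair]:
    """Shuffle all pairs, split them into batches, and keep pairs scoring >= margin."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    kept: List[Pair] = []
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        scores = margin_scores([embed(s) for s, _ in batch],
                               [embed(t) for _, t in batch], k)
        kept.extend(pair for pair, score in zip(batch, scores) if score >= margin)
    return kept
```

Restricting the neighbor search to a batch keeps the similarity matrix at batch_size × batch_size regardless of corpus size, which is what avoids the memory problem of the global approach. Shuffling before batching means the neighbors of a pair are a random sample of the source rather than sentences from the same document.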

Evaluation Data
A major challenge for low-resource languages is the unavailability of reliable, publicly available evaluation benchmarks. After exhaustive searching, we found two decent test sets and developed one ourselves. They are mentioned below:
SIPC: Post et al. (2012) used crowdsourcing to build a collection of parallel corpora between English and six Indian languages, including Bengali. Although they are not translated by experts and have issues for many sentences (e.g., all capital letters on English side, erroneous translations, punctuation incoherence between Bn and En side, presence of foreign texts), they provide four English translations for each Bengali sentence, making it an ideal test-bed for evaluation using multiple references. We only evaluated the performance of Bn→En for this test set.
Footnote 15: We used an RTX 2070 GPU with 8GB VRAM for these experiments.
SUPara-benchmark (Mumin et al., 2018): Although it has many spelling errors, incorrect translations, and sentences that are too short (fewer than 50 characters) or too long (more than 500 characters), we used it for our evaluation because of its balanced nature, with sentences from a variety of domains.
RisingNews: Since the two test sets mentioned above suffer from many issues, we created our own test set. Risingbd, an online news portal in Bangladesh, publishes professional English translations for many of their articles. We collected about 200 such article pairs and had them aligned by an expert. We had them post-edited by another expert. We then removed, through automatic filtering, pairs that had (1) fewer than 50 or more than 250 characters on either side, (2) more than 33% transliterations or (3) more than 50% or more than 5 OOV words. This resulted in 600 validation and 1000 test pairs; we named this test set "RisingNews".
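The length and OOV criteria of the automatic filtering above can be sketched as a simple predicate. Several details are assumptions: the vocabularies are hypothetical inputs, words are split on whitespace, criterion (3) is read as "OOV ratio above 50% or OOV count above 5" applied to each side, and the transliteration check (2) is omitted because it needs a transliteration detector.

```python
from typing import Set


def keep_pair(src: str, tgt: str, src_vocab: Set[str], tgt_vocab: Set[str],
              min_chars: int = 50, max_chars: int = 250,
              max_oov_ratio: float = 0.5, max_oov_count: int = 5) -> bool:
    """Return True if both sides pass the length and OOV checks."""
    for text, vocab in ((src, src_vocab), (tgt, tgt_vocab)):
        # Criterion (1): character length bounds on either side.
        if not (min_chars <= len(text) <= max_chars):
            return False
        # Criterion (3), as interpreted here: too many out-of-vocabulary words.
        words = text.split()
        oov = sum(1 for word in words if word not in vocab)
        if oov > max_oov_count:
            return False
        if words and oov / len(words) > max_oov_ratio:
            return False
    return True
```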

Pre-processing
Before feeding the data into the training pipeline, we performed the following pre-processing steps sequentially:
1. We normalized punctuation and characters that have multiple Unicode representations to reduce data sparsity.
2. We removed foreign strings that appear on both sides of a pair, mostly phrases from which both sides of the pair have been translated.
3. We transliterated all dangling English letters and numerals on the Bn side into Bengali, mostly constituting bullet points.
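Steps 1 and 3 can be illustrated with a short sketch built on the Python standard library. The choice of NFKC normalization, the small punctuation map, and the function names are assumptions for illustration only; the exact normalization rules are not specified here, and the sketch covers numerals but not dangling English letters.

```python
import unicodedata

# Illustrative punctuation map (an assumption, not the full rule set).
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-",                  # en dash
})

# ASCII digits 0-9 mapped to Bengali digits (U+09E6 to U+09EF).
DIGIT_MAP = {ord(str(d)): chr(0x09E6 + d) for d in range(10)}


def normalize(text: str) -> str:
    """Unify Unicode representations, punctuation variants, and whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(PUNCT_MAP)
    return " ".join(text.split())


def to_bengali_digits(text: str) -> str:
    """Replace ASCII numerals with their Bengali counterparts."""
    return text.translate(DIGIT_MAP)
```

Collapsing visually identical strings into a single code point sequence means they map to the same subword tokens, which is how normalization reduces data sparsity.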
At this point, a discussion with respect to language classification is in order. It is standard practice to use a language classifier (e.g., Joulin et al., 2017) to filter out foreign texts. But when we used it, it classified a large number of valid English sentences as non-English, mostly because they contained named entities transliterated from the Bengali side. Fearing that this filtering would hurt translation of named entities, we left language classification out altogether. Moreover, most of our sources are bilingual and we explicitly filtered out sentences with foreign characters, so foreign texts would be minimal.
As for the test sets, we performed minimal preprocessing: we applied character and punctuation normalization; and since SIPC had some sentences that were all capital letters, we lowercased those (and those only).

Comparison with Previous Results
We compared our results with Mumin et al. (2019b), Hasan et al. (2019), and Mumin et al. (2019a). The first work used SMT, while the latter two used NMT models. All of them evaluated on the SUPara-benchmark test set. We used the OpenNMT (Klein et al., 2017) implementation of the big Transformer model (Vaswani et al., 2017) with a 32k vocabulary on each side learnt by the Unigram Language Model with subword regularization (l=32, α=0.1) (Kudo, 2018) and tokenized using SentencePiece (Kudo and Richardson, 2018). To maintain consistency with previous results, we used lowercased BLEU (Papineni et al., 2002) as the evaluation metric. Comparisons are shown in Table 6.
As is evident from the scores in Table 6, we outperformed all previous works by more than 9 BLEU points for Bn→En. Although the improvement for En→Bn (5.5+) is not as striking as for Bn→En, it is nevertheless commendable given that Bengali is a morphologically rich language.

Comparison with Automatic Translators
We compared our models' SacreBLEU (Post, 2018) scores with Google Translate and Bing Translator, two of the most widely used publicly available automatic translators. Results are shown in Table 7. From Table 7, we can see that our models have superior results on all test sets when compared to Google and Bing.

Evaluation on RisingNews
We performed evaluation on our own test set, RisingNews. We show our models' lowercased detokenized BLEU and mixedcased SacreBLEU scores in Table 8.

Metric                        Bn→En   En→Bn
Lowercased detokenized BLEU   39.04   27.73
Mixedcased SacreBLEU          36.1    27.7

We took great care in creating the test set by performing extensive manual and automatic quality control, and believe it is better in quality than most available evaluation sets for Bengali-English. We also hope that our performance on this test set will act as a baseline for future works on Bengali-English MT. In Figure 3, we show some example translations from the RisingNews test set.

Comparison with Human Performance
Recall that SIPC had four reference English translations for each Bengali sentence. We used the final translation as a baseline human translation and used the other three as ground truths (the fourth reference had the best score among all permutations). To make a fair comparison, we evaluated our model's score on the same three references.

Figure 3: Example translations from the RisingNews test set.

Source: বাংলাদেশে ডিজিটাল বই প্রকাশ অনেক কারণেই গড়ে উঠেনি, যার মধ্যে রয়েছে ই-বুক রিডারের উচ্চ মূল্য এবং চাহিদার অভাব।
Reference: In Bangladesh, publishing of digital books has not yet picked up due to a lot of reasons such as the high price of e-book readers and lack of demand.
Prediction: The publication of digital books in Bangladesh has not been developed for many reasons, including the high price and lack of demand of e-book readers.

Reference: Japan extended its full support to Bangladesh's call for safe and dignified return of Rohingyas to their homeland.
Prediction: Japan has expressed full support for Bangladesh's stance on safe and dignified repatriation of Rohingyas to their homelands.

Prediction: In the middle of this month, situation began to deteriorate after the security forces launched an operation in the remote hilly area.

Ablation Study of Filtered Ensembles
To validate that our choice of ensemble and filter had a direct impact on translation scores, we performed an ablation study. We chose four combinations based on their F1 scores from section 3. To ensure an apples-to-apples comparison, we only used data from the parallel documents, i.e., the Globalvoices, JW, Banglapedia, HRW, Books, and Wiki sections.
Despite its superiority in data count, BH could not perform well enough due to the accumulation of incorrect alignments from its constituent aligners. A clearer picture can be seen in Figure 4. BH+L(.96) mitigated both data shortage and incorrect alignments and formed a clear envelope over the other three, giving clear evidence that the filter and the ensemble complemented one another.

Related Works
The first initiative towards machine translation for Bengali dates back to the 90s. Sinha et al. (1995) developed ANGLABHARTI, a rule-based translation system. Dasgupta et al. (2004) conducted extensive syntactic analyses to write rules for constructing Bengali parse trees and designed algorithms to transfer between Bengali and English parse trees. Subsequently, Saha and Bandyopadhyay (2005) reported an example-based machine translation approach for translating news headlines using a knowledge base. Naskar and Bandyopadhyay (2005) described a hybrid between rule-based and example-based translation approaches; here terminals would end at phrases that would then be looked up in the knowledge base.
The improved translation quality of phrase-based statistical machine translation (SMT) (Koehn et al., 2003) and the wide availability of toolkits thereof (Koehn et al., 2007) created an increased interest in SMT for Bengali-English. As SMT was more data-driven, specialized techniques were integrated to account for the low amount of parallel data for Bengali-English. Among many, Roy (2009) ...
Although NMT is currently being hailed as the state of the art, very little work has been done on NMT for the Bengali-English pair. Dandapat and Lewis (2018) trained a deployable general domain NMT model for Bengali-English using sentences aligned from comparable corpora. They combated the inadequacy of training examples by data augmentation using back-translation (Sennrich et al., 2016). Hasan et al. (2019) and Mumin et al. (2019a) also showed, with the limited parallel data available on the web, that NMT provided improved translation for the Bengali-English pair.

Conclusion and Future Works
In this work, we developed a custom sentence segmenter for Bengali, showed that aligner ensembling with batch filtering provides better performance than single sentence aligners, collected a total of 2.75 million high-quality parallel sentences for Bengali-English from multiple sources, trained NMT models that outperformed previous results, and prepared a new test set, thus elevating Bengali from its low-resource status. In future, we plan to design segmentation-agnostic aligners or aligners that can jointly segment and align sentences. We want to experiment more with the LASER toolkit: we used LASER out of the box, and we want to train it with our data and modify the model architecture to improve it further. LASER fails to identify one-to-many/many-to-one sentence alignments, and we want to address this. We would also like to experiment with BERT (Devlin et al., 2019) embeddings for similarity search. Furthermore, we wish to explore semi-supervised and unsupervised approaches to leverage monolingual data and explore multilingual machine translation for low-resource Indic languages.