A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages

We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.


Introduction
Statistical Machine Translation (SMT, henceforth MT) is a highly data-driven field that relies on parallel language datasets for training, tuning, and evaluation. Prime examples of such modern-day digital Rosetta Stones include the United Nations corpus (six languages) and the European Parliamentary Proceedings corpus (20+ languages). MT systems use these resources for model development and for evaluation. Large training data is often not available, and researchers rely on other methods, such as pivoting, to build MT systems. While this addresses the question of training, there is still a need to tune and evaluate. In the case of Arabic, most MT research and MT evaluation resources focus on translation from Arabic into English, with few additional resources pairing Arabic with a half dozen languages. This paper presents the effort to create a dataset, which we dub Arab-Acquis, to support the development and evaluation of machine translation systems from Arabic to the languages of the European Union and vice versa. Our approach is simply to exploit the existence of the JRC-Acquis corpus (Steinberger et al., 2006), which has 22 languages in parallel, and translate a portion of it to Standard Arabic. We include two translations in Arabic for each sentence in the set to support robust multi-reference evaluation metrics. This provides us with the largest (and first of its kind) set of multilingual translations for Standard Arabic to date. It allows us to evaluate the quality of translating into Arabic from a set of 22 languages, most of which have no large high-quality datasets paired with Arabic.

Related Work
In the context of MT research in general, multilingual resources (or parallel corpora) are central. Some of these resources exist naturally, such as the United Nations corpus (Arabic, Chinese, English, French, Russian and Spanish) (Rafalovitch et al., 2009), the Canadian Hansards (French and English) (Simard et al., 1993), the European Parliament proceedings, EUROPARL (21 languages in its latest release) (Koehn, 2005), and the JRC-Acquis (22 languages) (Steinberger et al., 2006). Translations may also be commissioned to support MT research, as in the creation of an Arabic dialect to English translation corpus using crowdsourcing (Zbib et al., 2012). Such resources are necessary for both the development and the evaluation of MT systems. While training MT systems typically requires large collections on the order of millions of words, the automatic evaluation of MT requires less data; however, evaluation data is expected to have more than one human reference, since there are many ways to translate from one language to another (Papineni et al., 2002). The number of language pairs that are fortunate to have large parallel data is limited. Researchers have explored ways to exploit existing resources by pivoting or bridging on a third language (Utiyama and Isahara, 2007; Habash and Hu, 2009; El Kholy et al., 2013). These techniques have shown promise but can obviously only be pursued for languages with parallel evaluation datasets, which are not common. In some cases, researchers translated commonly used test sets to other languages to enrich the parallelism of the data; e.g., Cettolo et al. (2011), while working on Arabic-Italian MT, translated a NIST MT eval dataset (Arabic to four English references) to French and Italian. For Arabic MT, the past 10 years have witnessed a lot of interest in translating from Arabic to English, mostly due to large DARPA programs such as GALE and BOLT (Olive et al., 2011).
By comparison, there have been only limited efforts on translating into Arabic from English (Hamon and Choukri, 2011; Al-Haj and Lavie, 2012; El Kholy and Habash, 2012), and between Arabic and other languages (Boudabous et al., 2013; Habash and Hu, 2009; Shilon et al., 2012; Cettolo et al., 2011). The JRC-Acquis collection, of which we translate a portion, is publicly available for research purposes and already exists in 22 languages (with others in progress). As such, the Arab-Acquis dataset will open a pathway for researchers to work on MT from a large number of languages into Arabic and vice versa, covering pairs that have not been researched before. The dataset enables us to compare translation quality from different languages into Arabic without data variation. In this paper, we also present some initial benchmarking results using sentence pivoting techniques between all JRC-Acquis languages and Arabic.

Approach and Development of Arab-Acquis
We discuss next the design choices and the process we followed to create Arab-Acquis.

Desiderata
As part of the process of creating the Arab-Acquis translation dataset, we considered the following desiderata:
• The dataset should have a large number of translations to maximize the parallelism.
• The original text should not have any restrictive copyrights.
• It is more desirable to extend datasets and data splits that are already used in the field.
• The dataset must be large enough to accommodate decent sized sets for tuning, development, and one or two testing versions.
• Each sentence is translated at least twice, by different translators from different languages.
• It is preferable to use professional translators with quality checks than to use crowdsourcing with lower quality translations.

Why JRC-Acquis?
Keeping these desiderata in mind, we decided to use the JRC-Acquis dataset (Steinberger et al., 2006) as the base to select translations from. JRC-Acquis is the JRC (Joint Research Centre) Collection of the Acquis Communautaire, which is the body of common rights and obligations binding all the Member States together within the European Union (EU). By definition, translations of this document collection are therefore available in all official EU languages (Steinberger et al., 2006). The corpus version we use contains texts in 22 official EU languages (see Table 2). The JRC-Acquis corpus text is mostly legal in nature, but since the law and agreements cover most domains of life, the corpus contains vocabulary from a wide range of subjects, e.g., human and veterinary medicine, the environment, agriculture, commerce, transport, energy, and science. The JRC-Acquis is also a publicly available dataset that has been heavily used as part of international translation research efforts and shared tasks. It has considerable momentum from the body of prior work built on it. We follow the data split guidelines used by  and only translate portions that are intended for tuning, development and testing. These portions sum to about 12,000 sentences in total. All mentions of JRC-Acquis in the rest of this document will refer to the portion selected for translation into Arab-Acquis and not the whole JRC-Acquis corpus.

Translating the JRC-Acquis
For each sentence in JRC-Acquis, we created two Arabic references starting with English in one and French in the other. The choice of these two languages is solely reflective of their prominence in the Arab World. The two languages also have different structures and features that seed differences in wording, which is desirable for such a dataset.
We commissioned three separate companies (one each from Egypt, Lebanon, and Jordan) to translate the JRC-Acquis corpus into Arabic from both English and French. On average, the translation from English cost USD 0.056 per word (for 327,466 words), and the translation from French cost USD 0.073 per word (for 340,739 words). In total, the translation cost just over USD 43,200. The files were distributed so that none of the companies would get the same file in both English and French. This ensured that the two translations of each file were produced by different companies. The companies took 44 to 90 days to translate the files (65 working days on average).
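The total cost quoted above can be checked from the per-word rates and word counts. The following is a small sketch of that arithmetic; because the rates are reported as rounded averages, the result is only approximate, and the function and variable names are illustrative.

```python
# Per-word rates and source word counts as quoted in the text.
RATE_EN = 0.056      # USD per English source word (average)
WORDS_EN = 327_466
RATE_FR = 0.073      # USD per French source word (average)
WORDS_FR = 340_739


def translation_cost(rate_per_word: float, words: int) -> float:
    """Cost of translating `words` source words at `rate_per_word` USD each."""
    return rate_per_word * words


cost_en = translation_cost(RATE_EN, WORDS_EN)
cost_fr = translation_cost(RATE_FR, WORDS_FR)
total = cost_en + cost_fr

print(f"English: {cost_en:,.2f}  French: {cost_fr:,.2f}  Total: {total:,.2f}")
```

The sum comes to roughly USD 43,212, which agrees with the reported figure of just over USD 43,200.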
We instructed the translation companies to maintain the original line formatting. We also stressed that the translations should be written in the Arabic that the translators found most natural and fluent. We regularly checked the translations we received from the companies for both translation quality and formatting.

Table 1: Arab-Acquis data set sizes, and the sizes of the corresponding sentences (4,108 sentences for Dev, 4,107 for the rest) in JRC-Acquis.

Arab-Acquis Dataset
In Table 1, we present the final dataset sizes for Arab-Acquis and the respective dataset sizes from the JRC-Acquis English and French portions used to translate it. In total, we created 687,344 translated words.

Translation Analysis
When analyzing the differences in the translations from the English and French sources, we noticed that most variations fall into two categories:
Source Language Bias: Since different languages have different styles of writing, these differences are reflected in translations from different language sources (Volansky et al., 2015).
Valid Alternatives: Arabic is a lexically and morphologically rich language; as such, statements can be expressed in different valid styles and sentence structures, and using alternative wordings that still convey the same meaning. An example of such alternatives is the use of yly and yÂty, which are both valid translations for the word 'following.'
We consider these differences to be features that make the corpus more suitable for evaluating MT systems, since they provide more options for expressing the same concept.

Machine Translation Results
In this section we present the first results ever reported on benchmarking MT between Arabic and 22 European languages in both directions using the same datasets and conditions.

JRC-Acquis MT Systems
We built 21 MT systems for translating from English to X and 21 MT systems for translating from X to English, for X being all of the JRC-Acquis languages, other than English. We built these MT systems using the full JRC-Acquis corpus following the same data splits for training, tuning, and development used by , who reported their work on developing 462 machine translation systems based on the 22 languages of the JRC-Acquis corpus. Their paper included both direct and pivoting-based systems on multiple languages. We replicated the MT systems in (Koehn and Haddow, 2009), in an effort to pivot from/to Arabic through English. We present the MT results for the European languages with English in Table 2. Our results almost match those at . Any minor differences in the scores are mainly attributed to the various upgrades in the toolkits used and tuning variations.
We used the Moses toolkit (Koehn et al., 2007) with default parameters to develop the systems, along with the extra settings used in the original paper, including limiting the training sentence length to 80 words and the tuning sentences to between 8 and 60 words. We used a 5-gram language model. For system evaluation, we also use the BLEU score (Papineni et al., 2002), computed with the Moses scripts. To match the settings used in Koehn's paper, we use the case-insensitive evaluation feature of BLEU. We used these settings across all experiments, unless explicitly specified.
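The length limits described above can be illustrated with a short sketch. This is not the Moses preprocessing script; it is a minimal stand-in that assumes whitespace tokenization and assumes that the limits apply to both sides of each sentence pair. The function name and parameters are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

SentencePair = Tuple[str, str]


def filter_by_length(
    pairs: Iterable[SentencePair], min_len: int = 1, max_len: int = 80
) -> Iterator[SentencePair]:
    """Yield only the pairs whose source AND target lengths are in range.

    Length is counted in whitespace-separated tokens (an assumption).
    """
    for source, target in pairs:
        n_source = len(source.split())
        n_target = len(target.split())
        if min_len <= n_source <= max_len and min_len <= n_target <= max_len:
            yield source, target


# Toy data: token counts are what matter here, not the content.
toy_pairs = [
    ("w " * 5, "w " * 6),      # short pair
    ("w " * 20, "w " * 25),    # medium pair
    ("w " * 90, "w " * 85),    # too long for training
]

training = list(filter_by_length(toy_pairs, max_len=80))
tuning = list(filter_by_length(toy_pairs, min_len=8, max_len=60))
print(len(training), len(tuning))
```

With the toy data, the training filter keeps the first two pairs and the tuning filter keeps only the medium-length one.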

Arabic-English Systems
We used the Arabic-English parallel component of the UN Corpus to train the Ar-En systems. The UN Corpus has a parliamentary-style discourse close to that of JRC-Acquis, which should reduce the divergence from the rest of the JRC-Acquis MT systems. We used about 9 million lines for the Arabic and English language models (about 286 million words), 2.4 million parallel lines for training (about 62 million words), and 2,000 lines for tuning. We tokenized the Arabic content using the MADAMIRA toolkit (Pasha et al., 2014) with the Alif/Ya normalized ATB scheme (Habash, 2010), and applied rule-based detokenization (El Kholy and Habash, 2010) to the resulting translations. The English content was tokenized using the English tokenizer available in Moses. For the translations into Arabic, we used the Arabic translations of the Arab-Acquis Dev files from the English and French sources as two references for BLEU evaluation. For systems translating from Arabic to English, we used only the Arab-Acquis Arabic translation from the English sources for our tuning.
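To show why two references matter for BLEU, the sketch below implements the clipped (modified) n-gram precision at the core of the metric (Papineni et al., 2002). It is not the Moses scoring script and omits the brevity penalty and the combination over n-gram orders; the function names and the toy sentences are invented for illustration.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of order `n` in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: str, references: List[str], n: int = 1) -> Fraction:
    """Clipped n-gram precision of `candidate` against one or more references.

    Each candidate n-gram count is clipped to the maximum number of times
    that n-gram occurs in any single reference.
    """
    cand = ngram_counts(candidate.split(), n)
    if not cand:
        return Fraction(0)
    max_ref: Counter = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return Fraction(clipped, sum(cand.values()))


# Toy example: each reference uses a different, equally valid wording.
candidate = "the following measures apply"
ref_a = "the following measures apply"
ref_b = "the measures below apply"

one_ref = modified_precision(candidate, [ref_b])
two_refs = modified_precision(candidate, [ref_a, ref_b])
print(one_ref, two_refs)
```

In this toy case the candidate scores 3/4 against the second reference alone but 1 when both references are available, which is the effect that multiple references are meant to capture.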
We compared the performance on an in-domain dataset from the UN Corpus with the performance on the Arabic-English dataset from Arab-Acquis. The in-domain results were 43.09 and 39.29 for Ar-En and En-Ar, respectively, whereas the out-of-domain results were 28.76 and 27.83. As expected, the performance on in-domain data is much better than on out-of-domain data. The out-of-domain results reflect the systems used in the pivoting.

Pivoting through English
We used the English part of the shared Arab-Acquis content for pivoting from Arabic into the remainder of the JRC-Acquis languages. This approach can be used to test and validate further pivoting research involving Arabic, with diverse target/input languages. Instead of building MT systems for a given language with Arabic, pivoting can be used as a viable option in many scenarios. We used simple chaining of the source-pivot system and the pivot-target system when translating from/to Arabic and the various JRC-Acquis languages, where the pivot language was always English. We leave exploring more sophisticated pivoting techniques (Utiyama and Isahara, 2007;Habash and Hu, 2009;El Kholy et al., 2013) and newer neural machine translation techniques (Johnson et al., 2016) to future work. The results are presented in Table 2. Table 2 specifies for each language X four BLEU scores for translation from and to English (En→X and X→En), and from and to Arabic via English pivoting (Ar→En→X and X→En→Ar).
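The simple chaining described in this section amounts to function composition: the output of the source-pivot system is fed as input to the pivot-target system. The sketch below illustrates this with toy lookup-table translators standing in for the actual Moses systems; all names and the placeholder tokens are hypothetical.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_lookup_translator(table: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; unknown tokens are passed through."""
    def translate(sentence: str) -> str:
        return " ".join(table.get(token, token) for token in sentence.split())
    return translate


def chain(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Build a source-to-target translator by pivoting through a third language."""
    def translate(sentence: str) -> str:
        pivot_sentence = source_to_pivot(sentence)
        return pivot_to_target(pivot_sentence)
    return translate


# Placeholder vocabularies: A* = source tokens, E* = pivot tokens, F* = target tokens.
source_to_english = make_lookup_translator({"A1": "E1", "A2": "E2"})
english_to_target = make_lookup_translator({"E1": "F1", "E2": "F2"})

source_to_target = chain(source_to_english, english_to_target)
print(source_to_target("A1 A2"))
```

One consequence of this design is that errors made by the first system are passed on unchanged to the second, which is one motivation for the more sophisticated pivoting techniques cited above.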

Discussion
Direct English MT: Our En→X and X→En results are generally comparable to those reported by . The highest BLEU score in the En→X direction is for French, and the worst BLEU score is for Hungarian. The highest BLEU score in the X→En direction is for Maltese, and the worst BLEU score is for Hungarian again. This high BLEU score for Maltese is rather surprising, but consistent with . Although Maltese is a Semitic language, it has a strong Italian (Romance) component, and English is an official language of the nation of Malta. Also, while Maltese is morphologically rich, its writing system makes heavy use of hyphens (e.g., il-kondizzjonijiet 'the-conditions'), which allows for easy morphological tokenization with the simple whitespace and punctuation tokenization technique used in Moses.
Pivoting through English: The BLEU scores for Ar→X and X→Ar via English pivot are, to our knowledge, the first large-scale benchmark on a publicly available dataset comparing machine translation from/to Arabic across a large number of languages under identical settings. Not surprisingly, the correlation between the performance of the direct-with-English and pivot-via-English systems is very high: X→En and X→En→Ar correlate at r = 0.97, and En→X and Ar→En→X correlate at r = 0.93. As such, the highest BLEU score in the Ar→En→X direction is for French again, but the worst BLEU score is for Estonian (a relative of Hungarian from the Finno-Ugric family). The highest BLEU score in the X→En→Ar direction is for Maltese again.
Correlations: Birch et al. (2008) demonstrated that it is possible to predict MT performance using a number of factors: the amount of reordering, the morphological complexity of the target language, and the historical relatedness of the two languages. These factors accounted for 75% of the variability in the performance of the system.
Our results are consistent with their claims, not only for the direct models, which are similar to the models they used, but also for those pivoting through English to Arabic. In particular, we find that the number of words per sentence in X correlates with En→X and Ar→En→X BLEU at r = 0.82 and r = 0.91, respectively. (The number of words per sentence correlates highly with other measures of morphological complexity such as type-to-token ratio, r = −0.96. The intuition here is that a language that uses fewer words to capture the same sentence meaning is more complex morphologically; e.g., while the average English sentence length is 27 in our corpus, Arabic's is 22 and Finnish's is 18.)
However, the number of words per sentence does not correlate well when X is the source language: it correlates with X→En and X→En→Ar BLEU at r = 0.48 and r = 0.56, respectively. Instead, we observe that the BLEU scores within each family generally tend to cluster within a small range. Indeed, if we rank the language families in the order shown in Table 2 from 1 to 7, the correlations between this rank and the X→En BLEU and X→En→Ar BLEU are r = 0.90 and r = 0.93, respectively, while the correlation in the reverse direction does not hold strongly: En→X BLEU and Ar→En→X BLEU correlate with language family rank at r = 0.75 and r = 0.64, respectively.
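The correlation values reported in this discussion can be computed with a standard correlation coefficient. The sketch below assumes Pearson's r, which the text does not state explicitly, and uses made-up numbers that are not taken from Table 2; the function name is illustrative.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Invented values for illustration only (NOT results from the paper):
# average words per sentence for five hypothetical languages, and a
# hypothetical score for each.
words_per_sentence = [18.0, 20.0, 22.0, 25.0, 27.0]
hypothetical_scores = [30.0, 36.0, 39.0, 47.0, 52.0]

r = pearson_r(words_per_sentence, hypothetical_scores)
print(f"r = {r:.2f}")
```

In the setting of this section, one list would hold a per-language property (such as words per sentence or language family rank) and the other the corresponding BLEU scores.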

Conclusions and Future Work
We have presented Arab-Acquis, a large professionally translated and publicly available dataset for MT evaluation between 22 European languages and Arabic. We also presented first benchmarking results on translating to and from Arabic for 22 European languages using this dataset.
In the future, we plan to maximize the use of this dataset by using it to improve MT between all of the 22 languages and Arabic in both directions. We also plan to host a shared task on MT evaluation using parts of Arab-Acquis.