Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation

Recent progress in neural machine translation (NMT) has made it possible to translate successfully between monolingual language pairs where large parallel data exist, with pre-trained models improving performance even further. Although there exists work on translating in code-mixed settings (where one of the pairs includes text from two or more languages), it is still unclear what recent success in NMT and language modeling exactly means for translating code-mixed text. We investigate one such context, namely MT from code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English. We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs). We are able to achieve reasonable performance using only MSA-EN parallel data with S2S models trained from scratch. We also find LMs fine-tuned on data from various Arabic dialects to help the MSAEA-EN task. Our work is in the context of the Shared Task on Machine Translation in Code-Switching. Our best model achieves 25.72 BLEU, placing us first on the official shared task evaluation for MSAEA-EN.


Introduction
Recent years have witnessed fast progress in various areas of natural language processing (NLP), including machine translation (MT), where neural approaches have helped boost performance when translating between language pairs with especially large amounts of parallel data. However, tasks involving a need to process data from different languages mixed together remain challenging. This phenomenon of using two or more languages simultaneously in speech or text is referred to as code-mixing (Gumperz, 1982) and is prevalent in multilingual societies (Sitaram et al., 2019). Code-mixing is challenging since the space of possibilities when processing mixed data is vast, but also because there are not usually sufficient code-mixed resources to train models on. Nor is it clear how much code-mixing existing language models may have seen during pre-training, and so the ability of these language models to transfer knowledge to downstream code-mixing tasks remains largely unexplored.

Table 1: Examples of MSAEA sentences with their English human translations, GMT output, and output from our from-scratch sequence-to-sequence Transformer (S2ST). The Arabic source sentences are not reproduced here.

Example (1)
Human: I want hard work, guys.
GMT: I want a rigid job, Jadaan.
S2ST: I want a solid job, jadan.

Example (2)
Human: The doctors said I can't walk normally again.
GMT: The doctors said that I was not a normal marginal again.
S2ST: Doctors said I wasn't a natural marginality again.
In this work, we investigate translation under a code-mixing scenario where sequences on the source side are a combination of two varieties of the collection of languages referred to as Arabic. More specifically, we take as our objective translating from Modern Standard Arabic (MSA) mixed with Egyptian Arabic (EA) (source; collectively abbreviated here as MSAEA) into English (target). Table 1 shows two examples of MSAEA sentences and their human and machine translations. We highlight problematic translations caused by the mixing of Egyptian Arabic with MSA. Through work related to the shared task, we target the following three main research questions: 1. How do models trained from scratch on purely MSA data fare on the code-mixed MSAEA data (i.e., the zero-shot EA setting)?
2. How do existing language models perform under the code-mixed condition (i.e., MSAEA)?
3. What impact, if any, does exploiting dialectal Arabic (DA) data (i.e., from a range of dialects) have on the MSAEA code-mixed MT context?
Our contributions in this work lie primarily in answering these three questions. We also develop powerful models for translating from MSAEA to English. The rest of the paper is organized as follows: Section 2 discusses related work. The shared task is described in Section 3. Section 4 describes the external parallel data we exploit to build our models. Section 5 presents the proposed MT models. Section 6 presents our experiments and different settings. We provide evaluation on Dev data in Section 7 and official results in Section 8. We conclude in Section 9.

Related Work
A thread of research on code-mixed MT focuses on automatically generating synthetic code-mixed data to improve the downstream task. This includes attempts to generate linguistically-motivated sequences (Pratapa et al., 2018). Some work leverages sequence-to-sequence (S2S) models (Winata et al., 2019) to generate code-mixed text by exploiting an external neural MT system, while others (Garg et al., 2018) use a recurrent neural network along with data generated by a sequence generative adversarial network (SeqGAN) and grammatical information, such as output from a part-of-speech tagger, to generate code-mixed sequences. These methods have dependencies and can be costly to scale beyond one language pair.
Arabic MT. For Arabic, some work has focused on translating between MSA and Arabic dialects. For instance, Zbib et al. (2012) studied the impact of combining dialectal and MSA data on dialect/MSA-to-English MT performance. Sajjad et al. (2013) use MSA as a pivot language for translating Arabic dialects into English. Salloum et al. (2014) investigate the effect of sentence-level dialect identification and several linguistic features on MSA/dialect-to-English translation. Guellil et al. (2017) propose a neural machine translation (NMT) system for Arabic dialects that uses a vanilla recurrent neural network (RNN) encoder-decoder model to translate Algerian Arabic, written in a mixture of Arabizi and Arabic characters, into MSA. Baniata et al. (2018) present an NMT system to translate Levantine (Jordanian, Syrian, and Palestinian) and Maghrebi (Algerian, Moroccan, and Tunisian) into MSA, and MSA into English. Farhan et al. (2020) propose unsupervised dialectal NMT, where the source dialect is not represented in the training data. This last problem is referred to as zero-shot MT (Lample et al., 2018).

Code-Switching Shared Task
The goal of the shared task on machine translation in code-switching settings is to encourage building MT systems that translate a source sentence into a target sentence where one side contains an alternation between two languages (i.e., code-switching). We note that, in the current paper, we employ the wider term code-mixing. The shared task involves two subtasks: 1. Supervised MT. For supervised MT, gold data are provided to participants for training and evaluating models that take English as input and generate Hinglish sequences.
2. Unsupervised MT. In this subtask, the goal is to develop systems that can generate high-quality translations for multiple language combinations. These combinations include Spanish-English to English or Spanish, English to Spanish-English, and Modern Standard Arabic-Egyptian Arabic (MSAEA) to English and vice versa. For each pair, only test data are provided to participants, with no reference translations.
In the current work, we focus on the unsupervised MT subtask only. More specifically, we build models exclusively for MSAEA to English. Our approach exploits external data to train a variety of models. We now describe these external datasets.

MSA-English Data
In order to develop Arabic MT models that can translate effectively across different text domains, we make use of a large collection of parallel sentences extracted from the Open Parallel Corpus (OPUS) (Tiedemann, 2012). OPUS contains more than 2.7 billion parallel sentences in 90 languages.
To train our models, we extract ∼61M MSA-English parallel sentences from the whole collection. Since OPUS can contain noise and duplicate data, we clean this collection and remove duplicates before we use it. We now describe our quality assurance method for cleaning and deduplicating the data. Data Quality Assurance. To keep only high-quality parallel sentences, we follow two steps (a code sketch of both steps is given after this list): 1. We run a cross-lingual semantic similarity model (Yang et al., 2019) on each pair of sentences, keeping only pairs with a bilingual similarity score between 0.30 and 0.99. This allows us to filter out sentence pairs whose source and target are identical (i.e., similarity score = 1) and those that are not good translations of one another (i.e., those with a cross-lingual semantic similarity score < 0.3).
2. Observing some English sentences in the source data, we perform an analysis based on sub-string matching between source and target, using the word-trigram sliding-window method proposed by Barrón-Cedeño and Rosso (2009) and used in Abdul-Mageed et al. (2021) to de-duplicate data splits. In other words, we compare each sentence on the source side (i.e., MSA) to the target sentence (i.e., English). We then inspect all pairs of sentences that match above a given threshold, considering thresholds between 90% and 30%. We find that a threshold of > 75% safely guarantees completely distinct source and target pairs.
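The following is a minimal sketch of the two quality-assurance steps above. The cross-lingual similarity model of Yang et al. (2019) is replaced here by LaBSE from the sentence-transformers library purely for illustration, and the function names are our own; the thresholds mirror the values reported in the text.

```python
# Illustrative sketch of the two-step quality assurance; LaBSE stands in for the
# cross-lingual similarity model of Yang et al. (2019) for illustration only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # stand-in encoder

def similarity_ok(src: str, tgt: str, lo: float = 0.30, hi: float = 0.99) -> bool:
    """Step 1: keep pairs whose cross-lingual similarity falls inside [lo, hi]."""
    emb = encoder.encode([src, tgt], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return lo <= score <= hi

def trigram_overlap(src: str, tgt: str) -> float:
    """Step 2: word-trigram sliding-window overlap between source and target sides."""
    def trigrams(text: str):
        toks = text.lower().split()
        return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}
    src_tri, tgt_tri = trigrams(src), trigrams(tgt)
    if not src_tri or not tgt_tri:
        return 0.0
    return len(src_tri & tgt_tri) / min(len(src_tri), len(tgt_tri))

def keep_pair(src: str, tgt: str, overlap_threshold: float = 0.75) -> bool:
    """Retain a pair only if it passes the similarity filter and is not a near-duplicate."""
    return similarity_ok(src, tgt) and trigram_overlap(src, tgt) <= overlap_threshold
```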
More details about the MSA-English OPUS dataset before and after our quality assurance, including deduplication, are provided in Table 2.

Dialectal Arabic-English Data
Several recent works show that MT models trained on one dialect can be used to improve models targeting other dialects (Farhan et al., 2020; Sajjad et al., 2020b). MADAR Corpus. The MADAR corpus (Bouamor et al., 2018) provides multi-dialect parallel data; its authors translate 10k additional sentences for five selected cities: Beirut, Cairo, Doha, Tunis, and Rabat. The MADAR dataset also has region-level categorization (i.e., Gulf, Levantine, Nile, and Maghrebi). In our work, we use only the Gulf, Levantine, and Nile (Egyptian) dialects, and exclude Maghrebi. Qatari-English Speech Corpus. This parallel corpus comprises 14.7k Qatari-English sentences collected by Elmahdy et al. (2014) from talk-show programs and Qatari TV series. More details about all our parallel dialectal-English datasets are in Table 3.

Data Splits and Pre-Processing
Data Splits. For our experiments, we split the MSA and DA data as follows. MSA. We randomly pick 10k sentences for validation (MSA-Dev) from the MSA parallel data (see Section 4.1) after cleaning, and we use the rest of this data (∼55.14M sentences) for training (MSA-Train). DA. For validation (DA-Dev), we randomly pick 6k sentences from the 38k Egyptian-English sentences provided by Zbib et al. (2012). We then use the rest of the data (i.e., ∼250.7k sentences) for training (DA-Train). Pre-Processing. Pre-processing is an important step for building any MT model, as it can significantly affect end results (Oudah et al., 2019). For all our models, we only perform light pre-processing in order to retain a faithful representation of the original (naturally occurring) text. We remove diacritics and replace URLs, user mentions, and hashtags with the generic string tokens URL, USER, and HASHTAG, respectively (a minimal sketch is given below). Our second pre-processing step is specific to each type of model we train, as we explain in the respective sections.
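A minimal sketch of this light pre-processing step follows; the diacritic character range and the regular expressions are our own illustrative choices rather than the exact patterns used in our pipeline.

```python
import re

# Arabic diacritics (tashkeel) plus the superscript alef; an illustrative range.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
URL = re.compile(r"(?:https?://|www\.)\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def light_preprocess(text: str) -> str:
    """Remove diacritics and replace URLs, mentions, and hashtags with generic tokens."""
    text = DIACRITICS.sub("", text)
    text = URL.sub("URL", text)
    text = MENTION.sub("USER", text)
    text = HASHTAG.sub("HASHTAG", text)
    return text
```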

From-Scratch Seq2Seq Models
We train our models on the MSA-English parallel data described in Section 4.1 (i.e., MSA-Train) with a Transformer (Vaswani et al., 2017) model as implemented in Fairseq (Ott et al., 2019). We use 6 blocks for each of the encoder and decoder. We use a learning rate of 0.25, a dropout of 0.3, and a batch size of 4,000 tokens. For the optimizer, we use Adam (Kingma and Ba, 2014) with beta coefficients of 0.9 and 0.99, which control the exponential decay rates of the running averages, and a weight decay of 1e-4. We also apply an inverse square-root learning rate scheduler with a value of 5e-4 and 4,000 warmup updates. For the loss function, we use label-smoothed cross-entropy with a smoothing strength of 0.1. We run the Moses tokenizer (Koehn et al., 2007) on our input before passing data to the model. For vocabulary, we use a joint Byte-Pair Encoding (BPE) (Sennrich et al., 2015) vocabulary with 64K split operations for subword segmentation (a sketch of this input pipeline is given below).
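Below is a rough sketch of that input pipeline using the sacremoses and subword-nmt Python packages as stand-ins for the original Moses and BPE tooling; the file names are placeholders, and the subsequent fairseq-train invocation with the hyper-parameters listed above is omitted.

```python
# Illustrative input pipeline for the from-scratch S2ST model: Moses tokenization
# followed by learning a joint source-target BPE vocabulary with 64K operations.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

ar_tok = MosesTokenizer(lang="ar")
en_tok = MosesTokenizer(lang="en")

# 1) Tokenize both sides and pool them into one file for joint BPE learning.
with open("train.ar") as fa, open("train.en") as fe, open("train.tok.joint", "w") as out:
    for ar_line, en_line in zip(fa, fe):
        out.write(ar_tok.tokenize(ar_line.strip(), return_str=True) + "\n")
        out.write(en_tok.tokenize(en_line.strip(), return_str=True) + "\n")

# 2) Learn a joint BPE model with 64K operations.
with open("train.tok.joint") as inp, open("bpe.codes", "w") as codes_out:
    learn_bpe(inp, codes_out, num_symbols=64000)

# 3) Apply the segmentation to each sentence before binarizing the data for Fairseq.
with open("bpe.codes") as codes_in:
    bpe = BPE(codes_in)
print(bpe.process_line(en_tok.tokenize("I want a solid job.", return_str=True)))
```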

Pre-Trained Seq2Seq Language Models
We also fine-tune two state-of-the-art pre-trained multilingual generative models, mT5 (Xue et al., 2020) and mBART (Liu et al., 2020), on DA-Train for 100 epochs. We use early stopping during fine-tuning and identify the best model on DA-Dev. We use the HuggingFace (Wolf et al., 2020) implementation of each of these models, with the default settings for all hyper-parameters. A rough sketch of this fine-tuning setup is given below, after Table 4.

Table 4: MSA-EA sentences with their English translations using our models. S2ST: sequence-to-sequence Transformer model trained from scratch. Data samples are extracted from the shared task Test data. In the original table, good translations are highlighted in green and problematic ones in red; the Arabic source sentences are not reproduced here.

Example (1)
S2ST: we don't know for sure and the girls don't know finn .
mT5: we can't make sure and we don't know where the girls are
mBART: we don't know where to make sure and we don't know where the girls are

Example (2)
S2ST: i want to know the brothers' official position on harassment of liberals and nejad al-barai, even the thugs, countries that are not followed by the president are using his authority and ordering their immediate arrest.
mT5: i want to know the situation of the official brothers from harassment of the silky and najad albarea and if these pants are not their president the president uses his power and order to arrest them immediately
mBART: i want to know the position of the official brothers from harassment in the army and najad al-bara'y, even if these are not theirs, the president should use his authority and order to arrest them immediately

Example (3) (the source contains a user mention)
S2ST: user: there is a need for a lawyer to help the section, jadan sobhi and walid televonas closed .
mT5: user: we want a lawyer to go with us to the section , guys , sobhe and waleed their telephones are closed
mBART: « user : we want a lawyer to go with us to the section, oh good morning, and their telephones are closed. »

Example (4)
S2ST: they hold hearings in places where there are no courts, and what thrives on god's creation will enter without permission, because the accused will not prevent them and judge my absence!
mT5: they have sessions in places that are not courts, and god doesn't allow people to enter without a permit, so that when they come and prevent them and rule me absence
mBART: they hold meetings in places where there is no courts, and god doesn't allow people to enter without a permit, so that when the accused come they stop them and rule them
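As a rough illustration of this setup (not our exact script), the sketch below fine-tunes mT5 with the HuggingFace Seq2SeqTrainer on DA-Train with early stopping on DA-Dev; the checkpoint name, file names, column names, sequence length, and patience value are assumptions, and an mBART checkpoint is handled analogously.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer, EarlyStoppingCallback)

model_name = "google/mt5-base"  # illustrative checkpoint; mBART works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# DA-Train / DA-Dev as CSV files with "source" (DA/MSAEA) and "target" (English) columns.
raw = load_dataset("csv", data_files={"train": "da_train.csv", "validation": "da_dev.csv"})

def tokenize_fn(batch):
    enc = tokenizer(batch["source"], truncation=True, max_length=128)
    with tokenizer.as_target_tokenizer():
        enc["labels"] = tokenizer(batch["target"], truncation=True, max_length=128)["input_ids"]
    return enc

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="msaea_en_lm",
    num_train_epochs=100,                # upper bound; early stopping selects the best epoch
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the best checkpoint on DA-Dev
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # patience is an assumption
)
trainer.train()
```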

Experiments and Settings
In this section, we describe the different ways we fine-tune and evaluate our models.

Zero-Shot Setting
First, we use the S2ST model trained exclusively on MSA-English data to translate the MSAEA code-mixed data. While we can refer to this setting as zero-shot, we note that it is not truly zero-shot in the strict sense of the word, due to the code-mixed nature of the data (i.e., the data contain a mixture of MSA and EA). Hence, we will refer to this setting as zero-shot EA.

Fine-Tuning Setting
Second, we further fine-tune the three models (i.e., S2ST, mT5, and mBART) on the DA data described in Section 4.2. While the downstream shared task data only involves EA mixed with MSA, we follow Farhan et al. (2020) and Sajjad et al. (2020b) in fine-tuning on different dialects when targeting a single downstream dialect (EA in our case). We will simply refer to this second setting as Fine-Tuned DA.

Evaluation on Dev Data
We report results of all our models under different settings in BLEU scores (Papineni et al., 2002).
In addition to evaluation on uncased data, we run a language-modeling-based truecaser (Lita et al., 2003) on the outputs of our different models. Results presented in Table 5 show that S2ST achieves relatively low scores (between 8.54 and 12.57) across all settings. In comparison, both mBART and mT5 fine-tuned on DA-Train are able to translate MSAEA into English with BLEU scores of 23.80 and 24.70, respectively. We note that truecasing the output improves results by an average of +2.55 BLEU points. Discussion. We inspect output translations from our models on Test data. We observe that even though S2ST performs better than the two language models on Test data, both language models are especially able to translate Egyptian Arabic tokens well, as in example (1) in Table 4. Again, Test data contain more MSA than DA, as we explained earlier, and hence the S2ST model (which is trained on 55M sentence pairs) outperforms each of the two language models. This analysis suggests that fine-tuning the language models on more MSA-English data should result in better performance.
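For reference, corpus-level BLEU scores of the kind reported here can be computed with the sacrebleu package as sketched below; the file names are placeholders, and this is not necessarily the exact scoring setup used by the shared task organizers.

```python
import sacrebleu

# Hypotheses and references, one sentence per line, in corresponding order.
with open("system_output.en") as h, open("reference.en") as r:
    hyps = [line.strip() for line in h]
    refs = [line.strip() for line in r]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # a single reference set
print(f"BLEU = {bleu.score:.2f}")
```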

Official Shared Task (Test) Results
Returning to our three main research questions, we can reach a number of conclusions. For RQ1, we observe that models trained from scratch on purely MSA data fare reasonably well on the code-mixed MSAEA data (i.e., the zero-shot EA setting). This is due to lexical overlap between MSA and EA. For RQ2, we also note that language models such as mT5 and mBART do well under the code-mixed condition, more so than models trained from scratch when inference data involve more EA. This is the case even though these language models in our experiments are fine-tuned with significantly less data (i.e., ∼250K pairs) than the from-scratch S2ST models (which are trained on 55M MSA + 250K DA pairs). For RQ3, our results show that training on data from various Arabic dialects helps translation in the MSAEA code-mixed condition. This is in line with previous research (Farhan et al., 2020) showing that exploiting data from various dialects can help downstream translation on a single dialect in the zero-shot setting.

Conclusion
We described our contribution to the shared task on MT in code-switching. Our models target the MSAEA-to-English task under the unsupervised condition. Our experiments show that training models on MSA data is useful for the MSAEA-to-English task in the zero-shot EA setting. We also show the utility of pre-trained language models such as mT5 and mBART on the code-mixing task. Our models place first in the official shared task evaluation. In the future, we intend to apply our methods to other dialects of Arabic and to investigate other methods, such as back-translation, for improving overall performance.