An Empirical Study of Pre-trained Transformers for Arabic Information Extraction

Multilingual pre-trained Transformers, such as mBERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020a), have been shown to enable effective cross-lingual zero-shot transfer. However, their performance on Arabic information extraction (IE) tasks is not well studied. In this paper, we pre-train a customized bilingual BERT, dubbed GigaBERT, that is designed specifically for Arabic NLP and English-to-Arabic zero-shot transfer learning. We study GigaBERT's effectiveness on zero-shot transfer across four IE tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Our best model significantly outperforms mBERT, XLM-RoBERTa, and AraBERT (Antoun et al., 2020) in both the supervised and zero-shot transfer settings. We have made our pre-trained models publicly available at https://github.com/lanwuwei/GigaBERT.


Introduction
Fine-tuning pre-trained Transformer models (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019) has recently achieved state-of-the-art results on a wide range of NLP tasks where supervised training data is available. When trained on multilingual corpora, BERT-based models have demonstrated the ability to learn multilingual representations that support zero-shot cross-lingual transfer learning surprisingly effectively (Wu and Dredze, 2019; Pires et al., 2019; Lample and Conneau, 2019).
Without access to any parallel text or target-language annotations, multilingual BERT (mBERT; Devlin et al., 2019) even supports cross-lingual transfer for language pairs written in different scripts, for example, English-to-Arabic. However, transfer learning performance still lags far behind what is achievable when supervised data is available in the target language. In this paper, we explore to what extent it is possible to improve performance in the zero-shot scenario by building a customized bilingual BERT for English and Arabic, a particularly challenging language pair for cross-lingual transfer learning.
We present GigaBERT, a customized BERT for English-to-Arabic cross-lingual transfer that is trained on newswire text from the Gigaword corpus (Graff et al., 2003; Parker et al., 2009) in addition to Wikipedia and web crawl data. We systematically compare our pre-trained models of different configurations against mBERT (Devlin et al., 2019) and XLM-RoBERTa (XLM-R; Conneau et al., 2020a). By using a customized vocabulary and code-switched data specifically created for English-to-Arabic transfer learning, our GigaBERT outperforms mBERT and XLM-R base (both of which support more than 100 languages) on a range of IE tasks, including named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Further performance gains are demonstrated by augmenting the pre-training corpus with synthetically generated code-switched data, which demonstrates the usefulness of anchor points for zero-shot cross-lingual transfer learning. GigaBERT also performs well when annotated Arabic data is available, outperforming AraBERT (Antoun et al., 2020), the state-of-the-art Arabic-specific BERT model, on various Arabic IE tasks.

GigaBERT
We present five versions of GigaBERT, pre-trained using the Transformer encoder (Vaswani et al., 2017) with the BERT base configuration: 12 layers, each with 12 attention heads and a hidden size of 768, amounting to roughly 110M parameters. Table 1 shows a detailed summary of the training data and model parameters.
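For illustration, the approximate parameter count of the BERT base configuration can be reproduced with a back-of-the-envelope calculation (a sketch only; the exact totals, which depend on vocabulary size, are given in Table 1):

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Rough BERT-base parameter count: embeddings + encoder + pooler."""
    # Token, position, and segment embeddings, plus the embedding LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Self-attention: Q, K, V, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward sublayer: up- and down-projections with biases.
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per layer (gamma and beta vectors each).
    norms = 2 * 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + ffn + norms) + pooler
```

With the original English BERT vocabulary of ∼30k word pieces this yields roughly 110M parameters; a larger bilingual vocabulary increases the embedding portion accordingly.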

Training Data
We pre-train our GigaBERT models using the fifth edition of the English and Arabic Gigaword corpora (https://catalog.ldc.upenn.edu/LDC2011T07 and https://catalog.ldc.upenn.edu/LDC2011T11). The Gigaword data consists of 13 million news articles (flattened with https://github.com/nelson-liu/flatten_gigaword) and matches the domain of many NLP tasks. We split English and Arabic text into sentences, without tokenization, using a modified version of the Stanford CoreNLP toolkit (Manning et al., 2014). We also add Wikipedia data processed by WikiExtractor for better coverage. As the English Wikipedia (2.5B tokens in total) is much larger than the Arabic Wikipedia (0.15B tokens in total), we balance the pre-training data by (1) up-sampling the Arabic data, repeating the Wikipedia portion five times and the Gigaword portion three times; and (2) adding the Arabic section of the Oscar corpus (Ortiz Suárez et al., 2019), a large-scale multilingual dataset filtered from Common Crawl.
Code-Switched Data Augmentation. To further improve cross-lingual transfer capability, we leverage English-Arabic dictionaries to create synthetic code-switched training data (Conneau et al., 2020a). We experimented with three dictionaries: PanLex (Kamholz et al., 2014), MUSE (Conneau et al., 2018), and Wikipedia parallel titles. We extract parallel article titles from Wikipedia based on the inter-language links, and entities based on Wikidata (Jiang et al., 2020). 5 The PanLex, MUSE, and Wikipedia dictionaries contain 24K, 44K, and 2M entries, respectively, with on average 4.6, 1.4, and 1 translations per entry (English or Arabic). For training GigaBERT-v4, we code-switch up to 50% of randomly selected sentences for both English and Arabic, and up to 30% of the tokens in each selected sentence. During the replacement process, we prioritize substitutions based on the Wikipedia titles, then PanLex and MUSE, as long as the proportion of replaced tokens has not reached 30% for a given sentence.
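The replacement procedure can be sketched in Python as follows. This is a simplified illustration of the thresholds and lexicon priorities described above, not the authors' exact implementation; the function names and data structures are ours:

```python
import random

def code_switch(tokens, lexicons, max_token_ratio=0.3):
    """Replace tokens with dictionary translations, trying the lexicons in
    priority order (e.g., Wikipedia titles, then PanLex, then MUSE) until
    at most max_token_ratio of the tokens have been replaced."""
    out = list(tokens)
    budget = int(len(tokens) * max_token_ratio)
    positions = list(range(len(tokens)))
    random.shuffle(positions)  # replace tokens at random positions
    replaced = 0
    for i in positions:
        if replaced >= budget:
            break
        for lexicon in lexicons:  # ordered by priority
            if out[i] in lexicon:
                out[i] = random.choice(lexicon[out[i]])
                replaced += 1
                break
    return out

def augment_corpus(sentences, lexicons, max_sentence_ratio=0.5):
    """Code-switch up to max_sentence_ratio of the sentences at random;
    the rest are kept unchanged."""
    return [code_switch(s, lexicons) if random.random() < max_sentence_ratio
            else list(s) for s in sentences]
```

Because the lexicons are tried in priority order, a word found in the Wikipedia-titles lexicon is never translated via a lower-priority dictionary.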

Vocabulary
The vocabulary size is critical to the performance of pre-trained models, as it directly impacts the subword granularity and the number of parameters. The original English BERT (Devlin et al., 2019) uses a 30k vocabulary for ∼3B tokens of training data, while multilingual BERT and XLM-R have ∼5k and ∼14k Arabic subwords in their vocabularies, respectively (Table 1). 6 We choose a vocabulary size of 50k for our GigaBERT models based on preliminary experiments. For GigaBERT-v0, we use the unigram language model in SentencePiece (Kudo and Richardson, 2018) to create 30k cased English subwords and 20k Arabic subwords separately. 7 For GigaBERT-v1/2/3/4, we do not distinguish Arabic and English subword units; instead, we train a unified 50k vocabulary using WordPiece (Wu et al., 2016). 8 The vocabulary is cased for GigaBERT-v1 and uncased for GigaBERT-v2/3/4, which share the same vocabulary.
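To illustrate how vocabulary size affects subword granularity, the segmentation step of WordPiece can be sketched as a greedy longest-match-first search. This is a minimal sketch, not the Hugging Face tokenizers implementation used for training:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (WordPiece-style).
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub  # longest vocabulary match starting at `start`
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: emit the unknown token
        pieces.append(piece)
        start = end
    return pieces
```

A larger vocabulary tends to cover longer subwords, so frequent words are kept whole rather than split into many short pieces.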

Optimization
We use the official TensorFlow implementation of BERT (Devlin et al., 2019) for pre-training. We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, and an L2 weight decay of 0.01. The learning rate is warmed up over the first 100,000 steps to a peak value of 1e-4, then linearly decayed. Dropout is set to 0.1 for all layers. We use whole-word masking for GigaBERT-v0 and regular subword masking for v1/2/3/4. The batch size is set to 512. GigaBERT-v0/1/2 are trained for 1.2 million steps on Google Cloud TPUs with a max sequence length of 128. GigaBERT-v3 is additionally trained for 140k steps with a max sequence length of 512. The maximum number of masked LM predictions per sequence is set to 20 when the max sequence length is 128 and to 80 when it is 512. GigaBERT-v4 is trained from the GigaBERT-v3 checkpoint for another 140k steps on the code-switched data. We also experiment with different thresholds for the code-switched data augmentation, as well as with training models from scratch on the code-switched data (Appendix A).

6 We check the Unicode range of characters to classify word pieces as English or Arabic.
7 There are 633 word pieces shared by both languages. We add 633 unused symbols (e.g., unused-1, unused-2, etc.) to make up the 50k combined vocabulary.
8 We use Hugging Face's implementation: https://github.com/huggingface/tokenizers
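The warmup-then-linear-decay schedule can be sketched as a simple function of the training step. This is an illustration of the schedule described above; the decay endpoint of 1.2M steps is our assumption based on the v0/1/2 training budget:

```python
def learning_rate(step, peak=1e-4, warmup_steps=100_000, total_steps=1_200_000):
    """Linear warmup from 0 to the peak learning rate over warmup_steps,
    then linear decay back to 0 by total_steps."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

The schedule peaks exactly at the end of warmup and never goes negative, even if training runs past `total_steps`.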
In the ARL fine-tuning experiments, we pair each trigger with its argument mentions as positive instances and with the other entities in the sentence as negative instances. For RE, we use gold relation mentions as positive examples and create negative examples by randomly pairing two entities in a sentence. We perform these tasks following the same fine-tuning pipeline as BERT (Devlin et al., 2019): we feed input sentences into a pre-trained model, extract the necessary hidden representations, i.e., all token representations for NER/POS and argument/entity span representations for ARL/RE, and apply a single linear layer for classification. We evaluate each task in the standard supervised learning setting, as well as in the zero-shot transfer setting from English to Arabic, where the model is trained on annotated English training data and evaluated on the Arabic test set.
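The instance construction for RE can be sketched as follows. This is a minimal illustration of the positive/negative pairing described above; the function name and the `no_relation` label are our assumptions:

```python
import random

def relation_examples(entities, gold_relations, num_negatives=None):
    """Build RE training instances: gold relation mentions are positives;
    random entity pairs from the same sentence that are not in the gold
    set are labeled no_relation as negatives."""
    positives = list(gold_relations)  # (head, tail, label) triples
    gold_pairs = {(h, t) for h, t, _ in gold_relations}
    candidates = [(h, t) for h in entities for t in entities
                  if h != t and (h, t) not in gold_pairs]
    random.shuffle(candidates)
    k = len(positives) if num_negatives is None else num_negatives
    negatives = [(h, t, "no_relation") for h, t in candidates[:k]]
    return positives + negatives
```

The same pattern applies to ARL, with triggers paired against argument mentions (positives) and the remaining entities in the sentence (negatives).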

Implementations
We implement the fine-tuning experiments with the PyTorch framework (Paszke et al., 2019) and choose hyperparameters by grid search. 9 We set the learning rate to 2e-5, the batch size to 8, the max sequence length to 128, and the number of fine-tuning epochs to 7. Exceptions include a learning rate of 1e-4 in the NER experiments, and a max sequence length of 512 and batch size of 4 in the RE experiments. For RE, we also use gradient accumulation to simulate a larger batch size of 32 when using models with the BERT large architecture.

Table 3 shows experimental results for the pre-trained models on both English and Arabic IE tasks. For zero-shot transfer (en → ar), we report two scores on the Arabic test set, where the best checkpoint is selected based on the English dev set and the Arabic dev set, respectively. In summary, we find the key factors in improved pre-training performance to be a large amount of training data in the target language, a customized vocabulary, a longer max sequence length, and more anchor points from code-switched data. We also include experiments with XLM-R large models as a reference, but the comparison focuses on pre-trained models with the BERT base configuration for fairness.
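Gradient accumulation can be sketched with a toy scalar version of the training loop (the actual fine-tuning operates on PyTorch tensors; `grad_fn` and `apply_update` stand in for the backward pass and the optimizer step):

```python
def train_with_accumulation(micro_batches, grad_fn, apply_update, accum_steps=8):
    """Emulate a batch size of 32 with micro-batches of 4 (accum_steps=8):
    accumulate scaled gradients and apply an optimizer update only once
    every accum_steps micro-batches."""
    accumulated = 0.0
    for i, batch in enumerate(micro_batches, start=1):
        # Scale each micro-batch gradient so the accumulated sum equals
        # the mean gradient over the effective (larger) batch.
        accumulated += grad_fn(batch) / accum_steps
        if i % accum_steps == 0:
            apply_update(accumulated)
            accumulated = 0.0
```

With micro-batches of 4 and `accum_steps=8`, each update sees gradient information from 32 examples at the memory cost of 4.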

Results and Analysis
Single-language Performance. All versions of GigaBERT perform very competitively, especially GigaBERT-v3/4. After adding the Wikipedia and Oscar data, GigaBERT-v2 starts to outperform mBERT and XLM-R base on most tasks. We find it crucial to continue training GigaBERT-v2 with a longer max sequence length of 512 word pieces, as the resulting GigaBERT-v3 model shows improvements on all four IE tasks. GigaBERT-v3 also outperforms AraBERT (Antoun et al., 2020), the state-of-the-art Arabic-specific BERT model, by a large margin, showing that our bilingual GigaBERT does not sacrifice per-language performance. It is worth noting that GigaBERT-v4 retains competitive single-language performance after training on the synthetically created code-switched data.
Cross-lingual Zero-shot Transfer Learning. All pre-trained models show varied performance depending on whether checkpoints are selected on the English dev set or the Arabic dev set, indicating that the best single-language performance does not necessarily imply the best cross-lingual performance. Compared to GigaBERT-v0, the additional data used to train GigaBERT-v1/2 helps improve zero-shot transfer capability, even though the added data is not from the news domain. Unlike previous work (Wu and Dredze, 2019; Pires et al., 2019) that attributes cross-lingual ability to shared subwords, GigaBERT-v3 has nearly no word pieces or scripts shared between English and Arabic, yet still shows strong cross-lingual performance. We hypothesize that the Transformer encoder projects the two languages into similar contextual representations, enabling cross-lingual transfer (Conneau et al., 2020b).
Code-Switched Pre-training. We show that we can further improve GigaBERT's cross-lingual transfer capability with a carefully designed code-switching procedure. Our GigaBERT-v4, pre-trained with code-switched data, shows significant improvement over GigaBERT-v3. Two aspects of the procedure appear important: 1) the choice of bilingual lexicons, where we prioritize Wikipedia titles during replacement, while MUSE appears to be the most effective; 2) we keep at least half of the sentences unchanged to balance real data and artificial data. In practice, the generated data for GigaBERT-v4 has 47.4% of its sentences code-switched. We present more comparison experiments using varied code-switching mixes and different bilingual lexicons in Appendix A.
Domain-adapted Pre-training. We also explore whether XLM-RoBERTa can be improved by additional pre-training on the Gigaword data, as Gururangan et al. (2020) have shown that continued pre-training on in-domain data is helpful. We create GigaXLM-R models by continuing pre-training from the XLM-R base and XLM-R large checkpoints in the Fairseq toolkit (Ott et al., 2019) for 500k steps on the shuffled Arabic and English Gigaword corpora (max sequence length 512 and batch size 4). Although only ∼1% of the Gigaword corpus is used in this continued training step due to computing resource limits, GigaXLM-R still improves zero-shot transfer performance for NER, POS, and RE over the original XLM-R models, as shown in Table 3. We would expect further improvement with a larger batch size and longer training.
Embedding Space Analysis. We further analyze the semantic similarity of parallel English-Arabic sentence representations and find that GigaBERT distinguishes parallel sentences from randomly paired sentences more effectively than its counterparts. Our hypothesis is that cross-lingual representations of parallel English-Arabic sentences should be similar, while those of randomly paired sentences should be dissimilar. To evaluate cross-lingual similarity, we extract sentence representations for 5,340 English-Arabic parallel sentence pairs from the GALE corpus 10 and for the same number of randomly paired sentences, using each pre-trained model at all 12 layers. We use the average of the hidden representations, excluding [CLS] and [SEP], as the sentence representation. Cosine similarity is calculated for each sentence pair and averaged across the whole corpus. In Figure 1, GigaBERT shows high similarity between parallel sentences and low similarity between randomly paired sentences, with a clear separation between the two types of pairs across all layers. In contrast, XLM-R is unable to distinguish between them, showing high similarity scores for both, while mBERT shows low similarity in both cases. This suggests that our GigaBERT preserves language-independent semantic information in its sentence representations, which may contribute to its competitive performance on downstream IE tasks.
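The similarity computation above can be sketched with toy vectors (in practice the inputs are 768-dimensional hidden states from each Transformer layer):

```python
import math

def mean_pool(token_vectors):
    """Average token representations (excluding [CLS] and [SEP]) into a
    single sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

For each model and layer, this similarity is computed over all parallel pairs and all random pairs, and the two averages are compared.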

Conclusions
In this paper, we show that the performance of zero-shot cross-lingual transfer can be improved by training a customized bilingual BERT for a given language pair and text domain. We pre-trained several masked language models (GigaBERTs) for Arabic-English and conducted a focused study on information extraction tasks in the newswire domain. The experiments show that our GigaBERT models outperform multilingual BERT, XLM-RoBERTa, and the monolingual AraBERT on the NER, POS, ARL, and RE tasks. We also achieve new state-of-the-art performance for zero-shot transfer learning from English to Arabic. We additionally studied code-switched pre-training for GigaBERT and domain-adapted pre-training for XLM-RoBERTa.

A Comparison Experiments for Code-Switched Pre-training
Given the English and Arabic monolingual corpora and the bilingual lexicons, we control code-switched data generation with several thresholds: 1) the sentence replacement threshold, the percentage of sentences within the whole corpus that are code-switched; 2) the token replacement threshold, the percentage of tokens replaced within each sentence; and 3) the choice of bilingual lexicons, where we explore different combinations of PanLex, MUSE, and Wiki titles. With the generated code-switched data, we can either pre-train GigaBERT from scratch or load an existing checkpoint (GigaBERT-v3) for continued pre-training; these are s1 and s2 in Table 4, respectively. As Table 4 shows, it is better to keep some sentences unchanged during code-switched pre-training, and continued pre-training (s2) performs slightly better than training from scratch (s1). During data augmentation, the token replacement ratio should be kept relatively low. The results also show that the MUSE dictionary is very promising, outperforming the combination of all dictionaries in some cases.