PhoBERT: Pre-trained language models for Vietnamese

We present PhoBERT with two versions, "base" and "large", the first public large-scale monolingual language models pre-trained for Vietnamese. We show that PhoBERT improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT is released at: https://github.com/VinAIResearch/PhoBERT


Introduction
Pre-trained language models, especially BERT, the Bidirectional Encoder Representations from Transformers [Devlin et al., 2019], have recently become extremely popular and helped produce significant improvements for various NLP tasks. However, the success of pre-trained BERT and its variants has largely been limited to the English language. For other languages, one could retrain a language-specific model using the BERT architecture [Martin et al., 2019; de Vries et al., 2019] or employ existing pre-trained multilingual BERT-based models [Devlin et al., 2019; Conneau et al., 2019; Conneau and Lample, 2019].
In terms of Vietnamese language modeling, to the best of our knowledge, there are two main concerns: (i) The Vietnamese Wikipedia corpus is the only data used to train all monolingual language models, and it is also the only Vietnamese dataset included in the pre-training data of all multilingual language models except XLM-R [Conneau et al., 2019]. It is worth noting that Wikipedia data is not representative of general language use, and the Vietnamese Wikipedia data is relatively small (1GB uncompressed), while pre-trained language models can be significantly improved by using more data [Liu et al., 2019].
(ii) All monolingual and multilingual models, except ETNLP, are unaware of the difference between Vietnamese syllables and word tokens (the ambiguity arises because white space is used both to separate words and to separate the syllables that constitute words in written Vietnamese). Without a pre-processing step of Vietnamese word segmentation, these models directly apply Byte-Pair Encoding (BPE) methods [Sennrich et al., 2016] to syllable-level Vietnamese pre-training data. Moreover, although ETNLP performs word segmentation before applying BPE on the Vietnamese Wikipedia corpus, it does not publicly release any pre-trained BERT-based model. As a result, it is difficult to apply existing pre-trained language models to word-level Vietnamese NLP tasks.
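To make the syllable-vs-word distinction concrete, the toy sketch below joins the syllables of known multi-syllable words with an underscore, the conventional word-segmented representation. The mini-lexicon and the greedy longest-match rule are purely illustrative assumptions, not the segmenter used for PhoBERT.

```python
# Toy illustration of Vietnamese word segmentation (NOT PhoBERT's segmenter).
# In written Vietnamese, "sinh viên" ("student") is one word written as two
# space-separated syllables; a word segmenter joins them ("sinh_viên") so
# that BPE later operates on word tokens rather than on bare syllables.

# Hypothetical mini-lexicon of two-syllable words, for illustration only.
LEXICON = {"sinh viên", "đại học", "Hà Nội"}

def segment_words(sentence: str) -> str:
    """Greedy longest-match word segmentation over white-space syllables."""
    syllables = sentence.split()
    out, i = [], 0
    while i < len(syllables):
        pair = " ".join(syllables[i:i + 2])
        if pair in LEXICON:                     # two syllables form one word
            out.append(pair.replace(" ", "_"))  # join them with "_"
            i += 2
        else:                                   # single-syllable word token
            out.append(syllables[i])
            i += 1
    return " ".join(out)

print(segment_words("Tôi là sinh viên đại học"))
# -> "Tôi là sinh_viên đại_học": 4 word tokens instead of 6 syllables
```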
To handle the two concerns above, we train the first large-scale monolingual BERT-based "base" and "large" models using a 20GB word-level Vietnamese corpus. We evaluate our models on three downstream Vietnamese NLP tasks: the two most common ones of Part-of-speech (POS) tagging and Named-entity recognition (NER), and a language understanding task of Natural language inference (NLI). Experimental results show that our models obtain state-of-the-art (SOTA) performance on all three tasks. We release our models under the name PhoBERT in popular open-source libraries, hoping that PhoBERT can serve as a strong baseline for future Vietnamese NLP research and applications.

PhoBERT
This section outlines the architecture and describes the pre-training data used for PhoBERT.

Architecture: PhoBERT has two versions, PhoBERT-base and PhoBERT-large, using the same configurations as BERT-base and BERT-large, respectively. The PhoBERT pre-training approach is based on RoBERTa [Liu et al., 2019], which optimizes the BERT pre-training procedure for more robust performance.

Data: We use a pre-training dataset of 20GB of uncompressed text after cleaning. This dataset is a combination of two corpora: (i) the Vietnamese Wikipedia corpus (∼1GB), and (ii) a ∼19GB subset of a 40GB Vietnamese news corpus obtained after filtering out similar news and duplicates. We perform Vietnamese word segmentation on this corpus before applying BPE to produce subword units.


Experiments

Experimental setup: For POS tagging and NER, we append a linear prediction layer on top of the PhoBERT architecture w.r.t. the first subword token of each word. We fine-tune PhoBERT for each task and each dataset independently, employing the Hugging Face transformers library for POS tagging and NER and the RoBERTa implementation in fairseq for NLI. We use AdamW [Loshchilov and Hutter, 2019] with a fixed learning rate of 1e-5 and a batch size of 32. We fine-tune for 30 training epochs, evaluate the task performance on the validation set after each epoch (early stopping is applied when there is no improvement after 5 consecutive epochs), and then select the best model to report the final result on the test set.

Main results: Table 1 compares our PhoBERT scores with the previous highest reported results under the same experimental setup. PhoBERT produces new SOTA results for all three tasks, where, unsurprisingly, PhoBERT-large obtains higher performance than PhoBERT-base.
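As a concrete illustration of the experimental setup above, the sketch below loads a PhoBERT checkpoint with the Hugging Face transformers library, attaches a token classification (linear prediction) head, and takes one prediction per word from the first subword token of that word. The checkpoint name "vinai/phobert-base" is the publicly released model; the label set size and the example sentence are placeholders.

```python
# A minimal sketch of the fine-tuning setup described above; the dataset
# wiring and the size of the tag set (NUM_LABELS) are placeholders.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

NUM_LABELS = 9  # e.g. a BIO tag set for NER; illustrative size only
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=NUM_LABELS
)
# Fixed learning rate of 1e-5, as in the setup above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# PhoBERT expects word-segmented input (syllables of a word joined by "_").
words = ["Tôi", "là", "sinh_viên", "đại_học"]

# Encode word by word so we can record the index of each word's FIRST
# subword token: the linear prediction layer is applied w.r.t. that token.
input_ids = [tokenizer.cls_token_id]
first_subword_idx = []
for w in words:
    sub_ids = tokenizer.encode(w, add_special_tokens=False)
    first_subword_idx.append(len(input_ids))  # position of w's first subword
    input_ids.extend(sub_ids)
input_ids.append(tokenizer.sep_token_id)

batch = torch.tensor([input_ids])
logits = model(input_ids=batch).logits        # (1, seq_len, NUM_LABELS)
word_logits = logits[0, first_subword_idx]    # one prediction per word
print(word_logits.shape)                      # torch.Size([4, NUM_LABELS])
```

Selecting the first subword position of each word is what lets a subword-level model emit exactly one tag per word-segmented word, regardless of how many BPE pieces the word is split into.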
For POS tagging, PhoBERT obtains about 0.8% absolute higher accuracy than the feature- and neural-network-based models VnCoreNLP-POS (i.e. VnMarMoT) and jointWPD. For NER, PhoBERT-large obtains an F1 score 1.1 points higher than PhoBERT-base, which in turn is 2+ points higher than the feature- and neural-network-based models VnCoreNLP-NER and BiLSTM-CNN-CRF trained with the BERT-based ETNLP word embeddings. For NLI, PhoBERT outperforms the multilingual BERT and the BERT-based cross-lingual model XLM trained with both masked and translation language modeling objectives (XLM MLM+TLM) by large margins. PhoBERT also performs slightly better than the cross-lingual model XLM-R while using far fewer parameters (base: 135M vs. 250M; large: 370M vs. 560M).

Discussion: Using more pre-training data can significantly improve the quality of pre-trained language models [Liu et al., 2019]. It is thus not surprising that PhoBERT outperforms ETNLP on NER, and outperforms the multilingual BERT and XLM MLM+TLM on NLI: PhoBERT employs 20GB of Vietnamese text, while those models employ only the 1GB Vietnamese Wikipedia data.
Our PhoBERT also does better than XLM-R, even though XLM-R uses a 2.5TB pre-training corpus containing 137GB of Vietnamese text (i.e. about 137/20 ≈ 7 times bigger than our pre-training corpus). Recall that PhoBERT segments text into subword units only after performing Vietnamese word segmentation, while XLM-R applies BPE directly to syllable-level Vietnamese pre-training data. Clearly, word-level information plays a crucial role in the Vietnamese language understanding task of NLI, i.e. word segmentation is necessary to improve NLI performance. This reconfirms that dedicated language-specific models still outperform multilingual ones [Martin et al., 2019].
Experiments also show that a straightforward fine-tuning recipe, as used here, can lead to SOTA results. Note that we might boost our downstream task performance even further with more careful hyper-parameter tuning.
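Continuing the sketch above, that straightforward recipe (30 epochs, per-epoch validation, early stopping after 5 epochs without improvement, best checkpoint kept for the test run) might look like the following schematic; train_one_epoch, evaluate and validation_set are hypothetical task-specific helpers, not part of any library.

```python
# Schematic of the fine-tuning loop described in the experimental setup.
# Assumes `model` and `optimizer` from the previous sketch; train_one_epoch
# and evaluate are hypothetical placeholders for the task-specific loops.
best_score, patience, best_state = -1.0, 0, None
for epoch in range(30):
    train_one_epoch(model, optimizer)        # one pass over the training set
    score = evaluate(model, validation_set)  # e.g. accuracy or F1
    if score > best_score:
        best_score, patience = score, 0
        # Keep a copy of the best-performing weights seen so far.
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        patience += 1
        if patience == 5:  # no improvement for 5 consecutive epochs
            break
model.load_state_dict(best_state)  # report test results with the best model
```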

Conclusion
In this paper, we have presented the first public large-scale PhoBERT language models for Vietnamese. We demonstrate the usefulness of PhoBERT by producing new state-of-the-art performance on three Vietnamese NLP tasks: POS tagging, NER and NLI. By publicly releasing PhoBERT, we hope that it can foster future research and applications in Vietnamese NLP. Our PhoBERT and its usage are available at: https://github.com/VinAIResearch/PhoBERT.