Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training

Bo Zheng; Li Dong; Shaohan Huang; Saksham Singhal; Wanxiang Che (车万翔); Ting Liu; Xia Song; Furu Wei

doi:10.18653/v1/2021.emnlp-main.257

Abstract

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to limited vocabulary capacity. To this end, we propose an algorithm, VoCap, to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down pre-training. To address this issue, we propose k-NN-based target sampling to accelerate the expensive softmax computation. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side effects of increasing the vocabulary size, achieving comparable performance with faster pre-training. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
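
The abstract does not spell out how k-NN-based target sampling works, so the following is only a minimal sketch of the general idea as it might be read: the softmax normaliser is computed over the target tokens of a batch plus their nearest neighbours in embedding space, rather than over the full vocabulary. The function names (`knn_candidates`, `sampled_softmax_loss`), the use of dot-product similarity, and the toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_candidates(embeddings, target_ids, k):
    """Return the sorted union of the targets and each target's k nearest
    neighbours (hypothetical helper; similarity is assumed to be dot product)."""
    candidates = set(target_ids)
    for t in set(target_ids):
        scores = [(dot(embeddings[t], e), j)
                  for j, e in enumerate(embeddings) if j != t]
        scores.sort(reverse=True)
        candidates.update(j for _, j in scores[:k])
    return sorted(candidates)


def sampled_softmax_loss(hidden, target_id, embeddings, candidates):
    """Cross-entropy whose normaliser sums over `candidates` only.
    The target must be in `candidates` for the loss to be non-negative."""
    logits = [dot(hidden, embeddings[j]) for j in candidates]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - dot(hidden, embeddings[target_id])


# Toy vocabulary of four 2-d output embeddings (illustrative values only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hidden = [1.0, 0.0]

subset = knn_candidates(embeddings, target_ids=[0], k=1)
approx_loss = sampled_softmax_loss(hidden, 0, embeddings, subset)
full_loss = sampled_softmax_loss(hidden, 0, embeddings, list(range(len(embeddings))))
```

In this sketch the restricted loss is a lower bound on the full-vocabulary loss, because the normaliser sums over fewer terms; the saving comes from computing logits for only the candidate subset instead of every vocabulary entry.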

Anthology ID: 2021.emnlp-main.257
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3203–3215
URL: https://aclanthology.org/2021.emnlp-main.257/
DOI: 10.18653/v1/2021.emnlp-main.257
Cite (ACL): Bo Zheng, Li Dong, Shaohan Huang, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, and Furu Wei. 2021. Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3203–3215, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Allocating Large Vocabulary Capacity for Cross-Lingual Language Model Pre-Training (Zheng et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.257.pdf
Video: https://aclanthology.org/2021.emnlp-main.257.mp4
Code: bozheng-hit/vocapxlm + additional community code
Data: MLQA, PAWS-X, TyDiQA, XNLI, XQuAD
