You should evaluate your language model on marginal likelihood over tokenisations

Kris Cao, Laura Rimell


Abstract
Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate a pretrained language model on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one best, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.
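The sampling-based estimation the abstract describes can be illustrated with a small importance-sampling sketch: draw tokenisations of a string from a proposal distribution, reweight by the language model's probability, and average in log space. Everything below is a toy illustration under assumed numbers, not the paper's actual tokeniser, model, or estimator details.

```python
import math
import random

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def estimate_marginal_loglik(sample_tokenisation, lm_logprob, n_samples=2000, seed=0):
    """Importance-sampling estimate of log p(x) = log sum_t p(t), where the
    sum runs over tokenisations t of the string x.

    sample_tokenisation(rng) draws one tokenisation t from a proposal q and
    returns (tokens, log q(t | x)); lm_logprob(tokens) returns log p(t) under
    the language model. The estimator averages p(t) / q(t | x) over samples,
    computed in log space.
    """
    rng = random.Random(seed)
    terms = [
        lm_logprob(tokens) - log_q
        for tokens, log_q in (sample_tokenisation(rng) for _ in range(n_samples))
    ]
    return logsumexp(terms) - math.log(n_samples)

# Toy setup (all numbers are illustrative assumptions): the string "ab" has
# two tokenisations, with hypothetical LM probabilities p(("ab",)) = 0.6 and
# p(("a", "b")) = 0.3, so the true marginal is 0.9.
LM_PROBS = {("ab",): 0.6, ("a", "b"): 0.3}

def sample_tok(rng):
    # Uniform proposal q over the two candidate tokenisations.
    tokens = rng.choice([("ab",), ("a", "b")])
    return tokens, math.log(0.5)

est = estimate_marginal_loglik(sample_tok, lambda t: math.log(LM_PROBS[t]))
# The estimate converges to log 0.9, which is higher (i.e. better) than the
# one-best log-likelihood log 0.6 -- the gap the abstract highlights.
```

In this toy case the gap between the marginal (0.9) and the one-best tokenisation (0.6) is large because the proposal puts substantial mass on an alternative tokenisation; in practice the gap grows with tokeniser uncertainty, which is the link to tokeniser entropy drawn in the abstract.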
Anthology ID:
2021.emnlp-main.161
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2104–2114
URL:
https://aclanthology.org/2021.emnlp-main.161
DOI:
10.18653/v1/2021.emnlp-main.161
Cite (ACL):
Kris Cao and Laura Rimell. 2021. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2104–2114, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
You should evaluate your language model on marginal likelihood over tokenisations (Cao & Rimell, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.161.pdf
Video:
https://aclanthology.org/2021.emnlp-main.161.mp4
Data:
mC4