Finding the Optimal Vocabulary Size for Neural Machine Translation

Thamme Gowda, Jonathan May


Abstract
We cast neural machine translation (NMT) as a classification task in an autoregressive setting and analyze the limitations of both classification and autoregression components. Classifiers are known to perform better with balanced class distributions during training. Since the Zipfian nature of languages causes imbalanced classes, we explore its effect on NMT. We analyze the effect of various vocabulary sizes on NMT performance on multiple languages with many data sizes, and reveal an explanation for why certain vocabulary sizes are better than others.
Anthology ID:
2020.findings-emnlp.352
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3955–3964
URL:
https://aclanthology.org/2020.findings-emnlp.352
DOI:
10.18653/v1/2020.findings-emnlp.352
Cite (ACL):
Thamme Gowda and Jonathan May. 2020. Finding the Optimal Vocabulary Size for Neural Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3955–3964, Online. Association for Computational Linguistics.
Cite (Informal):
Finding the Optimal Vocabulary Size for Neural Machine Translation (Gowda & May, Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.352.pdf
Optional supplementary material:
2020.findings-emnlp.352.OptionalSupplementaryMaterial.zip
Code:
thammegowda/005-nmt-imbalance