Optimizing Word Segmentation for Downstream Task

Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki


Abstract
In traditional NLP, we tokenize a given sentence as a preprocessing step, so the tokenization is unrelated to the target downstream task. To address this issue, we propose a novel method to explore a tokenization that is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task that uses a vector representation of a sentence, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect.
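
The core idea described in the abstract is compact: produce N candidate tokenizations of a sentence, encode each into a sentence vector, and combine the vectors weighted by the tokenizations' probabilities, so that the downstream loss can reshape those probabilities. The following is a minimal PyTorch sketch of that idea under stated assumptions, not the authors' released implementation (see the tatHi/optok repository linked below); the class name OpTokSketch, the candidate-list interface, and the unigram scoring are illustrative.

import torch
import torch.nn as nn

class OpTokSketch(nn.Module):
    # Minimal sketch (NOT the released tatHi/optok code): mix the
    # sentence vectors of N candidate tokenizations of one sentence,
    # weighted by their learned tokenization probabilities.
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # Mean-pooled token embeddings serve as the sentence encoder.
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        # Learnable unigram scores; a candidate's probability comes from
        # the sum of its tokens' scores (a unigram-LM stand-in).
        self.unigram_score = nn.Parameter(torch.zeros(vocab_size))
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, candidates):
        # candidates: list of 1-D LongTensors, each holding the token ids
        # of one tokenization of the SAME sentence, e.g. produced by an
        # external N-best segmenter (assumed, not shown here).
        sent_vecs, log_scores = [], []
        for ids in candidates:
            sent_vecs.append(self.embed(ids.unsqueeze(0)))    # (1, D)
            log_scores.append(self.unigram_score[ids].sum())  # scalar
        weights = torch.softmax(torch.stack(log_scores), dim=0)       # (N,)
        mixed = (weights.unsqueeze(1) * torch.cat(sent_vecs)).sum(0)  # (D,)
        return self.classifier(mixed)  # (num_classes,)

# The downstream loss backpropagates into both the classifier and the
# unigram scores, so tokenizations that help the task gain probability.
model = OpTokSketch(vocab_size=8000, embed_dim=64, num_classes=2)
candidates = [torch.tensor([5, 17, 42]), torch.tensor([5, 301]), torch.tensor([1204])]
loss = nn.functional.cross_entropy(model(candidates).unsqueeze(0), torch.tensor([1]))
loss.backward()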
Anthology ID:
2020.findings-emnlp.120
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1341–1351
URL:
https://aclanthology.org/2020.findings-emnlp.120
DOI:
10.18653/v1/2020.findings-emnlp.120
Cite (ACL):
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, and Naoaki Okazaki. 2020. Optimizing Word Segmentation for Downstream Task. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1341–1351, Online. Association for Computational Linguistics.
Cite (Informal):
Optimizing Word Segmentation for Downstream Task (Hiraoka et al., Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.120.pdf
Code
tatHi/optok
Data
SNLI