Optimizing Word Segmentation for Downstream Task

In traditional NLP, we tokenize a given sentence as a preprocessing step, and thus the tokenization is independent of the target downstream task. To address this issue, we propose a novel method to explore a tokenization that is appropriate for the downstream task. Our proposed method, optimizing tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task loss. OpTok can be used for any downstream task that uses a vector representation of a sentence, such as text classification. Experimental results demonstrate that OpTok improves the performance of sentiment analysis and textual entailment. In addition, we introduce OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect.


Introduction
Tokenization is a fundamental problem in natural language processing (NLP). We must split a given sequence into a sequence of words for languages that do not contain obvious boundaries, such as Chinese and Japanese. In addition, it is also better to explore appropriate segmentations for languages containing obvious boundaries indicated by whitespaces, such as English (Dredze, 2015, 2016; Sennrich et al., 2016; He and Sun, 2017; A and Augenstein, 2020; Bollegala et al., 2020).
In traditional NLP, we tokenize a given sentence as a preprocessing step. Thus, as shown in Figure 1(a), we apply an existing tokenizer to the given sentence and then input the tokenized sentence into a model for a target downstream task. In the conventional approach, we obtain the most plausible tokenized sentence based on the tokenizer; however, some studies have varied the tokenization by sampling during training to enable the downstream model to adapt to various tokenizations (Kudo, 2018; Hiraoka et al., 2019). Although such a strategy makes the downstream model robust, little attention has been paid to optimizing the tokenizer itself for a downstream task. Thus, if we acquire a tokenization appropriate for a downstream task, we might improve the task performance. By contrast, some studies have used multiple tokenized sentences to mitigate the damage caused by tokenization errors (Chen et al., 2017). Their methods compute various tokenizations for a given sentence and then encode the tokenizations using an architecture based on the LSTM (Hochreiter and Schmidhuber, 1997). Although their methods prevent error propagation from the tokenizer, they are intractable when handling all possible tokenizations owing to the computational costs required.

Figure 1: Overview of (a) conventional tokenization and (b) the optimizing tokenization proposed herein. We directly optimize the tokenizer to improve the performance of the model for a downstream task using the loss of the target task.
This paper describes an exploration of tokenizations appropriate for downstream tasks. We propose a novel method to optimize a tokenizer based on the downstream task, as shown in Figure 1(b). The proposed method generates multiple tokenized sentences as candidates and inputs them into the downstream model. We then update the parameters of the tokenizer to decrease the training loss, so that the tokenizer learns to output a better tokenization for the downstream task. We design the proposed method to be usable for any downstream task that uses a vector representation of a sentence. We conduct experiments on text classification in three languages and show the effectiveness of the proposed method. Moreover, we show that the proposed method can also be introduced into a pre-trained architecture: we combine it with the state-of-the-art contextualized embeddings, BERT (Devlin et al., 2018), and improve its performance.

Figure 2: Outline of the proposed method for calculating a sentence vector h_s with the 3-best tokenizations during the training phase. At inference, we use the 1-best tokenization, as in general neural architectures. The arrows along the continuous line indicate the differentiable paths for backpropagation. We can use various architectures as the Encoder, which converts a sequence of tokens into a single vector. The Downstream Model is the architecture for the downstream task, e.g., an MLP for text classification.

Model Outline
We propose a new architecture for optimizing tokenization, OpTok. OpTok explores a tokenization appropriate for a downstream task; in other words, it explores a tokenization that yields a better score on the downstream task. Formally, OpTok converts a given sentence s into a sequence of tokens in a vocabulary w ∈ V, i.e., s = w_1 ... w_i ... w_I, where I is the number of tokens in the sentence. We want the downstream model to achieve the best score with s among all possible tokenized sentences. Thus, let q(·) be an evaluation function, z be the ground truth of the downstream task, and f(·) be the downstream model, i.e., any neural architecture; we search for the tokenization ŝ = argmax_{s'} q(z, f(s')). To find such an ŝ, we train OpTok based on the score of the downstream task, q(z, f(s)). Thus, we optimize both OpTok and the downstream model simultaneously, in contrast to the traditional pipeline approach, which tokenizes a given sentence as a preprocessing step. OpTok generates multiple tokenized sentences as candidates, and we train OpTok to assign a high probability to a better tokenization based on the score of the downstream task. During the inference step, OpTok outputs only the most plausible tokenized sentence to reduce the computational cost.

Figure 2 shows an overview of OpTok with the downstream model during training. OpTok constructs N tokenized sentences and converts them into vector representations with a neural encoder. Then, OpTok combines the probabilities of each tokenization with the vector representations: we compute the sum of the vector representations weighted by the probabilities and input it into the downstream model. Through training, OpTok thus learns to assign a high probability to the tokenization that improves the performance of the downstream task, and we can obtain ŝ. We describe the details of each module in this section.

Neural Unigram Language Model
OpTok calculates the probability of a token p(w) with a neural unigram language model as follows:

p(w) = \frac{\exp(\mathrm{MLP}(v_w))}{\sum_{\hat{w} \in V} \exp(\mathrm{MLP}(v_{\hat{w}}))},    (1)

where MLP is a multilayer perceptron containing trainable parameters and v_w is an embedding of the word w.
To stabilize the learning, as explained in Section 2.5, we employ a smoothed distribution of the unigram probability (Kudo, 2018) with a hyperparameter α. Concretely, we obtain the smoothed probability as

p^*(w) = \frac{p(w)^{\alpha}}{\sum_{\hat{w} \in V} p(\hat{w})^{\alpha}}.    (2)

We convert a sentence into a sequence of tokens depending on the probability of the tokenized sentence:

p(s) = \prod_{i=1}^{I} p^*(w_i).    (3)

We initialize the vocabulary V with a reasonable number of tokens. To choose the initial vocabulary, both supervised and unsupervised word segmentation methods are available, e.g., publicly available pre-trained tokenizers (Kudo, 2006; Yang et al., 2017) and vocabularies acquired using unsupervised word segmentation (Goldwater et al., 2006; Mochihashi et al., 2009; Sennrich et al., 2016). In this study, we use SentencePiece (Kudo and Richardson, 2018) for initialization.
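The following is a minimal PyTorch sketch of Eqs. (1)-(3), i.e., the token probability, its α-smoothed version, and the probability of a tokenized sentence; the embedding dimension, the two-layer MLP structure, and all names are illustrative assumptions rather than the exact implementation used in our experiments.

```python
# A minimal sketch of the neural unigram language model in Eqs. (1)-(3).
import torch
import torch.nn as nn

class NeuralUnigramLM(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 64, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.emb = nn.Embedding(vocab_size, emb_dim)      # v_w, shared with the encoder
        self.mlp = nn.Sequential(                         # two-layer perceptron (assumed sizes)
            nn.Linear(emb_dim, emb_dim), nn.Tanh(), nn.Linear(emb_dim, 1)
        )

    def token_probs(self) -> torch.Tensor:
        scores = self.mlp(self.emb.weight).squeeze(-1)    # one score per vocabulary item
        return torch.softmax(scores, dim=-1)              # p(w), Eq. (1)

    def smoothed_probs(self) -> torch.Tensor:
        p = self.token_probs() ** self.alpha              # p(w)^alpha
        return p / p.sum()                                # p*(w), Eq. (2)

    def sentence_prob(self, token_ids: torch.Tensor) -> torch.Tensor:
        p_star = self.smoothed_probs()
        return p_star[token_ids].prod()                   # p(s) = prod_i p*(w_i), Eq. (3)

lm = NeuralUnigramLM(vocab_size=1000)
print(lm.sentence_prob(torch.tensor([3, 14, 159])))       # probability of a toy tokenization
```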

Module for Selecting Tokenization
OpTok generates multiple tokenized sentences as candidates and converts them into a single vector using their probabilities during the training phase.
First, we obtain the N-best tokenizations of the sentence, s_1, ..., s_n, ..., s_N, using the Forward-DP Backward-A* algorithm (Nagata, 1994) with the probabilities produced by the language model described in Section 2.2.
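As a stand-in for the Forward-DP Backward-A* algorithm, the following brute-force sketch enumerates every segmentation of a short string and keeps the N most probable ones under the smoothed unigram probabilities; the fallback probability for out-of-vocabulary tokens is an assumption made only for this illustration, and the approach is feasible only for very short inputs.

```python
# Brute-force N-best segmentation, standing in for Forward-DP Backward-A*.
from itertools import combinations

def nbest_segmentations(chars: str, p_star: dict, n: int = 3):
    """Return the n segmentations with the highest product of smoothed token
    probabilities; unknown tokens get a tiny floor probability (an assumption)."""
    candidates = []
    positions = range(1, len(chars))
    for k in range(len(chars)):
        for cut in combinations(positions, k):
            bounds = (0,) + cut + (len(chars),)
            tokens = [chars[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
            prob = 1.0
            for t in tokens:
                prob *= p_star.get(t, 1e-10)
            candidates.append((prob, tokens))
    candidates.sort(key=lambda x: x[0], reverse=True)
    return candidates[:n]

p_star = {"ab": 0.4, "a": 0.2, "b": 0.2, "c": 0.15, "abc": 0.05}
print(nbest_segmentations("abc", p_star, n=3))
```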
Second, we convert each tokenized sequence into a vector h_{s_n} = g(s_n), where g(·) is a neural encoder that encodes a sequence of tokens, such as a CNN or BiLSTM. We found that learning is stabilized by sharing the word embeddings between the encoder and the neural unigram language model. Finally, we calculate the final vector of the sentence by weighting the vectors of the candidates with their probabilities calculated through Eq. (3) as follows:

h_s = \sum_{n=1}^{N} a_n h_{s_n}, \quad a_n = \frac{p(s_n)}{\sum_{m=1}^{N} p(s_m)}.    (4)

Similarly to the attention mechanism, we normalize the probabilities so that they satisfy the restriction \sum_{n=1}^{N} a_n = 1. We can use the vector h_s in the same way as general encoded vectors. For example, we can construct a neural text classifier by converting h_s into a label-sized vector with an MLP. By updating the entire model with the training loss, such as the cross-entropy loss against the gold label, the language model learns to assign a higher probability to the tokenization that is useful for the downstream task. At inference time, we obtain the optimal tokenization using the Viterbi algorithm (Viterbi, 1967).
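A minimal sketch of the weighting step in Eq. (4): each candidate tokenization is encoded, the sentence probabilities are normalized into weights a_n, and their weighted sum gives h_s. The mean-pooling encoder here is only a placeholder for the BiLSTM (or BERT) encoder used in our experiments.

```python
# Weighting the N-best candidate encodings into a single sentence vector h_s.
import torch

def sentence_vector(candidate_ids, candidate_probs, embedding):
    """candidate_ids: list of N LongTensors (one tokenization each),
    candidate_probs: tensor of N sentence probabilities p(s_n)."""
    h_candidates = torch.stack(
        [embedding(ids).mean(dim=0) for ids in candidate_ids]  # h_{s_n} = g(s_n)
    )
    a = candidate_probs / candidate_probs.sum()                 # a_n, sum_n a_n = 1
    return (a.unsqueeze(-1) * h_candidates).sum(dim=0)          # h_s, Eq. (4)

emb = torch.nn.Embedding(1000, 16)
ids = [torch.tensor([3, 14]), torch.tensor([3, 1, 4]), torch.tensor([31, 4])]
probs = torch.tensor([0.5, 0.3, 0.2])
h_s = sentence_vector(ids, probs, emb)
print(h_s.shape)  # torch.Size([16])
```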

Restricting Vocabulary
To avoid local optima in which the model uses longer and more sentence-specific tokens, we restrict the size of the vocabulary during training. Concretely, OpTok constructs a restricted vocabulary V' sampled from the original vocabulary V, where |V'| ≤ |V|, at the beginning of each mini-batch and uses V' as the vocabulary for that mini-batch. The sampling is based on the smoothed token probabilities p*(w) described in Section 2.2. We then calculate a new probability distribution over the tokens in V' by renormalizing their probabilities. Moreover, OpTok keeps embeddings for all tokens in V, but treats any token outside V' as an unknown token. At inference time, we construct the vocabulary by taking the top-|V'| tokens from V based on the updated token probabilities obtained by Eq. (2).
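The following sketch illustrates the per-mini-batch vocabulary restriction, assuming the smoothed probabilities p*(w) are given as a tensor over the full vocabulary; the function name and the way unknown tokens are marked are illustrative choices, not those of our implementation.

```python
# Sample a restricted vocabulary V' from V according to p*(w) and renormalize.
import torch

def restrict_vocab(p_star: torch.Tensor, restricted_size: int):
    """p_star: smoothed token probabilities over the full vocabulary V."""
    kept = torch.multinomial(p_star, restricted_size, replacement=False)  # sample V'
    restricted_p = p_star[kept] / p_star[kept].sum()                      # renormalize over V'
    is_known = torch.zeros_like(p_star, dtype=torch.bool)
    is_known[kept] = True                                                 # tokens outside V' -> UNK
    return kept, restricted_p, is_known

p_star = torch.softmax(torch.randn(32000), dim=-1)
kept, restricted_p, is_known = restrict_vocab(p_star, restricted_size=16000)
print(kept.shape, restricted_p.sum().item())  # 16000 sampled ids, probabilities summing to 1
```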
Such sampling of the vocabulary increases the diversity of tokenizations among the N-best candidates during training. With a lower α (Section 2.2), the distribution over tokens becomes flatter, and the model can sample more varied tokens for V'. In addition, through the sampling process, we can reduce the importance of tokens in V that are not useful for the downstream task. This procedure is related to vocabulary restriction with a continuous cache technique (Grave et al., 2016; Kawakami et al., 2017).

Maintaining Nature of Language Model
Since the optimization of OpTok depends only on the loss function for the downstream task, the language model of OpTok might diverge considerably from the unigram language model (i.e., word frequencies) obtained from the training corpus. Meanwhile, we have to keep the corpus-based language model in some cases. To address such cases, we can use the following loss for the sentence s to update the language model with the neural EM algorithm (Deligne and Bimbot, 1995; Liang and Klein, 2009; Tran et al., 2016):

L^{lm}_s = -\log \sum_{n=1}^{N} p(s_n).    (5)

We then optimize the weighted sum of the downstream task loss and L^{lm}_s. Consider text classification as an example, where we use the cross-entropy loss against the ground-truth label of the sentence, L^{cl}_s. Thus, we optimize the following equation:

L_s = L^{cl}_s + \mu L^{lm}_s,    (6)

where µ is a hyperparameter. Note that we set µ = 0 to confirm the effect of the proposed method in this study.
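A small sketch of the combined objective, using the language model loss defined above over the N-best candidates; note that we set µ = 0 in our experiments, so the nonzero µ here is only for illustration, and all names are assumptions of this sketch.

```python
# Combined loss: downstream cross-entropy plus mu times the LM loss (Eqs. (5)-(6)).
import torch
import torch.nn.functional as F

def total_loss(logits, gold_label, candidate_probs, mu=0.1):
    loss_cl = F.cross_entropy(logits.unsqueeze(0), gold_label.unsqueeze(0))  # L^cl_s
    loss_lm = -torch.log(candidate_probs.sum() + 1e-12)                      # L^lm_s
    return loss_cl + mu * loss_lm

logits = torch.randn(3)                  # label-sized output of the downstream MLP
gold = torch.tensor(1)
probs = torch.tensor([0.5, 0.3, 0.2])    # p(s_n) of the N-best tokenizations
print(total_loss(logits, gold, probs))
```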

Experiments
The goal of this study is to improve the performance of downstream tasks by optimizing the tokenization. Therefore, we evaluate OpTok on various text classification tasks to validate its effect.

Dataset
We evaluate OpTok on text classification, in which a model predicts a label from an input text.
To confirm the effectiveness of our method across languages, we use sentiment analysis datasets for Chinese, Japanese, and English. We employed corpora from the SNS domain because they contain many informal expressions, so differences in tokenization strongly affect the performance of text classification. In addition, we also conducted experiments on a dataset whose sentences carry two kinds of labels to investigate whether OpTok finds a different tokenization for each label. Furthermore, we used a textual entailment dataset to show that OpTok can be applied to tasks that take two sentences as input. We describe the details of these datasets in the following.

Weibo(Zh) is a dataset of short Chinese texts from an SNS with two sentiment labels: positive or negative. Because the available data are already tokenized with a preprocessor, we detokenize them by removing the whitespaces.

Twitter(Ja) is a dataset of short Japanese texts from an SNS about products such as electric appliances. The samples in this dataset originally have five sentiment labels for the target topic: positive, negative, neutral, both positive and negative, and unrelated. As of the summer of 2018, 352,554 tweets were available, and we extracted only tweets with a single sentiment label of positive, negative, or neutral; in other words, we removed the "both positive and negative" and "unrelated" samples to prevent confusion.

Twitter(En) is a dataset of short English texts from an SNS with two sentiment labels: positive or negative. We used this corpus without any preprocessing.
SNLI (Bowman et al., 2015) is a widely used dataset for recognizing textual entailment, a text classification task in English that requires two input sentences. We employed this dataset to validate the performance of OpTok when using multiple sentences. We used the default split of this corpus and used only the labeled samples, following existing studies.
Genre&Rating are datasets in English that we created from Amazon product data, which contains reviews from 24 product genres, where each review has a user rating from 1 to 5. We sampled 5K reviews from each product genre.
In this process, we counted the number of tokens in each review based on whitespaces and removed reviews containing more than 200 tokens. We used the sampled reviews for both the rating prediction and genre prediction tasks over the same review texts.
For the sentiment analysis datasets, we randomly split each dataset at a ratio of 8:1:1 for training, validation, and testing. We also split the genre and rating prediction dataset at a ratio of 8:1:1 with balanced genres, and both tasks share the same split. Table 1 shows an overview of each sentiment analysis dataset.

Experimental Settings
For the unigram language model in OpTok, we used a two-layer perceptron as the MLP in Eq. (1). We used a BiLSTM and a linear layer as the encoder to compute h_s in Eq. (4): we applied the BiLSTM to the sentence tokenized according to the unigram language model and then fed the max-pooled outputs to the linear layer. In this procedure, we applied the tanh activation function before and after the linear layer. Then, we applied dropout to the sentence representations with a rate of 0.5. For SNLI, we shared the parameters between the encoders for the premise and the hypothesis and concatenated both encoded representations. As the downstream model, we used a three-layer perceptron that outputs a label-sized vector. We compared OpTok with SentencePiece (Kudo and Richardson, 2018), which is a widely used tokenizer. Concretely, we obtained a tokenized sentence with SentencePiece and then used it as input to the encoder; in other words, we replaced the unigram language model in OpTok with the SentencePiece tokenizer and fed a single tokenized sentence into the same architecture. Moreover, many studies have reported that training models with stochastic tokenization leads to better downstream performance than training with deterministic tokenization (Kudo, 2018; Hiraoka et al., 2019; Provilkov et al., 2019). Thus, we also trained the encoder and downstream model using the subword regularization provided by SentencePiece.
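The following sketch mirrors the encoder and downstream classifier described above (BiLSTM, max-pooling, tanh before and after a linear layer, dropout of 0.5, and a three-layer perceptron); the hidden size and the ReLU activations inside the classifier are illustrative assumptions.

```python
# Sketch of the BiLSTM encoder and the three-layer downstream classifier.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, emb: nn.Embedding, hidden: int = 128):
        super().__init__()
        self.emb = emb                                            # shared with the unigram LM
        self.bilstm = nn.LSTM(emb.embedding_dim, hidden, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * hidden, 2 * hidden)
        self.dropout = nn.Dropout(0.5)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # token_ids: (batch, length)
        out, _ = self.bilstm(self.emb(token_ids))                 # (batch, length, 2*hidden)
        pooled = torch.tanh(out.max(dim=1).values)                # max-pool, tanh before linear
        return self.dropout(torch.tanh(self.linear(pooled)))      # tanh after linear, dropout

def classifier(in_dim: int, num_labels: int) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                         nn.Linear(in_dim, in_dim), nn.ReLU(),
                         nn.Linear(in_dim, num_labels))           # three-layer perceptron

emb = nn.Embedding(16000, 64)
enc, clf = Encoder(emb), classifier(256, 3)
print(clf(enc(torch.randint(0, 16000, (2, 10)))).shape)           # torch.Size([2, 3])
```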
We trained the tokenizer model of SentencePiece on the training split of each dataset. We searched for the vocabulary size among 8K, 16K, 24K, and 32K, and selected 16K for Twitter(Ja) and Twitter(En), and 32K for Weibo(Zh), SNLI, and Genre&Rating. We also used the vocabulary obtained by SentencePiece as the initial vocabulary of OpTok for each task, and we initialized the neural unigram language model of OpTok by training the probabilities of its tokens to minimize the KL divergence loss against the probabilities obtained by SentencePiece.
We then pre-trained the word embeddings with a bidirectional language model task on the training split of each dataset and fixed them during the training of the text classification. Because the optimal tokenization is unknown during pre-training, we trained the bidirectional language model with a tokenization sampled by SentencePiece at each training epoch. For Genre&Rating, both tasks used the same word embeddings pre-trained on the training split. We did not use any outside resources beyond the training split for pre-training.
We trained OpTok and the downstream model using a cross-entropy loss for the gold labels. We employed Adam (Kingma and Ba, 2014) to update the parameters with the default settings of PyTorch.
We set the smoothing hyperparameter α to 0.2 for both SentencePiece and OpTok, following Kudo (2018). For training our method, we set the size of the N-best tokenization to N = 3 and the size of the restricted vocabulary |V'| to half of the initial vocabulary size. At inference time, we used the 1-best tokenization and the top-|V'| tokens of the vocabulary based on the language model. We conducted the experiments five times from random initializations, except for the pre-trained parameters, and report the averaged F1 scores. The maximum number of training epochs was 20, and for each trial we selected the model with the highest performance on the validation split and evaluated it on the test split.

Table 2 shows the performance of the downstream models using OpTok and SentencePiece. For SentencePiece, we report the results when the vocabulary size is set to the restricted and the initial vocabulary size of OpTok (SentencePiece and SentencePiece ×2, respectively).

Results
The experimental results demonstrate that the proposed method improves the performance of text classification for each language and each task. The performance of OpTok was higher than that of the method trained with SentencePiece for both vocabulary sizes. These results show that OpTok is superior to SentencePiece on the downstream tasks in our experiments.
The results on SNLI show that we can apply OpTok to tasks whose input consists of multiple sentences. Moreover, OpTok has a positive effect not only on informal texts (sentiment analysis and Genre&Rating in our experiments) but also on formal texts (SNLI).
The proposed method uses only half of the initial vocabulary at inference time. This fact supports the idea that OpTok contributes to vocabulary reduction by selecting useful tokens.

Improvement Only by Tokenization
It is still unclear whether the optimized tokenization leads to the improvement described in Section 3.3 because we trained all components simultaneously. Thus, we investigate whether the optimized tokenization itself contributes to the improvement in performance on the downstream task. To validate the effect of tokenization alone, we trained only the neural unigram language model in OpTok; in other words, we fixed the neural encoder in OpTok and the downstream model at their random initializations. We then checked the improvement in the training loss and the F1 score on the validation split while updating only the parameters of the neural unigram language model used for tokenization.
We conducted experiments on Twitter(Ja) under the same settings as described in Section 3 and report the results in Figure 3, which shows the difference in the training loss and the validation F1 score from their values at the beginning of training. The figure indicates that the training loss decreases over the training epochs, whereas the validation F1 score increases. These results indicate that OpTok explored a tokenization that improved the task performance, and imply that the optimized tokenization contributed to the overall improvement reported in Section 3.3.

Table 3: Token ranking based on the positive difference in probabilities between the initial and learned language models of OpTok on the genre and rating prediction tasks.

Task Oriented Tokenization
We are also interested in whether the optimized tokenizations differ from each other when we address different downstream tasks. To confirm this, we analyzed the results of the Genre&Rating prediction described in Section 3; this dataset ties two different tasks to the same review corpus. Table 3 shows the ranking of tokens whose probability rises the most from its initial value on the genre and rating prediction tasks. The optimized neural unigram language model assigned higher scores to tokens that are useful for each task, e.g., zombie for Genre and bad for Rating. This result demonstrates that OpTok optimizes the tokenization to use helpful tokens frequently. Note that the differences in probability are large for the reason mentioned in Section 2.5.
Table 4 shows an example of optimized tokenization extracted from the validation split that illustrates the difference in tokenization across tasks. In the tokenization optimized for the genre prediction task, the model cuts off the inflection of book-s to generalize the token book for predicting the proper genre, whereas the model optimized for rating prediction separates the derivation of interest-ing so that similar tokens such as interested and interests are recognized in the same way, as tokens useful for rating detection. In addition, the model for genre prediction does not split interesting, and the model for rating prediction does not split books. This example shows that OpTok can optimize the tokenization for the downstream task.

Effect of Hyperparameters
In this paper, we introduce two hyperparameters to control OpTok: the size of the restricted vocabulary and N for the N-best tokenization. We report the effect of these hyperparameters on the performance of sentiment analysis.

Figure 4 reports the effect of the size of the restricted vocabulary in each language. We checked the performance of our method when the vocabulary size is reduced to 25%, 50% (the default setting used in the other experiments), 75%, and 100% of the initial size. In the figure, we show the difference in the averaged F1 scores over five trials from the scores reported in Table 2. As shown in the figure, restricting the vocabulary size to 50% improves the performance on the Japanese and English datasets. These results validate that the vocabulary restriction works well for the proposed method. Meanwhile, decreasing the vocabulary size proportionally degrades the performance on the Chinese dataset. In fact, the average best performance achieved with the full vocabulary (100% of 32K) was 93.14, which is higher than the score of OpTok shown in Table 2 by 0.32%. This result suggests that decreasing the vocabulary size is unnecessary for languages with a vast set of character types because such a restriction discards useful tokens and produces many unknown tokens in both training and evaluation, as reported by Hiraoka et al. (2019).

Figure 5 shows the effect of N on the performance of sentiment analysis. For all languages, N = 3 achieves the best performance, whereas increasing N decreases the performance. We consider the reason for this decline to be the gap in encoding strategies between training and evaluation: with a larger N, the task-specific module, such as the MLP for text classification, is trained on the weighted sum of various tokenizations, whereas at inference it takes a sentence representation encoded with only the best tokenization.

Application for BERT
Numerous studies have recently focused on exploiting pre-trained language models, such as BERT (Devlin et al., 2018), to enhance NLP tasks. In this subsection, we demonstrate that OpTok is applicable to recent NLP modules based on BERT through an experiment on Twitter(En).

We replaced the BiLSTM encoder with BERT and conducted the same experiments as in Section 3. We employed BERT_base from HuggingFace (https://github.com/huggingface/transformers) and fine-tuned its parameters, except those of the word embeddings, as in the above experiments. Because the distributed tokenizer for BERT_base is based on WordPiece, which does not include probabilities for each piece, we estimated the probabilities on the training split using the EM algorithm (Deligne and Bimbot, 1995; Liang and Klein, 2009; Kudo and Richardson, 2018) and initialized the language model of our method with these probabilities. We did not use a restricted vocabulary because the vocabulary of BERT contains many tokens unrelated to our experiment: compared to a vocabulary initialized using SentencePiece on only the training split, restricting the vocabulary yields too little diversity in the N-best tokenizations and causes overfitting of the tokenization. We therefore found it unnecessary to restrict the vocabulary, similar to the Chinese dataset discussed in Section 4.3. We fine-tuned the parameters of BERT_base using AdamW (Loshchilov and Hutter, 2017) while updating the neural unigram language model in OpTok with Adam.
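A sketch of the optimizer setup for this configuration: the BERT parameters (with word embeddings frozen) are updated with AdamW, and the unigram language model with Adam; the model identifier, the stand-in language model, and the learning rates are illustrative assumptions rather than the exact values used in our experiments.

```python
# Two-optimizer setup: AdamW for BERT fine-tuning, Adam for OpTok's unigram LM.
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
bert.embeddings.word_embeddings.weight.requires_grad = False   # keep word embeddings fixed

unigram_lm = torch.nn.Sequential(                              # stand-in for OpTok's LM
    torch.nn.Linear(768, 768), torch.nn.Tanh(), torch.nn.Linear(768, 1)
)

bert_params = [p for p in bert.parameters() if p.requires_grad]
optim_bert = torch.optim.AdamW(bert_params, lr=2e-5)           # fine-tune BERT
optim_lm = torch.optim.Adam(unigram_lm.parameters(), lr=1e-3)  # update the tokenizer LM
```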
Table 5 shows the results of this experiment. We tokenized the corpus using the longest-first matching of the WordPiece vocabulary of BERT_base. We trained the model labeled +Sampling tokenization with stochastic tokenization, as in SentencePiece, based on the language model initialized with the EM algorithm.

The results show that the pre-trained BERT improves the performance compared to the scores in Table 2. In addition, incorporating OpTok into BERT outperforms both the original BERT system and the method using sampled tokenization. This result indicates that OpTok contributes to improving a popular NLP architecture by optimizing its tokenization.

Related Work
Numerous studies have aimed to improve NLP tasks from the perspective of tokenization. Some studies have proposed approaches that prevent segmentation errors by jointly encoding multiple tokenizations. Recent studies investigated the Lattice LSTM, which extends the LSTM to take multiple segmentations as a lattice (Chen et al., 2017). Li et al. (2020) followed this line of work using a Transformer.
Subword regularization is a well-known solution to this problem (Kudo, 2018; Kudo and Richardson, 2018). The authors demonstrated that training models with various tokenizations contributes to improved performance in machine translation. Provilkov et al. (2019) followed this approach by dropping tokens during the BPE process to vary the tokenization, and Hiraoka et al. (2019) by updating the language model during training.
Optimization of the tokenization has attracted attention mainly in the field of machine translation. Some studies have attempted to optimize the tokenization using simple criteria for machine translation (Xu et al., 2008; Chung and Gildea, 2009; Nguyen et al., 2010; Mermer et al., 2013). Recent studies have also tackled this issue for generation tasks. Salesky et al. (2020) developed Incremental BPE, which automatically determines the number of BPE merge operations for neural machine translation. He et al. (2020) proposed a neural architecture to find a better subword sequence from both the source and target corpora, enhancing the study of Chan et al. (2016). Our work differs from the existing research in that the proposed method is applicable to various neural encoders, and we can optimize the tokenization directly using only backpropagation from the training loss of the downstream task, without any hand-crafted criteria.

Conclusion
In this paper, we proposed OpTok, which explores an appropriate tokenization for the target downstream task. We combine OpTok with the downstream model and train them simultaneously. The experimental results indicate that OpTok improves the performance of several downstream tasks through better tokenization. Moreover, OpTok also has a positive effect on pre-trained contextualized word embeddings such as BERT.