Domain Adaptation of Thai Word Segmentation Models Using Stacked Ensemble

Like many Natural Language Processing tasks, Thai word segmentation is domain-dependent. Researchers have been relying on transfer learning to adapt an existing model to a new domain. However, this approach is inapplicable to cases where we can interact with only the input and output layers of the models, also known as “black boxes”. We propose a filter-and-refine solution based on the stacked-ensemble learning paradigm to address this black-box limitation. We conducted extensive experimental studies comparing our method against state-of-the-art models and transfer learning. Experimental results show that our proposed solution is an effective domain adaptation method and performs similarly to the transfer learning method.


Introduction
Word Segmentation (WS) is an essential process for several Natural Language Processing (NLP) tasks such as Part-of-Speech (PoS) tagging and Machine Translation (MT). The accuracy of WS significantly affects the accuracy of these NLP tasks, as shown in experimental results from Nguyen et al. and Chang et al. While WS is considered relatively simple in English, it is still an open problem in languages without explicitly defined word delimiters, such as Thai, Chinese, and Japanese. However, unlike Chinese and Japanese, Thai WS has not received much research attention. There are only six notable publications (Chormai et al., 2019; Nararatwong et al., 2018; Noyunsan et al.; Thanadechteemapat and Fung; Tongtep and Theeramunkong) on Thai WS in the past ten years. On the other hand, there are at least eight papers from well-established conferences on Chinese and Japanese WS (Li et al., 2019; Aguirre and Aguiar, 2019; Ma et al., 2018; Gong et al., 2017; Chen et al., 2017; Zhou et al., 2017; Cai et al., 2017) within only the last two years. This investigation focuses on the segmentation of Thai words since it is a challenging problem with considerable room for improvement, especially in the area of domain adaptation.
Like many NLP tasks, Thai WS is domain-dependent. For instance, Chormai et al. (2019) recorded an accuracy drop from 91% to 81% when their model trained on a generic domain corpus (Kosawat et al., 2009) was tested on a social media one (bact' et al., 2019). Results from our analysis (Section 3) also conform to these findings.
One way to solve the domain dependency problem is through Transfer Learning (TL), which is a common technique in domain adaptation (Schuster et al.; Chang et al.). However, TL may not be applicable when working with a commercial API or a model that does not support weight adjustments (Chormai et al., 2019; Chuang, 2019; Ikeda, 2018). We call this type of model a black box.
In this paper, we propose a stacked-ensemble learning solution to overcome the black-box limitation. Instead of making changes to the existing model directly, we build a separate model to improve the accuracy of predictions made by the black box. Our solution comprises two parts, Domain-Generic (DG) and Domain-Specific (DS). The pretrained black box handles the Domain-Generic part, and a new model is constructed to handle the Domain-Specific part. All samples go through Domain-Generic, which makes initial predictions. We rank all predictions according to uncertainty and send the top-k most uncertain predictions to Domain-Specific for further consideration. We combine the predictions from Domain-Specific with the remaining predictions from Domain-Generic to form the final predictive results.
We conducted extensive experimental studies to assess our solution's performance against a baseline model and transfer learning solutions. We also applied our Stacked-Ensemble Filter-and-Refine (SEFR) technique to Chinese and Japanese. Experimental results showed that our proposed solution achieved an accuracy level comparable to that of transfer learning solutions in Thai. For Chinese and Japanese, we showed that model adaptation using the SEFR technique could improve the performance of black-box models when used in a cross-domain setting.
Our contributions are as follows. First, we propose a novel solution for adapting a black-box model to a new domain by formulating the problem as an ensemble learning one. Second, we derive a filter-and-refine method to speed up the inference process without sacrificing accuracy in some cases. Third, we conducted extensive experimental studies; experimental results validate the effectiveness of our solution. Fourth, we make our code available at: github.com/mrpeerat/SEFR_CUT

Stacked-Ensemble Method

Pipeline Structure
Figure 1 displays the pipeline structure of the proposed SEFR method, which consists of a Domain-Generic (DG) black box, uncertainty filtering, and a Domain-Specific (DS) model. Each character enters the pipeline through the Domain-Generic black box, which gives a softmax or logistic score from the Domain-Generic model as output. We then use this output to calculate the uncertainty score. Uncertainty values are used to rank and filter samples that need reexamination by the Domain-Specific model. We then merge the results from Domain-Specific with the direct answers from Domain-Generic to form the final answers.
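The filter-and-refine flow described above can be illustrated with a minimal sketch. It is not the released SEFR implementation: the function names, the 0.5 decision threshold, and the `refine` callable standing in for the Domain-Specific model are all assumptions made for illustration, and the black box is assumed to return one boundary probability per character.

```python
import math
from typing import Callable, List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a two-class prediction with boundary probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def filter_and_refine(
    dg_probs: Sequence[float],
    refine: Callable[[List[int]], List[int]],
    top_k_percent: float,
) -> List[int]:
    """Merge Domain-Generic labels with refined labels for uncertain characters.

    dg_probs      -- boundary probability per character from the black box
    refine        -- hypothetical stand-in for the Domain-Specific model: takes
                     the selected character indices, returns one 0/1 label each
    top_k_percent -- share of characters (0 to 100) sent for refinement
    """
    n = len(dg_probs)
    # Assumed decision rule: a probability of at least 0.5 means "boundary".
    labels = [1 if p >= 0.5 else 0 for p in dg_probs]
    k = min(n, math.ceil(n * top_k_percent / 100.0))
    # Rank characters from most to least uncertain and keep the first k.
    ranked = sorted(range(n), key=lambda i: binary_entropy(dg_probs[i]), reverse=True)
    selected = sorted(ranked[:k])
    # Overwrite only the selected positions; the rest keep the black-box answer.
    for i, label in zip(selected, refine(selected)):
        labels[i] = label
    return labels


# Toy usage: the stand-in refiner marks every uncertain position as a boundary.
probs = [0.98, 0.55, 0.03, 0.48, 0.91]
print(filter_and_refine(probs, lambda idx: [1] * len(idx), top_k_percent=40))
# prints [1, 1, 0, 1, 1]
```

In this toy run only the two highest-entropy positions (probabilities 0.55 and 0.48) are passed to the refiner, so the confident black-box answers are left untouched.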

Pipeline Implementation
In this subsection, we consider how to implement the pipeline in Figure 1 effectively. An effective filter-and-refine pipeline should have the following properties. First, before the filter, there is a general decision maker that can make most decisions reasonably well. Second, the filter should be able to separate out decisions not requiring further consideration. Third, after the filter, there is a decision maker that can make the remaining decisions better than the general decision maker. Using filter and refine can help reduce the computation time and avoid unnecessary errors from the Domain-Specific model, which might not be as robust as the Domain-Generic model.
Before-Filtering Model. In our pipeline, the general decision maker is the Domain-Generic black box. Specifically, we use a state-of-the-art pre-trained model (Rakpong Kittinaradorn, 2019; Chormai et al., 2019) constructed from a generic-domain corpus (Kosawat et al., 2009) to ensure the best possible performance in general cases.
Filtering. The prediction of an out-of-domain sample is likely to have an entropy higher than that of an in-domain one. Hence, we use the softmax entropy to separate the results from Domain-Generic into two groups: (i) high-uncertainty predictions that need further consideration from Domain-Specific; (ii) low-uncertainty predictions whose results we keep unchanged. The exact cutoff point can be fine-tuned as a hyperparameter.
After-Filtering Model. As stated earlier, the model placed after filtering should perform a certain task better than the one placed before filtering, which is domain specificity in this case. Hence, we use a Domain-Specific model trained with target-domain data to refine the uncertain predictions made by Domain-Generic. In theory, a Domain-Specific model can be constructed using any learning method. However, a DNN-based method may be inapplicable in a data-poor setting, which is the case in this investigation. As a result, we focus on classical learning methods that historically provide good results in WS problems, such as Logistic Regression (LR), Support Vector Machine (SVM), and Conditional Random Field (CRF). Figure 2 shows performance evaluation results from different Domain-Specific implementations. As can be seen, CRF gave the best performance in comparison to the other models.
After-Filtering Input Features. We consider the following four sets of features. First, we use n-gram windows to capture the context. Second, we use the dictionary index to identify whether each character can start a word (Horsuwan et al.). The next two feature sets are meta features obtained from the Domain-Generic model, i.e., the softmax output and the softmax entropy.
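As a rough illustration of the four feature sets, the sketch below builds one feature dictionary per character. The feature names, the window size, the maximum dictionary word length, and the toy Latin-script input are assumptions chosen for readability; the actual feature templates used in the experiments may differ.

```python
import math
from typing import Dict, Sequence, Set, Union

Feature = Union[str, float, bool]


def entropy(p: float) -> float:
    """Entropy (in nats) of a two-class prediction with boundary probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def char_features(
    text: str,
    i: int,
    dg_probs: Sequence[float],
    dictionary: Set[str],
    window: int = 2,        # assumed context size
    max_word_len: int = 10,  # assumed longest dictionary word to test
) -> Dict[str, Feature]:
    """Build illustrative Domain-Specific features for the character at index i."""
    feats: Dict[str, Feature] = {
        # Meta features taken from the Domain-Generic black box.
        "dg_prob": dg_probs[i],
        "dg_entropy": entropy(dg_probs[i]),
    }
    # Context window: neighbouring characters, padded at the text edges.
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"char[{offset:+d}]"] = text[j] if 0 <= j < len(text) else "<pad>"
    # Dictionary feature: does any dictionary word start at this character?
    feats["dict_start"] = any(
        text[i:i + length] in dictionary
        for length in range(1, max_word_len + 1)
        if i + length <= len(text)
    )
    return feats


# Toy usage with a Latin-script string instead of Thai, for readability.
toy_probs = [0.9, 0.1, 0.1, 0.5, 0.1, 0.1]
print(char_features("thecat", 3, toy_probs, {"the", "cat"}))
```

Dictionaries of this shape are the usual input format for CRF toolkits, which is why the sketch returns a mapping rather than a fixed-length vector.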

Performance Evaluation
We evaluated our SEFR solution against state-of-the-art models on nine benchmark corpora from three languages. Specifically, we studied the effect of our SEFR method and report its performance when adapting a black-box model to a new domain, with the problem formulated as an ensemble learning one.

Performance Evaluation on Thai
Competitive Methods. Two state-of-the-art models for Thai WS were chosen as our competitive methods, i.e., DeepCut (Rakpong Kittinaradorn, 2019) and AttaCut-SC (Chormai et al., 2019). Both are deep learning models based on the Convolutional Neural Network (CNN). We also created two SEFR solutions using DeepCut and AttaCut-SC as the Domain-Generic model, and we called them SE+DeepCut and SE+AttaCut-SC, respectively. As domain-adaptation baselines, we applied transfer learning to DeepCut and AttaCut-SC and called them TL-DeepCut and TL-AttaCut-SC, respectively. We note that the authors of DeepCut provided the weights trained on the BEST corpus. We used the same architecture and parameter settings to update these weights on Wisesight and TNHC (Table 1). AttaCut-SC does not provide weights to perform TL and requires retraining of the model. We trained the AttaCut-SC model using the BEST-2010 corpus (Kosawat et al., 2009) to obtain the best training weights to perform TL, where 90% of the data was used for training. We compared our method with a model pre-trained on BEST-2010 and then transferred to the target task.
Experimental Studies. Our Method vs Thai Competitive Methods. In this part of the paper, we compared our SEFR method against the state-of-the-art Thai WS methods on WS160 and TNHC. The experimental results given in Tables 3 and 4 show that for both corpora, i.e., WS160 and TNHC, our method (SE+DeepCut) outperformed the state-of-the-art DeepCut in the domain adaptation experiment. Moreover, SE+AttaCut-SC outperformed AttaCut-SC and TL-AttaCut-SC for all corpora. In particular, for the TNHC corpus, SE+DeepCut performed better than DeepCut by 1.7% and 0.3% at the character and word levels, i.e., char F1 and word F1, respectively. SE+AttaCut-SC outperformed AttaCut-SC, and the difference was 13.9% and 22.2% at the character and word levels, respectively.
The F1 score gap between SE+DeepCut and TL-DeepCut was about 1.1% at the character level and 10.5% at the word level for both corpora. However, for AttaCut-SC, SE+AttaCut-SC performed better than TL in every corpus, averaging about 12.9% at the character level and 12.7% at the word level. This result showed that our method could provide reasonable performance despite the low-accuracy predictions provided by the base model.
Effect of Top-k Percentage Entropy Selection. Figure 2 shows the effect of top-k percentage entropy selection on test sets of WS160 and TNHC using DeepCut as the Domain-Generic model. As expected of a filter-and-refine method, recall improves as the k value increases. Most of the recall improvements come from the lower range of k values, showing the effectiveness of the entropy-based filtering. For WS160, the F1 peaks at k = 100 because precision also keeps increasing at every k value. In this case, our filter-and-refine method can be viewed as a re-scoring method. This is due to the effectiveness of the CRF classifier (Figures 2c and 2f). As shown in Figure 2, unlike CRF, increasing the k value past a certain threshold negatively affects the performance of the SVM (Figures 2a and 2d) and LR (Figures 2b and 2e) models due to their weaker performance. Removing certain input features from the CRF model, such as the softmax output, also decreases the overall performance. For the TNHC dataset, the best k value is around 30%, showing the importance of filtering to reduce the potential candidates.

Evaluation on Chinese and Japanese
Chinese Word Segmentation (CWS). In this experiment, we used the existing CWS model called PyWordSeg (Chuang, 2019) with character-level ELMo embedding. Normally, CWS categorizes characters into four classes: (i) beginning (B), (ii) internal (I), (iii) ending (E), and (iv) single-word (S) (Li et al., 2019). However, PyWordSeg classifies each character as a boundary or non-boundary character, which is similar to Thai WS, so we used the same features as in Thai WS in this experiment. We performed the experiment on the SIGHAN 2005 dataset (see Table 2) as a single-corpus experiment rather than a domain-switch experiment as in Thai. However, PyWordSeg did not provide a probabilistic prediction that can be used to measure the uncertainty score on English and Pinyin characters. Therefore, we left out sentences containing such characters from the evaluation. Note that the released PyWordSeg code does not lend itself to straightforward transfer learning experiments, exemplifying our use case of treating models as black boxes.
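To make the relation between the four-class scheme and the boundary/non-boundary scheme concrete, the following sketch maps one to the other. It only illustrates how the two labelings correspond and is not a description of the preprocessing used in the experiments; the convention that label 1 marks a word-initial character is an assumption, and the helper names are hypothetical.

```python
from typing import List, Sequence

# "I" and "M" both denote a word-internal character in the two tag sets.
VALID_TAGS = {"B", "I", "M", "E", "S"}


def tags_to_boundaries(tags: Sequence[str]) -> List[int]:
    """Map four-class tags to binary labels (1 = character starts a word)."""
    unknown = set(tags) - VALID_TAGS
    if unknown:
        raise ValueError(f"unexpected tags: {sorted(unknown)}")
    return [1 if tag in ("B", "S") else 0 for tag in tags]


def boundaries_to_words(text: str, boundaries: Sequence[int]) -> List[str]:
    """Rebuild the word list from binary word-start labels."""
    words: List[str] = []
    for char, is_start in zip(text, boundaries):
        if is_start or not words:
            words.append(char)   # open a new word
        else:
            words[-1] += char    # extend the current word
    return words


# Toy usage on a placeholder string.
labels = tags_to_boundaries(["B", "E", "S", "B", "I", "E"])
print(labels)                                  # prints [1, 0, 1, 1, 0, 0]
print(boundaries_to_words("abcdef", labels))   # prints ['ab', 'c', 'def']
```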
For the CWS task, we evaluated our method on four corpora, namely AS, CityU, PKU, and MSR, using the character-level F1. The results shown in Table 5 indicate that our method outperforms the competitors.
Japanese Word Segmentation (JWS). In this experiment, we performed JWS using Nagisa (Ikeda, 2018), trained on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al.). This model categorizes characters into four classes: (i) beginning (B), (ii) middle (M), (iii) ending (E), and (iv) single-word (S) (Kitagawa and Komachi, 2018).

We performed experiments using the Universal Dependencies 2.5 dataset (Asahara et al., 2018) on the GSD, Modern, and PUD subsets (see Table 2). The results given in Table 6 indicate that our method outperformed the baseline JWS model on all corpora. Specifically, our method yields performance improvements of 3% on GSD, 4.7% on Modern, and 11.5% on PUD in terms of the character-level F1.

Conclusion
We proposed a novel solution for adapting a black-box model to a new domain by formulating it as an ensemble learning problem. We conducted extensive experimental studies using nine benchmark corpora from three languages. For Thai Word Segmentation, the results showed that our method is an effective domain adaptation method and performs similarly to the transfer learning method. The results from Japanese and Chinese Word Segmentation experiments showed that our method could improve the performance of Japanese and Chinese black-box models.

A.1 Additional experimental details and results
Experimental environment. The experiments were conducted on an Intel Core i9-9900X CPU @ 3.50GHz running CentOS 7 with one Nvidia GeForce RTX 2080 Ti and 62 GB RAM. All the methods were implemented in Python, and their performance and running time are provided in Table 7.
Evaluation measures. We measured the F1 scores at both the character and word levels. For the character level, we used scikit-learn (precision_recall_fscore_support) with binary average. For the word level, we applied the same practice as AttaCut (Chormai et al., 2019) in measuring the recall and precision. The performance comparison of our method on the BEST corpus with DeepCut and AttaCut-SC is displayed in Table 10.
Ablation study: Feature. We also measured how different feature types affect the performance of our proposed solution, SE+DeepCut. Results are shown in Table 9.
As shown in Figure 3, DeepCut was trained on the BEST corpus, where the annotation rules grouped words into compound words, resulting in a model that produced large word chunks. However, with few English words in the corpus and no hashtag segmentation samples to train the model, DeepCut failed to segment English words correctly. On the other hand, the SE+DeepCut method was trained on Wisesight (training set); therefore, our method tends to split a compound word into multiple single words and handles English words and hashtags better than DeepCut. Thus, more social media data with a good annotation guideline are needed to support these domain characteristics. We also include random examples from Thai, Chinese, and Japanese; the results are given in Figure 4.
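The evaluation measures described in this appendix can be sketched in plain Python as follows. The character-level function is intended to agree with binary-average precision, recall, and F1 on 0/1 word-start labels; it is written without scikit-learn only to keep the example dependency-free. The word-level function assumes that a predicted word counts as correct when both of its boundaries match the reference, which is one common definition; whether it matches the AttaCut procedure in every detail is an assumption. All function names are hypothetical.

```python
from typing import Sequence, Set, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F1)


def prf(true_pos: int, n_pred: int, n_gold: int) -> Scores:
    """Precision, recall, and F1 from counts, returning 0.0 on empty denominators."""
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def char_level_scores(gold: Sequence[int], pred: Sequence[int]) -> Scores:
    """Score word-start labels with 1 as the positive class."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    return prf(true_pos, sum(pred), sum(gold))


def word_spans(boundaries: Sequence[int]) -> Set[Tuple[int, int]]:
    """Turn word-start labels into a set of (start, end) character spans."""
    if not boundaries:
        return set()
    starts = [i for i, b in enumerate(boundaries) if b == 1]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # the first character always opens a word
    ends = starts[1:] + [len(boundaries)]
    return set(zip(starts, ends))


def word_level_scores(gold: Sequence[int], pred: Sequence[int]) -> Scores:
    """Score words: a word is correct only if both boundaries match (assumed)."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    return prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


# Toy usage: the prediction misses one word start.
gold_labels = [1, 0, 1, 0, 0, 1]
pred_labels = [1, 0, 0, 0, 0, 1]
print(char_level_scores(gold_labels, pred_labels))
print(word_level_scores(gold_labels, pred_labels))
```

In the toy example a single missed boundary costs one character-level true positive but invalidates two reference words, which is why word-level scores are typically lower than character-level ones.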