Syllable-based Neural Thai Word Segmentation

Word segmentation is a challenging pre-processing step for Thai Natural Language Processing due to the lack of explicit word boundaries. Previous systems rely on powerful neural network architectures alone and ignore the linguistic substructures of Thai words. We utilize the linguistic observation that Thai strings can be segmented into syllables, which should narrow down the search space for word boundaries and provide helpful features. Here, we propose a neural Thai word segmenter that uses syllable embeddings to capture linguistic constraints and dilated CNN filters to capture the environment of each character. Within this goal, we develop the first ML-based Thai orthographical syllable segmenter, which yields syllable embeddings to be used as features by the word segmenter. Our word segmentation system outperforms the previous state-of-the-art system in both speed and accuracy on both in-domain and out-of-domain datasets.


Introduction
Word segmentation presents a fundamental challenge for Thai language processing. Many downstream natural language processing tasks require that texts be broken into a sequence of words before applying any models. For example, bag-of-words models and RNN models usually require that the text be represented as a sequence of words. Notable examples of writing systems without clear word boundaries are Japanese, Chinese, and Khmer, which often require automatic word segmentation. However, the Thai script differs from Chinese and Japanese in that a Thai character represents just a consonant or a vowel, whereas a Chinese character represents a whole syllable. The Thai script uses 44 consonant symbols, 15 vowel symbols, and 4 tonal symbols. A word is composed of one or more syllables, and each syllable is formed by a set of intricate orthographical rules for valid sequences of Thai alphabet symbols. Hence, it is not straightforward to extend techniques from Chinese or Japanese to the Thai word segmentation problem.
Thai word segmentation is complicated by linguistic ambiguity and out-of-vocabulary cases. A word can be formed by juxtaposing two "words", e.g. เห็นชอบ (/hěn tɕʰɔ̂ːp/, approve) = เห็น (/hěn/, see) + ชอบ (/tɕʰɔ̂ːp/, like). This kind of word formation can be detected with a simple dictionary lookup, but harder cases that require context abound in the language. For example, กอดอกไม้ can be segmented into กอ|ดอก|ไม้ (/kɔː dɔ̀ːk máj/, flower bush) or กอด|อก|ไม้ (/kɔ̀ːt ʔòk máj/, hugging a wooden chest), but the latter is rather nonsensical and statistically unlikely. The local context is needed to select the right segmentation; therefore, dictionaries provide only partial solutions to this problem. Further, a constant stream of new words and loanwords complicates the task of word segmentation because they never appear in dictionaries.
In recent years, a few open-sourced Thai word segmenters have been introduced and used widely in the industry. Notable examples include PyThaiNLP (Phatthiyaphaibun et al., 2016), Sertis (Sertis Co., Ltd., 2017), and DeepCut (Kittinaradorn et al., 2019), which claim good performance according to their own respective benchmarks. The accuracies of these word segmentation systems have not been benchmarked on the same datasets for a rigorous comparison within and across domains. In this paper, we reimplement some of these baselines for comparison. More importantly, these techniques formulate word segmentation as a character-sequence tagging problem and ignore the fact that characters form syllables as an intermediate step to word formation.
Our systems take advantage of the fact that every word can be parsed into orthographical syllables. A sequence of Thai characters can be thought of as a sequence of orthographical syllables, each of which consists of at least two consonant characters or a pair of a consonant and a vowel. We hypothesize that syllable segmentation should simplify the downstream task of word segmentation. We explore the use of syllables in two ways: we use syllables as features (embeddings) for the word segmentation model, and we also pose word segmentation as a syllable-tagging problem, as opposed to a character-tagging problem. We present a neural word segmentation model that utilizes character and syllable embeddings as the representation and achieves state-of-the-art results. We conclude that underlying linguistic structures (such as syllables) can serve as a suitable structure for neural architectures, yielding better performance as a result.
Our contributions can be summarized as follows:
• We propose a neural Thai word segmentation model that outperforms the previous state-of-the-art system in both in-domain and out-of-domain evaluations, at F1 scores of 0.95 and 0.86 respectively. Our model also operates 8.9× and 3.5× faster than the previous system on CPU and GPU instances respectively. Our code is publicly available (Appendix A).
• We develop the first ML-based Thai orthographical syllable segmenter, which achieves a 0.96 syllable-level F1 and a 0.99 character-level F1 score and allows very little error to propagate to the downstream word segmentation task.
• We show that syllable segmentation helps word segmentation both as a feature and as a structure for word segmentation. Our syllable segmenter is now part of the PyThaiNLP package, which is widely used by the Thai NLP community.

Problem Statement: Thai Word and Syllable Segmentation
Given a string of Thai text, a Thai word segmenter identifies the word boundaries within the string. This problem is formulated as a classification or sequence-tagging problem where we try to identify the characters that start a new word in the string. A Thai word is defined as the smallest lexical unit that still conveys meaning (Aroonmanakun, 2007). Most ambiguous and debatable cases revolve around the degree of compositionality of nouns and verbs. The word การบ้าน (/kaːn.bâːn/, homework) is one word, although การ (/kaːn/, nominalization morpheme) and บ้าน (/bâːn/, house) are also words in other contexts. The meaning of การบ้าน is not compositional from the two potential morphemes and therefore should be treated as one word. As a more uncertain example, it is arguable that the meaning of กอดอก (/kɔ̀ːt ʔòk/, to cross arms) is composed of กอด (/kɔ̀ːt/, to hug) and อก (/ʔòk/, chest) and therefore should not be treated as one word. Our dataset follows guidelines that favor segmentation that takes the degree of compositionality into account, not grammatical function changes such as nominalization or verbalization.
The syllable segmentation task is defined similarly, but we try to find syllable boundaries instead of word boundaries. It is noteworthy that word boundaries are always a subset of syllable boundaries because each syllable belongs to exactly one word. We use this fact as a basis for our model architecture. An orthographical syllable is defined as a substring in a word that can be pronounced as one or one and a half phonological syllables. For example, the word จินตนา can be segmented into two orthographical syllables: จิน (/tɕin/), which is also phonologically one syllable, and ตนา (/ta.nǎː/), which is phonologically two syllables. The mismatch between orthographical and phonological syllables is very common, and we believe the orthographical syllable is more tractable computationally because each syllable is guaranteed to be at least two characters long and each syllable can also be a word itself in some cases.

Related Works
Thai word segmentation shares some similarities with Chinese and Japanese word segmentation; the linguistic differences require different types of models, but some insights from their techniques inspire our approaches. Both Chinese and Japanese word segmentation are formulated as character-sequence tagging problems. BiLSTMs have been found to be very effective for Chinese (Ma et al., 2018) and Japanese (Kitagawa and Komachi, 2018). Interestingly, the length-sensitive BIES tagset proves to be a better tagset than the standard BI tagset (Nakagawa, 2004). The BIES tagset cannot be readily applied to Thai word segmentation since Thai words are at least two characters long. In this study, we automatically group characters into syllables first, so that such a length-sensitive tagset can be applied.
Previous approaches to Thai word segmentation fall into two categories: dictionary-based and learning-based. Poowarawan (1986) proposes the first dictionary-based method, using a greedy algorithm to decide when a word should be formed. Dictionary-based methods inevitably suffer from unseen words, and hence are harder to generalize to other domains. Sornlertlamvanich (1993) uses Maximal Matching to handle such unseen-word cases.
Past machine-learning word segmenters utilize subword features. Theeramunkong et al. (2000) propose Thai Character Clusters (TCCs). TCCs differ from the notion of syllables used in our work in that TCCs are based loosely on orthographical syllables and designed with information retrieval applications in mind. Using TCCs, Theeramunkong and Usanavasin (2001) develop a decision tree classifier to determine whether a word should be formed from TCCs, based on a predefined metric. Aroonmanakun (2002) presents a two-stage word segmentation that incorporates handcrafted syllable features with dynamic programming to form the most reasonable segmentation, while Bheganan et al. (2009) use a hidden Markov model to form words that are then verified with a dictionary.
Modern approaches to this problem are used by practitioners, but it is unclear which approach is the most accurate or fastest. Kittinaradorn et al. (2019) present DeepCut, a 1-dimensional CNN for Thai word segmentation. The purportedly state-of-the-art system (Sertis Co., Ltd., 2017) and Lapjaturapit et al. (2018) use a BiRNN to segment words with multiple possible segmentation candidates. These results are not systematically benchmarked or published, but we use these architectures as our baselines.
Figure 1: CNN-based word segmenter with character and syllable features using 3-level Iterated Dilated Convolutions (Strubell et al., 2017). Colors represent different embeddings. The word in the figure is กาลเวลา (/kaːn weː laː/, time).

Proposed Models
We propose two Thai word segmentation systems based on BiLSTM and CNN architectures. The BiLSTM-based system follows the standard formulation of a sequence-tagging BiLSTM model. For the CNN-based system, unlike DeepCut, which uses multiple convolutions on features, we employ the Iterated Dilated Convolutions proposed by Strubell et al. (2017). Iterated Dilated Convolutions are hierarchical dilated convolutional layers in which the dilation rate doubles at each layer. They allow the model to consume sufficient context with fewer parameters, and Strubell et al. (2017) demonstrate their advantages in prediction performance on entity recognition tasks and in computational efficiency. We use a stack of three levels with width three and dilation rates 1, 2, and 4; in total, it covers a context of seven characters on each side.
After the convolution layers or the BiLSTM layers, we apply two fully-connected layers that map the extracted features to tag probabilities. Figure 1 depicts the convolution architecture in detail.
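To make the architecture concrete, the following is a minimal PyTorch sketch of the IDCNN tagger described above; the embedding dimension, channel sizes, and module names are illustrative placeholders rather than our exact configuration.

```python
import torch
import torch.nn as nn

class IDCNNTagger(nn.Module):
    """Three stacked 1-D convolutions with kernel width 3 and dilations 1, 2, 4
    (covering seven characters of context on each side), followed by two
    fully-connected layers that map features to tag scores."""
    def __init__(self, emb_dim=64, channels=128, num_tags=2):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(emb_dim if i == 0 else channels, channels,
                      kernel_size=3, dilation=d, padding=d)
            for i, d in enumerate([1, 2, 4])
        ])
        self.classifier = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, num_tags)
        )

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        h = x.transpose(1, 2)             # -> (batch, emb_dim, seq_len)
        for conv in self.convs:
            h = torch.relu(conv(h))
        return self.classifier(h.transpose(1, 2))   # per-position tag scores
```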
Our systems use syllables as additional features and structures for the model. Both systems use character embeddings and syllable embeddings as features. For a given character, we first find the syllable that contains the character and then look up the syllable embedding. Syllable embeddings should provide higher-level information than a character type or a character embedding can afford. As an additional experiment, we reformulate the task as a syllable-sequence tagging problem instead of a character-sequence tagging one. In this formulation, we use syllable embeddings only, for both the BiLSTM-based and CNN-based models.
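As a sketch of this feature construction (with illustrative vocabulary sizes and dimensions, not our actual hyperparameters), a per-character input vector can be formed by concatenating the character embedding with the embedding of its enclosing syllable:

```python
import torch
import torch.nn as nn

class CharSyllableFeatures(nn.Module):
    """Concatenate each character's embedding with the embedding of the syllable
    that contains it."""
    def __init__(self, n_chars, n_syllables, char_dim=32, syl_dim=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.syl_emb = nn.Embedding(n_syllables, syl_dim)

    def forward(self, char_ids, syllable_ids):
        # syllable_ids[i] is the id of the syllable containing character i, so
        # consecutive characters of the same syllable share one syllable embedding.
        return torch.cat([self.char_emb(char_ids), self.syl_emb(syllable_ids)], dim=-1)

# Example: "กาลเวลา" segmented into syllables ["กาล", "เว", "ลา"]; every character
# is mapped to the index of its enclosing syllable.
syllables = ["กาล", "เว", "ลา"]
syllable_of_char = [i for i, s in enumerate(syllables) for _ in s]   # [0, 0, 0, 1, 1, 2, 2]
```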
We use new tagsets to take advantage of the fact that words of different lengths are distributed differently. The standard tagset uses binary tags: B for the beginning of a new word and I for the inside of a word. We explore a length-sensitive tagset with BI-short, BI-mid, and BI-long tags for words that have 1-2 syllables, 3-4 syllables, and 5+ syllables respectively, which we call Scheme A. We also experiment with another length-sensitive tagset, BI-k, where k ∈ {1, 2, 3, 4+} is the length of the word in number of syllables, which we call Scheme B.
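The tag assignment under these schemes can be sketched as follows; `word_tags` is a hypothetical helper, and the number of tagging units is the number of characters for character-level models or the number of syllables for syllable-level ones.

```python
def word_tags(num_units, num_syllables, scheme="BI"):
    """Tags for one word. Scheme A buckets the word length (in syllables) into
    short (1-2), mid (3-4), and long (5+); Scheme B appends the exact length,
    capped at 4+."""
    if scheme == "BI":
        suffix = ""
    elif scheme == "A":
        suffix = "-short" if num_syllables <= 2 else "-mid" if num_syllables <= 4 else "-long"
    elif scheme == "B":
        suffix = f"-{num_syllables}" if num_syllables < 4 else "-4+"
    else:
        raise ValueError(scheme)
    return ["B" + suffix] + ["I" + suffix] * (num_units - 1)

# A three-syllable word tagged at the syllable level:
# word_tags(3, 3, "BI") -> ['B', 'I', 'I']
# word_tags(3, 3, "A")  -> ['B-mid', 'I-mid', 'I-mid']
# word_tags(3, 3, "B")  -> ['B-3', 'I-3', 'I-3']
```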
All of the proposed models require an automatic syllable segmenter as a preprocessing step for the word segmenters. Thai syllables are mostly bound by rules, but ambiguity and typos are natural occurrences in real data. To the best of our knowledge, there is no previous work on Thai syllable segmentation, and syllables have never been used as features for ML-based word segmentation systems. We propose a Conditional Random Field (CRF) syllable segmenter, primarily to be used as a preprocessing step for the word segmenter. We use template-based n-gram features to capture the local context of the sequence.

Evaluation Metrics
Figure 2: Character-level vs. chunk-level evaluation, marking word-starting characters and correctly and wrongly tokenized words.
Evaluating the quality of word segmenters is typically done on a character-level (CL) basis. The standard metrics are precision, recall, and F1 of word-starting characters. However, intuitively, when a token is tokenized wrongly, it consequentially affects the tokenization of the following tokens. Thus, measuring only character-level metrics would overestimate the performance of syllable or word tokenizers. Therefore, in this work, we also measure chunk-level metrics at the syllable level (SL) or word level (WL). SL and WL F1 are calculated based on the number of correct syllables and words respectively, rather than characters. Figure 2 illustrates the character-level and chunk-level measures.
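One way to compute the chunk-level scores is to compare (start, end) spans rather than only word-starting characters; the sketch below is illustrative rather than our exact evaluation script.

```python
def chunk_spans(start_indices, length):
    """Turn sorted chunk-starting indices (the first is always 0) into (start, end) spans."""
    starts = list(start_indices) + [length]
    return {(starts[i], starts[i + 1]) for i in range(len(starts) - 1)}

def chunk_f1(gold_starts, pred_starts, length):
    """Chunk-level (SL or WL) precision, recall, and F1: a chunk counts as correct
    only if both its start and its end match the reference exactly."""
    gold, pred = chunk_spans(gold_starts, length), chunk_spans(pred_starts, length)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Character-level scores, in contrast, compare only the sets of starting indices.
```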

Experiment 1: Syllable Segmenter
We hypothesize that syllable segmentation is simple enough that we can treat it as a preprocessing step for word segmentation. We use the pycrfsuite implementation of CRFs (Peng and Korobov, 2018; Okazaki, 2007) and train the model on the Thai National Corpus (TNC) (Aroonmanakun et al., 2009). The TNC contains a subcorpus of 2.56M annotated syllables (around 8M characters). The dataset is split three ways for training, development, and testing using a 70:20:10 scheme. The training, development, and test sets contain 1.8M, 0.5M, and 0.25M syllables respectively. We hypothesize that CRFs are suitable for syllable segmentation because of their inclusion of sequential information. We test this hypothesis by comparing the CRF against a maximum entropy model (MaxEnt), trained using the scikit-learn implementation (Pedregosa et al., 2011). For both algorithms, we experiment with the following features, with N and window size W of 1 to 4: i) individual characters within W places on both sides of a potential boundary (Chr); ii) the two N-grams on both sides (ChrSpan); iii) N-gram features that include all N-grams within W places on both sides.
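The boundary features can be sketched as below; the helper and templates shown here are illustrative, and the exact templates used in our experiments differ.

```python
def boundary_features(text, i, w=3, n=3):
    """Features for the potential syllable boundary before position i."""
    feats = {"bias": 1.0}
    # Chr: individual characters within w places on both sides of the boundary.
    for k in range(1, w + 1):
        if i - k >= 0:
            feats[f"chr[-{k}]"] = text[i - k]
        if i + k - 1 < len(text):
            feats[f"chr[+{k}]"] = text[i + k - 1]
    # ChrSpan: the n-grams immediately to the left and right of the boundary.
    feats["left_ngram"] = text[max(0, i - n):i]
    feats["right_ngram"] = text[i:i + n]
    return feats

# One training sequence for pycrfsuite: a feature dict per character position,
# paired with a label indicating whether a new syllable starts at that position.
xseq = [boundary_features("ตากลม", i) for i in range(len("ตากลม"))]
```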

Experiment 2: Word Segmenter
Our training data is BEST2010 (NECTEC, 2010). Annotated with word boundaries and named entities, the corpus contains 415 Thai documents from four categories: news, articles, encyclopedias, and novels, accounting for 134,107 samples (split by line), around 5.11M words, and 18.74M characters. Balancing the distribution of categories, we take 10% of the training set as a validation split. We use the officially provided test set (named TEST_100K) for evaluation, which has 2,252 samples, about 128K words, and 496K characters.
Apart from the in-domain evaluation on the BEST2010 test set, we also perform cross-domain word segmentation evaluations on two other datasets: i) the Thai National Historical Corpus (TNHC) (Sawatphol and Rutherford, 2019), which contains 20,791 samples, around 599K words, and 2.14M characters of Thai classical literature with word boundaries annotated by humans, and ii) the Wisesight corpus (Wisesight (Thailand) Co., Ltd., 2019), which contains 26,700 social media messages. Because the Wisesight dataset does not have word boundary annotation, we randomly take 1,000 samples (with 7 spam messages removed) from the test split and manually segment them using the same word segmentation standard proposed by Aroonmanakun (2007). We call this Wisesight-1000: it has around 22K words and 75K characters; our annotation is available at Wisesight's repository. Appendix B summarizes the statistics of the datasets.
Our word segmenters come from two model families, namely BiLSTMs and CNNs with Iterated Dilated Convolutions (IDCNNs for short). These word segmenters come with the following variants:
• CRF: a CRF can be employed in the last step of prediction;
• Feature sets: we experiment with i) character and character-type embeddings and ii) syllable embeddings for character-sequence models (see Appendix C.4 for the details of our character-type and syllable dictionaries);
• Tagset: as discussed in Section 4, we experiment with three tagsets: BI, SchemeA, and SchemeB (the BI-k tagset).
We use the convention (Model-Family)[-CRF](Features)-(Tagset) to refer to the specific configuration of each model. We compare our word segmenters with the following baselines: i) PyThaiNLP (Phatthiyaphaibun et al., 2016) with its maximal matching engine (Sornlertlamvanich et al., 1997); ii) DeepCut (Kittinaradorn et al., 2019), a character-level CNN (Zhang et al., 2015) with a stack of 13 convolutional filters followed by pooling and fully-connected layers, comprising around 500K trainable parameters; and iii) BiLSTM(CH)-BI and IDCNN(CH)-BI, which are similar to the existing Thai word segmenters of Sertis Co., Ltd. (2017) and Kittinaradorn et al. (2019) respectively.
We use PyTorch (Paszke et al., 2019) to implement the word segmenters and train them using Adam (Kingma and Ba, 2015). For each model family, we perform 20 hyperparameter search trials using a random search strategy (Bergstra and Bengio, 2012). We use batch sizes of 32 and 128 for character-level and syllable-level models respectively. Detailed information about our training settings, the hyperparameter search space, and the best configuration of each model from the search can be found in Appendix D.

Syllable Segmentation Performance
Both MaxEnt and CRF achieve near-perfect performance (Table 1). This suggests that the output of syllable segmentation is suitable for the downstream task, as it allows very little error to propagate. Our top-performing models are CRF-based with character and trigram features (N = 3, W = {3, 4} for both feature types). We observe that syllable boundaries from the model with W = 4 align better with word boundaries in the BEST2010 validation set; hence, we use this syllable model in our word segmentation experiments.

Word Segmentation Performance
Models that use syllable embeddings outperform their counterparts that rely on character features alone (Table 2). The effects of syllable embeddings are more pronounced when we only consider out-of-vocabulary (OOV) cases (Table 4), where the models need to generalize rather than simply memorize words found in the training set. Syllable embeddings improve OOV recall from 59.58% to 67.42% for BiLSTM and from 51.92% to 64.09% for IDCNN. This further confirms that syllable embeddings help the models generalize to unseen words. We observed modest or inconsistent improvements from the CRF layers and the length-sensitive tagsets (Scheme A and Scheme B). This might be because BiLSTMs and CNNs already take the context of the sequence into account; the pattern behind these tagsets is likely to be found by the models without being given explicitly. This contradicts previous findings where the BIES tagset was used effectively for Chinese (Nakagawa, 2004).
Additionally, we used the method proposed by Dodge et al. (2019) to calculate the expected performance given a computational budget. Figure 3 shows the expected validation performance (WL F1) versus the computational budget, displayed as training duration on an Nvidia GTX 1080 (the average training time multiplied by the number of trials n), for different model configurations of BiLSTMs and IDCNNs. We see that models with syllable features have higher expected validation performance than models using character features only (BiLSTM(CH)-BI and IDCNN(CH)-BI). Hence, one needs a significantly smaller computational budget for hyperparameter optimization when incorporating syllable features.
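Concretely, the expected best validation score for a budget of n random trials can be estimated from the pool of completed trials as in Dodge et al. (2019); the scores in the example below are made up for illustration.

```python
def expected_max_performance(val_scores, n):
    """Expected maximum validation score when n configurations are drawn uniformly
    at random (with replacement) from the completed hyperparameter trials."""
    v = sorted(val_scores)
    big_n = len(v)
    return sum(
        v[i - 1] * ((i / big_n) ** n - ((i - 1) / big_n) ** n)
        for i in range(1, big_n + 1)
    )

# Expected WL F1 for a budget of 5 trials, given (made-up) scores of 20 finished trials.
scores = [0.90 + 0.003 * k for k in range(20)]
print(expected_max_performance(scores, n=5))
```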
Interestingly, when considering training duration budgets below seven hours, Figure 3 illustrates that IDCNN(CH+SY)-BI would yield higher expected performance than BiLSTM-CRF(SY)-BI and IDCNN-CRF(SY)-SchemeA. This might be useful information for practical settings. Although BiLSTM(CH+SY)-BI provides the best curve of expected validation performance versus training duration, using such a model might not be efficient in settings where only CPU instances are available. We discuss this issue in Section 8.

Explaining Word Segmenters
We perform further analysis on how IDCNN(CH+SY)-BI and BiLSTM(CH+SY)-BI utilize the given features across the sequence. We use Integrated Gradients (Sundararajan et al., 2017) to explain the tag predictions of these models by computing the attribution of each position to the prediction. We set the interpolation step size to 100 and use Captum for the Integrated Gradients computation. We use a sequence of padding characters as the baseline input sequence (Mudrakarta et al., 2018) and take the absolute values of the attribution scores. Figure 4 shows the amount of attribution determining the word segmentation prediction for เวลา|ใน|ความหมาย|ของ|วิทยาศาสตร์ (/weː laː naj kʰwaːm mǎːj kʰɔ̌ːŋ wít tʰá jaː sàːt/, time in a scientific definition). One can see that the attributions concentrate in a narrow band around the diagonal for both IDCNN(CH+SY)-BI and BiLSTM(CH+SY)-BI, indicating high contributions from neighboring characters. This evidence therefore implies that going through all locations in the sequence, as done in BiLSTM(CH+SY)-BI, might not be optimal for the Thai word segmentation task. To provide quantitative measures, we select samples whose length is between 21 and 50 characters from the BEST2010 test set, resulting in 338 samples. We then calculate the attribution amount allocated to each surrounding location up to 10 characters to the left and right (Figure 5). Attribution scores are concentrated around three characters on both sides. This suggests that models such as IDCNNs, which consume features from only a selective set of neighboring locations, are sufficient for our segmentation purpose, rather than going through the whole sequence as in BiLSTMs. A minimal sketch of the attribution computation is shown below.
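The sketch uses Captum's LayerIntegratedGradients; the toy tagger, vocabulary size, and input sequence are placeholders, not our actual segmenter.

```python
import torch
import torch.nn as nn
from captum.attr import LayerIntegratedGradients

class ToyTagger(nn.Module):
    """A stand-in character tagger with an embedding layer, one convolution, and a tag classifier."""
    def __init__(self, n_chars=100, dim=16, n_tags=2):
        super().__init__()
        self.emb = nn.Embedding(n_chars, dim, padding_idx=0)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, n_tags)

    def forward(self, char_ids):                      # (batch, seq_len)
        h = self.emb(char_ids).transpose(1, 2)
        h = torch.relu(self.conv(h)).transpose(1, 2)
        return self.out(h)                            # (batch, seq_len, n_tags)

model = ToyTagger().eval()
char_ids = torch.randint(1, 100, (1, 12))             # one 12-character input
baseline = torch.zeros_like(char_ids)                 # a sequence of padding characters

def tag_score(ids, position=5, tag=0):
    """The score being explained: the chosen tag at one position of the sequence."""
    return model(ids)[:, position, tag]

lig = LayerIntegratedGradients(tag_score, model.emb)
attr = lig.attribute(char_ids, baselines=baseline, n_steps=100)
per_char_attribution = attr.sum(dim=-1).abs()          # one score per input character
```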
Moreover, we also see that the two attribution distributions are slightly left-skewed, suggesting that the models rely more on neighboring characters from the left side than from the right side. This is reasonable because it aligns with the writing direction of the Thai language.
Table 5: t2.medium is a CPU instance, while p2.xlarge is a GPU instance. Numbers in parentheses indicate a speed factor (higher is faster) relative to the reference method marked with (1×). bs stands for batch size.

Speed Benchmark
We benchmark the speed of two existing methods, namely PyThaiNLP and DeepCut, and of our word segmenters. We conduct the benchmark on two AWS cloud instances: t2.medium and p2.xlarge; the former is a CPU instance, while the latter is a GPU instance. We use UNIX's time to measure the wall-clock time of segmentation and AWS's official deep learning image. For each method or configuration, we perform five runs, of which the first two are burn-in, and we average the segmentation time of the last three. We refer to Appendix A for our benchmark code. Table 5 shows that IDCNNs are generally faster than BiLSTMs on t2.medium and p2.xlarge when the batch size is small (bs=1); they are on par for a large batch size (bs=128). Compared to DeepCut, our syllable models are faster. More precisely, the best syllable variants of the BiLSTMs and IDCNNs are 3.3× and 8.9× faster than DeepCut respectively on t2.medium. The speed factor of our best models reduces to around 3× on p2.xlarge (bs=1).
For p2.xlarge (bs=128), we did not benchmark DeepCut because its API does not allow batch size configuration; we use BiLSTM(CH)-BI's time as the reference instead. From the table, we see that BiLSTM-CRF(SY)-BI and IDCNN-CRF(SY)-SchemeA are on par in terms of speed. Nevertheless, we consider the results on t2.medium the most important because word segmentation is one of the first steps in NLP, and it should be efficient without special hardware.
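The sketch below illustrates the run-averaging protocol (five runs with the first two as burn-in); it is a simplified in-process version, whereas our actual benchmark (Appendix A) measures whole-process wall-clock time with UNIX's time.

```python
import time

def average_segmentation_time(segment_fn, texts, runs=5, burn_in=2):
    """Average wall-clock segmentation time over the last (runs - burn_in) runs."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for text in texts:
            segment_fn(text)
        durations.append(time.perf_counter() - start)
    kept = durations[burn_in:]
    return sum(kept) / len(kept)
```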

Conclusion
In this work, we propose to use syllable knowledge for Thai word segmentation. We perform a systematic comparison between existing Thai word segmenters and our different model configurations based on LSTMs and CNNs. Our results show that incorporating syllables can improve word segmentation performance significantly in both in-domain and out-of-domain evaluations. We also analyze the results of the hyperparameter search, and the analysis shows that using syllable features requires a smaller computational budget for the search to find word segmenters with decent performance.
We also explain two character-level word segmenters using Integrated Gradients, showing that the LSTM-based word segmenter relies on only a fraction of the neighboring characters when predicting the tags of the sequence, which suggests that CNN models are more efficient than LSTMs. Our speed benchmark shows that our best and fastest syllable model, IDCNN-CRF(SY)-SchemeA, is 8.9× and 3.5× faster than DeepCut, the current state of the art, on CPU and GPU instances respectively.
In the future, we would like to investigate 1) an end-to-end learning paradigm, combining syllable and word segmentation into one model; 2) the effect of pretrained syllable embeddings; and 3) the effect of word segmentation on downstream tasks.

A Code
Our code is available at github.com/heytitle/SyllablebasedNeuralThaiWordSegmentation.

C.1 Syllable Segmentation for Word Segmenters
For word segmenters that rely on syllable features, we first split the input sentence into separate parts based on punctuation marks. Then, each part is syllable-segmented, and the list of syllables for the sentence is formed from these parts' syllables.
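A simplified sketch of this preprocessing is shown below; `syllable_segment` stands in for our CRF syllable segmenter, and the punctuation pattern is illustrative.

```python
import re

PUNCT = r"[!\"#%&'()*,\-./:;?@\[\]{}~\s]+"   # illustrative punctuation/whitespace pattern

def sentence_syllables(text, syllable_segment):
    """Split the input on punctuation, syllable-segment each part, and flatten the result."""
    syllables = []
    for part in re.split("(" + PUNCT + ")", text):
        if not part:
            continue
        if re.fullmatch(PUNCT, part):
            syllables.append(part)            # keep punctuation as its own unit
        else:
            syllables.extend(syllable_segment(part))
    return syllables
```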

C.2 BEST2010 Out of Vocabulary Experiment
We first select words that are in the BEST2010 test set but not in the training set. We then keep only Thai words. More details can be found in oov.py in our repository.

C.3 TNHC
We remove one file, จดหมายเหตุรายวันของสมเด็จเจ้าฟ้ามหาวชิรุณหิศ.json, due to encoding issues. For the rest, we remove meta tags, such as author, date, or filename, from each file. Our preprocessing script can be found at process-tnhc.py in our repository.

C.4 Character Type and Syllable Dictionaries
Our character type dictionary is drawn from DeepCut's character type dictionary. It separates characters into 12 categories, shown in Table 7.
For the syllable dictionary, we extract all syllables from the BEST2010 training set and then map each syllable to a category if one of our regular expression rules applies. We have four categories, namely 1) punctuation, 2) URLs, 3) sequences of English letters, and 4) numbers.
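The category mapping can be sketched with a handful of regular expressions, as below; the exact rules used to build our dictionary may differ.

```python
import re

# Illustrative rules for collapsing special syllables into categories.
CATEGORY_RULES = [
    ("<punct>", re.compile(r"^[!\"#%&'()*,\-./:;?@\[\]{}~]+$")),
    ("<url>",   re.compile(r"^(https?://|www\.)\S+$")),
    ("<eng>",   re.compile(r"^[A-Za-z]+$")),
    ("<num>",   re.compile(r"^[0-9,.]+$")),
]

def syllable_to_entry(syllable):
    """Map a syllable to a dictionary entry: a category token if a rule applies, else the syllable itself."""
    for category, pattern in CATEGORY_RULES:
        if pattern.match(syllable):
            return category
    return syllable
```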

C.5 Benchmark
We remove named-entity tags and separators (|), both repeated and trailing ones, from the reference and the given input. We refer to benchmark.py in our repository for more details.