2kenize: Tying Subword Sequences for Chinese Script Conversion

Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities.


Introduction
Chinese character (or script) conversion is a common preprocessing step for Chinese NLP practitioners (Zhang, 2014; Shi et al., 2011). Traditional Chinese (TC) and Simplified Chinese (SC) are the two standardized character sets (or scripts) for written Chinese. TC is predominantly used in Taiwan, Hong Kong, and Macau, whereas SC is mainly adopted in mainland China; SC characters are simplified versions of TC characters in terms of strokes and components. Chinese NLP practitioners therefore apply script converters to translate a dataset into their desired script. This is especially useful for TC NLP practitioners, because TC is less widely used and under-resourced compared to SC. Converting from TC to SC is generally straightforward, because there are one-to-one correspondences between most of the characters, so conversion can be performed with mapping tables (Denisowski, 2019; Chu et al., 2012). However, conversion from SC to TC is an arduous task, as some SC characters map to more than one TC character depending on the context of the sentence. A detailed analysis by Halpern and Kerman (1999) shows that SC to TC conversion is a challenging and crucial problem, as 12% of SC characters have one-to-many mappings to TC characters. Our experiments show that current script converters achieve sentence accuracies of only 55-85% (§3).
Another issue is that, because Chinese is an unsegmented language, different tokenizations lead to different conversions; see Table 1 for an example. Off-the-shelf script converters translate 维护发展中国家共同利益 into 維護髮展中國家共同利益, whereas the correct conversion is 維護發展中國家共同利益. Here, the SC character 发 has two TC mappings, 髮 (hair) and 發 (to issue, to develop), and the correct choice depends on the context and the tokenization, which shows that this task is non-trivial.
Despite this being an important task, there is a lack of benchmarks (the ChineseNLP website states that script conversion benchmarks and experiments currently do not exist: https://chinesenlp.xyz/#/docs/simplified_traditional_Chinese_conversion), which implies that this problem is understudied in NLP. In this study, we propose 2kenize, a subword segmentation model which jointly considers the Simplified Chinese source and its possible Traditional Chinese conversions. We achieve this by constructing a Viterbi tokenizer based on a joint Simplified Chinese and Traditional Chinese language model. Performing mapping disambiguation on top of this tokenization improves sentence accuracy by 6 points compared to off-the-shelf converters and supervised models. Our qualitative error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Additionally, we address the lack of benchmark datasets by constructing datasets for script conversion and TC topic classification.

2kenize: Joint Segmentation and Conversion
We employ subword tokenization, as it addresses the issue of rare and unknown words (Mikolov et al., 2012) and has been shown to be advantageous for language modelling of morphologically rich languages (Czapla et al., 2018). It also improves accuracy for neural machine translation (NMT) and has become a prevailing practice there (Denkowski and Neubig, 2017). The most widely-used method is Byte Pair Encoding (BPE, Sennrich et al. (2016)), a compression algorithm that merges frequent sequences of characters, so that rare strings are segmented into subwords. Unigram (Kudo, 2018) and BPE-Drop (Provilkov et al., 2019) use subword ambiguity as noise, stochastically corrupting the segmentation to make it less deterministic. For NMT tasks, subword segmentation is generally treated as a monolingual task and applied independently to the source and target corpora. We hypothesize that translation tasks, and specifically conversion tasks as investigated here, would perform better if segmentation were performed jointly. Hence, in this section, we describe our proposed method, 2kenize, which segments jointly by taking the source sentence and its approximate target sentences into account. This is the main idea of this paper: 2kenize jointly considers the source sentence and its corresponding target conversions by performing look-aheads with the mappings.

Outline of the proposed approach
Given the possible SC character sequence s = s_1 s_2 ... s_n and TC character sequence t = t_1 t_2 ... t_n, we want to find the most likely t, which is given by the Bayes decision rule t* = argmax_{t ∈ T*} p(t | s), where T* denotes the set of all strings over symbols (t_i) in T (Kleene star). We divide this problem into two parts: finding the mapping sequence (Eq. 2) and finding the TC sequence from the mappings (Eq. 7). We define a mapping as m_i = (s_i, t_i) = (s_{j:k}, t_{j:k}), where t_{j:k} = {t^1_{j:k}, ..., t^n_{j:k}} is the set of TC candidates that correspond to the SC span in the mapping. Thus, a mapping sequence is a concatenation of mappings, m = m_1 m_2 ... m_l. Let M be the superset of all possible mapping sequences and M(s) the set of all mapping sequences resulting from s. Then, the best possible mapping sequence is given by m* = argmax_{m ∈ M(s)} p(m) (Eqs. 2-3). Moreover, p(m) can be expanded over its constituent mappings (Eq. 4). After expanding the mapping sequences, we approximate this as the sum of the likelihoods of the two sequences formed by the co-segmentation (Eq. 5). The set of possible TC sequences is given by the Cartesian product of the t_i. These likelihoods can then be estimated using language model (LM) probabilities (Eq. 6).
Once the mapping sequence m has been found, all possible TC sequences are obtained from the set m_t, which is the Cartesian product of all t_i in m. Following Eq. (7), we compute the approximate final sequence using beam search.
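The pipeline described above can be summarized as follows. This is a reconstruction from the prose, so the exact form of Eqs. (2)-(7) in the original may differ: coseg(m) denotes the two co-segmented (SC and TC) sequences induced by m, p_LM the subword LM used for scoring mappings, and p_LM,TC the TC-only LM used in beam search.

```latex
% Reconstructed sketch of the objective; the exact original equations may differ.
\begin{align}
  t^{*} &= \operatorname*{arg\,max}_{t \in T^{*}} \; p(t \mid s) \\
  m^{*} &= \operatorname*{arg\,max}_{m \in M(s)} \; p(m)
         \;\approx\; \operatorname*{arg\,max}_{m \in M(s)}
           \sum_{c \,\in\, \mathrm{coseg}(m)} p_{\mathrm{LM}}(c) \\
  t^{*} &\approx \operatorname*{arg\,max}_{t \,\in\, m_{t}} \; p_{\mathrm{LM,TC}}(t)
         \qquad \text{(searched approximately with beam search)}
\end{align}
```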

Model Architecture
Viterbi, a dynamic programming (DP) algorithm, considers phrases (or subsequences) and performs segmentation in a 'bottom-up' fashion (Nagata, 1994; Sproat et al., 1996). RNN-based language models are theoretically considered to be '∞-gram' models (Khandelwal et al., 2018), which constitutes a challenge. Consider the sentence 维护发展中国家共同利益. A potential challenge is to adequately estimate the probability of 共同利益: as this sequence occurs infrequently at the beginning of sentences in the corpus, an RNN would under-estimate its probability. Moreover, an RNN would likely lose some useful context and perform worse without it (Kim et al., 2019). So, for Viterbi to perform well with an RNN, we train the language model on subsequences, by training the model in such a way that it samples subsequences randomly. Using the regularization method of Kudo (2018), we sample from the n-best segmentations in each epoch. This is done so that the model learns different segmentations of a subsequence, with a similar motivation as above. Recent work has shown that varying subword segmentations leads to better downstream model performance (Provilkov et al., 2019; Kudo, 2018; Hiraoka et al., 2019); we therefore use it as a data augmentation strategy. Once we obtain the n-best segmentations with scores, we normalize them and use the normalized scores as sampling probabilities (see Fig. 1). As opposed to other subword tokenizers, where the vocabulary size is fixed, we do not limit the vocabulary in our model. Hence, there are numerous possible segment combinations, which raises the need to cache the most frequent tokens. Inspired by work on cache-based LMs (Kawakami et al., 2017) and ghost batches (Hoffer et al., 2017), we only keep the top-k token embeddings in the main network memory and track the gradients of less recently used token embeddings (a Least Recently Used, LRU, policy). This could be thought of as a form of gradient accumulation for the infrequent embeddings.
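As a rough illustration of the sampling step, the following minimal sketch draws one segmentation per epoch from normalized n-best scores. The helper nbest_segmentations and its (segmentation, log-score) output format are assumptions for illustration, not the actual implementation.

```python
import math
import random

def sample_segmentation(sentence, nbest_segmentations, n=8):
    """Sample one segmentation of `sentence` from its n-best list.

    `nbest_segmentations(sentence, n)` is assumed to return a list of
    (segmentation, log_score) pairs, e.g. from a Unigram/Viterbi lattice.
    """
    candidates = nbest_segmentations(sentence, n)
    if not candidates:
        return list(sentence)  # fall back to single characters
    # Turn log scores into a normalized sampling distribution.
    max_log = max(score for _, score in candidates)
    weights = [math.exp(score - max_log) for _, score in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    # One draw per epoch acts as a data augmentation step.
    return random.choices([seg for seg, _ in candidates], weights=probs, k=1)[0]
```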

Segmentation and Disambiguation
This optimal sequencing problem can be formulated as an overlapping-subsequence problem, which can be solved using LM-based Viterbi decoding (Nagata, 1994; Sproat et al., 1996). Fig. 2 illustrates this process of joint subword modelling. Here, we take Eq. (6) as the objective function for finding the mapping sequence; in our implementation, however, we use subword perplexities (Cotterell et al., 2018; Mielke, 2019). For the TC LSTM, we add the probabilities of the beams of the possible sequences. As discussed in §2.1 and Eq. (7), beam search is needed to select the best subword sequence for TC. Once the sentences are tokenized, the mapping table is used to convert each SC token to the corresponding TC token. We extract the final TC sentence by resolving ambiguities through beam search with the TC LSTM (Fig. 2).
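For concreteness, the sketch below shows LM-scored Viterbi segmentation over a single sentence. The scorer lm_logprob (log probability of a candidate token given the preceding text) and the maximum subword length are illustrative assumptions and do not correspond to the exact joint SC/TC implementation described above.

```python
def viterbi_segment(sentence, lm_logprob, max_len=6):
    """Segment `sentence` into subwords that maximize an LM score.

    `lm_logprob(prefix, token)` is assumed to return the log probability
    of `token` given the raw prefix string preceding it.
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)  # (start_index, token) for backtracking
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            token = sentence[start:end]
            score = best[start] + lm_logprob(sentence[:start], token)
            if score > best[end]:
                best[end], back[end] = score, (start, token)
    tokens, i = [], n
    while i > 0:  # recover the best path
        start, token = back[i]
        tokens.append(token)
        i = start
    return list(reversed(tokens))
```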

Dataset for Intrinsic Evaluation
We construct a gold standard corpus for both Chinese scripts, covering four domains: Hong Kong literature and newswire, and Taiwanese literature and newswire (Table 2), with each domain containing 3,000 sentences. SC-TC mapping tables are constructed from existing resources (Denisowski, 2019; Chu et al., 2012). We heuristically convert the selected TC sentences to SC using OpenCC and ask annotators to manually correct any incorrect conversions (a detailed data statement is given in the appendix).

Language Model Training
We choose the SIGHAN-2005 Bakeoff dataset to train the segmentation-based language model (Emerson, 2005). For SC, we select the PKU and MSR partitions, and for TC, we use the Academia Sinica and CityU partitions. We apply maximal matching (a heuristic dictionary-based word segmenter) to pre-process these datasets by segmenting words into subwords (Wong and Chan, 1996); here, 'dictionary' refers to the word list in the mapping table. We then train a 2-layer LSTM language model with tied weights and embedding and hidden sizes of 512 (Sundermeyer et al., 2012) on this segmented dataset, with subsequence sampling and stochastic tokenization as discussed in §2.2.
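The pre-processing step above is a greedy longest-match segmentation; a minimal sketch follows, where the dictionary argument (the mapping-table word list) and the maximum match length are stand-ins for the actual resources.

```python
def maximal_matching(sentence, dictionary, max_len=6):
    """Greedy left-to-right longest-match segmentation.

    `dictionary` is assumed to be a set of known (sub)words, e.g. the
    word list of the SC-TC mapping table; unmatched characters become
    single-character tokens.
    """
    tokens, i = [], 0
    while i < len(sentence):
        match = sentence[i]  # fall back to a single character
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens
```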

Baselines and Ablations
We implement the following baselines for the experimentation. Off-the-shelf Converters: Hanziconv and Mafan are dictionary-based character converters; evaluating them gives a lower bound on accuracy. OpenCC uses a hybrid of characters and words (specifically, a trie-based tokenizer) for script conversion (Pranav A et al., 2019).
Language Model Disambiguation: A strong baseline for this problem is to build a language model to disambiguate between the characters, which is quite similar to STCP (Xu et al., 2017). We use a 2-layer LSTM language model trained on a Traditional Chinese corpus.
Table 3: Results of the intrinsic evaluation experiments, reported as a mean across 10 different seeds. We use disambiguation error density (DED; the lower, the better) and sentence accuracy (SA; the higher, the better) as evaluation metrics. Bold: best, underlined: second-best.

Neural Sequence Models: We heuristically convert Traditional Chinese Wikipedia to Simplified Chinese using OpenCC and use it for training the seq2seq model (Sutskever et al., 2014). We construct a 20-layer neural convolutional sequence model (Gehring et al., 2017) (both in the encoder and the decoder) using fairseq (Ott et al., 2019). We perform ablation tests by inserting the following segmentation models.
Word tokenization: We use Jieba (https://github.com/fxsjy/jieba), a commonly used hidden Markov model-based word tokenizer for Chinese NLP. Dictionary substrings: We apply maximal string matching, a dictionary-based greedy tokenizer (Pranav A et al., 2019; Wong and Chan, 1996).

Results for Intrinsic Evaluation
We evaluate our models using the metrics of disambiguation error density (DED) and sentence accuracy (SA). DED is the average total edit distance per 1,000 ambiguous Simplified characters, i.e. (Σ edit distances / Σ ambiguous Simplified characters) × 1000. SA is the percentage of sentences converted entirely correctly. Contrary to previous papers, we do not report character-based accuracy values, as most characters have straightforward mappings; this is why we opt for a less forgiving metric like SA, where every character in a sentence has to be correctly converted.
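A minimal sketch of how these two metrics can be computed; the edit_distance function (character-level Levenshtein) and the pre-counted number of ambiguous Simplified characters are assumed inputs.

```python
def evaluate(pairs, edit_distance, n_ambiguous):
    """Compute DED and SA for a list of (predicted, gold) sentence pairs.

    `edit_distance` is assumed to be a character-level Levenshtein
    function; `n_ambiguous` is the total count of ambiguous Simplified
    characters in the evaluation set.
    """
    total_edits = sum(edit_distance(pred, gold) for pred, gold in pairs)
    ded = total_edits / n_ambiguous * 1000  # lower is better
    sa = 100.0 * sum(pred == gold for pred, gold in pairs) / len(pairs)  # higher is better
    return ded, sa
```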
Results are shown in Table 3, broken down by domain and overall. Our model attains an average DED of 3.0 and an SA of 92.4% overall, whereas the best existing converter, OpenCC, only achieves a DED of 4.3 and an SA of 85.3%. We find that seq2seq and LM-based disambiguation perform roughly on par with OpenCC, due to the large number of false positive errors made by these models. Jieba achieves an average DED of 11.2, as it does not handle OOV words well. Maximal matching of segmented words and Unigram subwords achieve an overall DED of 4.5 and 3.7, respectively, showing that joint segmentation yields better results. We observe that accuracy values are slightly worse on news text, due to the relatively high number of new entities in those datasets. Heuristically converting TC to SC introduces conversion errors into the training dataset; additionally, seq2seq approaches tend to reword the target sentence, which makes them unsuitable for this task.

Qualitative Error Analysis
We manually inspect incorrect conversions in the intrinsic evaluation and find four recurring linguistic patterns that confuse the converters. We instructed the annotators to label each item in the dataset (12,000 sentences overall in the intrinsic evaluation dataset) according to whether it contains any of these patterns. Table 4 gives statistics for these patterns and the converters' performance on them.
Code mixing: Vernacular Cantonese characters (zh-yue) are a subset of TC characters but do not follow the norms of standard written Chinese (Snow, 2004). We find that some of the sentences in our dataset are code-mixed with zh-yue (e.g. speech transcription) or English (e.g. named entities). Consider the snippet "... 古惑架 BENZ 190E 撞埋支 ...", which is code-mixed with both zh-yue and English: the characters "BENZ 190E", 架 and 埋支 are not part of standard written Chinese. We find that OOV words are 2kenized into single-character tokens, which results in "古 惑 |架 |B|E|N|Z| 1|9|0|E| 撞 |埋|支". Thus, 2kenize distributes the entropy over multiple tokens rather than over a single token (generally UNK is used in such cases). This gives the language model more room for multiple guesses, a clear advantage over word models or simply replacing the span with UNK, and a reason why subword tokenizers outperform closed-vocabulary models (Merity, 2019).
Disguised Named Entities: Take the recurring sentence 维护发展中国家共同利益. Observe that the sentence contains the frequent word 中国 (China); however, the actual meaning and English translation do not involve "China" at all, since the string 中国 spans a word boundary here (发展中 'developing' + 国家 'countries'). This is an interesting linguistic trait of Chinese, where a frequent word form can surface in a sentence without functioning as that word. This could easily trip up a tokenizer, as the probability of 中国 being an independent token is high, and having 中国 as a separate token could lead to an incorrect conversion (Table 1). We find in 2kenize's trellis that "维护 |发展 |中" has a higher probability than the other possible segmentations. Substructure lookups and beam search in our setup considerably reduce the probability of a wrong tokenization, and the sentence is therefore converted correctly.
Repetitions: We find that in 3.57% of sentences, named entities are repeated. Interestingly, STCP, which uses a language model for disambiguation, often converts only one of the repeated tokens correctly, as shown in the table: STCP prefers 佐治亞 over 喬治亞 in the first occurrence, but then prefers 喬治亞 in the second occurrence as it gets more context. 2kenize converts both entities correctly, very likely due to substructure lookups.
Failure Cases: Dictionary-based converters (OpenCC, HanziConv and Mafan) only use the first conversion candidate when multiple candidates are available. STCP often converts named entities wrongly, especially those involving long-range dependencies and repetitions. Although we find that 2kenize converts some unseen named entities perfectly, some of its errors are caused by infrequent characters. A few cases are related to variant characters, which are often used interchangeably.

Extrinsic Evaluation
An accurate script converter should produce a less erroneous dataset, which should in turn improve the accuracy of downstream tasks. In this section, we examine this assumption by demonstrating the effect of script conversion on topic classification tasks. We also study the impact of tokenization and pooling on topic classification accuracy. We apply each converter to the language modelling corpus (Wikitext), then train classifiers for informal and formal topic classification on the converted data. This allows us to compare the converters on a specific downstream task.

Dataset for Extrinsic Evaluation
This section describes the data used for the extrinsic evaluation experiments: a pretraining dataset consisting of Chinese Wikipedia, and topic classification datasets.

Pretraining Dataset
We use Chinese Wikipedia articles for pretraining the language model. Script conversion is an issue in Chinese Wikipedia; currently, a server-side mechanism automatically converts the scripts (dictionary-based) depending on the location of the user. However, Wikipedia provides an option to view articles without conversion, which we use for our corpus. We use the zh-CN, zh-HK and zh-yue wikis to retrieve articles originally written in SC, TC, and vernacular Cantonese + TC, respectively, with the help of wikiextractor. We pretrain the formal text classification models on articles from zh-HK and converted zh-CN, and the classification models for informal text on articles from zh-HK, zh-yue, and converted zh-CN.

Performance of various classifiers
For classification baselines, we use character-based SVMs (Support Vector Machines; Joachims, 1998), a character-level CNN, BERT, and MultiFiT (see Appendix E). We employ the cosine cyclic learning rate scheduler (Smith, 2015), where the limits of the learning rate cycles are found by increasing the learning rate logarithmically and computing the evaluation loss for each learning rate (Smith, 2018). To choose the batch size, we compute the gradient noise scale for each batch size candidate and pick the one that gives the highest gradient noise scale (McCandlish et al., 2018). We apply label smoothing (Szegedy et al., 2015) and use mixed precision training on an RTX 2080. We implement our experiments using PyTorch (Paszke et al., 2019) and FastAI (Howard and Gugger, 2020).
MultiFiT uses concat pooling after the last layer of the QRNN, meaning the last time step is concatenated with an average and a max pool over all time steps. We instead hypothesize that aggregating the concat pools of every layer (layer pooling, Fig. 3) yields better representations. We fine-tune the BERT language model and pretrain the MultiFiT language model on Chinese Wikipedia subsets (§4.1.1). All classifiers are then trained on the given training set (character-based models) and evaluated on the test set in terms of accuracy, as the number of items in each class is roughly equal. This experiment (and subsequent experiments in this section) is repeated across ten different seeds (Reimers and Gurevych, 2018) and data splits (Gorman and Bedrick, 2019), and the results are shown in Table 6. Layer pooling shows an absolute improvement of 0.4% over concat pooling on formal and informal topic classification, thus confirming our hypothesis.
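A minimal PyTorch-style sketch of the two pooling variants discussed above; the exact aggregation used for layer pooling is not spelled out in the text, so the summation here is an assumption for illustration.

```python
import torch

def concat_pool(hidden):
    """Concat pooling over a sequence of hidden states.

    `hidden` has shape (batch, time, features); the last time step is
    concatenated with mean and max pools over time, as in MultiFiT.
    """
    last = hidden[:, -1, :]
    mean = hidden.mean(dim=1)
    mx, _ = hidden.max(dim=1)
    return torch.cat([last, mean, mx], dim=-1)

def layer_pool(layer_hiddens):
    """Layer pooling: aggregate concat pools from every RNN layer.

    `layer_hiddens` is a list of (batch, time, features) tensors, one per
    layer; summing the per-layer pools is an assumed aggregation choice.
    """
    pools = [concat_pool(h) for h in layer_hiddens]
    return torch.stack(pools, dim=0).sum(dim=0)
```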

Effect of Conversion on Classification
For each converter (OpenCC, STCP, 2kenize), we convert the zh-CN wiki dataset and augment it with the TC wiki dataset. We then pretrain on this dataset, finetune on the domain data, and train MultiFiT with layer pooling on each of these three datasets. Test set accuracies are shown in Table 7. The dataset converted by 2kenize outperforms the other converters, giving an absolute improvement of 0.9% on formal and 4.5% on informal topic classification over the second-best converter. These results emphasise that better script conversion improves the quality of the pretraining dataset, which in turn boosts the performance of downstream tasks like topic classification.

Effect of Tokenization on Classification
Subword tokenizers mostly rely on frequency and do not take the likelihood of the tokenized sentence (in the sense of an n-gram language model) into consideration. Hence, we also evaluate LM-based Viterbi segmentation (henceforth referred to as 1kenize), where the LM is the TC LSTM described in §2.2. We report results in Table 8. We find that for formal classification, 1kenize and Unigram perform best. 1kenize outperforms the other subword tokenizers on the noisier informal dataset, giving an absolute improvement of 0.5% over the second-best method, BPE-Drop.
We plot log token frequency against log frequency rank in Figure 4. This distribution is based on the LIHKG dataset, which is noisier than the other domains. We observe that the character and word distributions are steeper than those of language model-based subword tokenizers, which indicates that subword tokenizers produce a less skewed token distribution. Subword tokenizers like BPE and Unigram are deterministic and rely on frequency for segmentation. Since 1kenize is contextual, being LM-based, we find that it produces the least skewed distribution (lowest Zipf's law coefficient (Zipf, 1949)), which also reduces variance, a reason why this simple segmentation method outperforms the others for informal text classification.
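A small sketch of how the Zipf coefficient reported here can be estimated from a tokenized corpus, as the magnitude of the slope of a linear fit in log-log space; this is a generic estimate, not the authors' exact procedure.

```python
import numpy as np
from collections import Counter

def zipf_coefficient(tokens):
    """Estimate the Zipf coefficient of a token distribution.

    Fits a line to log(frequency) vs. log(rank); the magnitude of the
    (negative) slope is the coefficient -- a lower value corresponds to
    a less skewed distribution.
    """
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope
```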

Takeaways and Open Questions
The contributions of our work are:
• 2kenize, a subword segmentation model which jointly segments a source sentence and its corresponding approximate target conversions.
• An unsupervised script converter based on 2kenize, which shows a significant improvement over existing script converters and supervised models.
• 1kenize, a variant of 2kenize which performs tokenization on Traditional Chinese sentences only and improves accuracy on topic classification tasks.
• Character conversion evaluation datasets spanning Hong Kong and Taiwanese literature and news genres.
• Traditional Chinese topic classification datasets in formal (scraped from Singtao) and informal (scraped from LIHKG) styles, spanning genres such as news, social media discussions, and memes.
The key findings of our work are:
• Our script converter shows strong performance when dealing with code mixing and named entities. Supervised models are prone to errors related to anaphora and unseen entities.
• A simple LM-based Viterbi segmentation model outperforms other subword tokenizers on topic classification tasks and reduces the skewness of the token distribution on a noisy dataset.
We leave some open questions:
• How can we exploit subword variations to reduce skewness in NLU tasks?
• Would subword-segmentation transfer be helpful for other NMT-NLU task pairs, as it was here for 2kenize (script conversion) to 1kenize (classification)?
We anticipate that this study will be useful to TC NLP practitioners, as we address several research gaps, namely script conversion and the lack of benchmark datasets.

B.1 Corpus
In this subsection, we discuss the annotation procedure and the characteristics of the corpus used for the intrinsic evaluation. We follow the data statement design of Bender and Friedman (2018) for the description.

B.1.1 Curation Rationale
The script conversion task is understudied in NLP, and we could not find good-quality parallel corpora to evaluate our approaches. The idea is to curate a diverse collection of TC works and convert them to SC, exploiting the largely one-to-one correspondence in that direction. However, we found that some of the conversions were wrong because (1) dictionaries sometimes produce incorrect conversions, (2) there are stylistic differences between HK and TW characters and phrasing, (3) Cantonese is code-mixed with Traditional Chinese, (4) there is code-mixing with non-Chinese characters, and (5) some characters in TC-SC conversion have one-to-many mappings as well. Hence, we perform quality control with human annotators to validate our conversions.

B.1.2 Annotation Process
Demographic: We opted for 4 trained annotators, 2 for annotating HK-style TC and 2 for annotating TW-style TC, thus double-annotating the corpus. They ranged in age from 18 to 20 years, included 2 men and 2 women, gave their ethnicity as Hong Konger (2) and Taiwanese (2), and their native spoken languages were Cantonese (2) and Taiwanese Mandarin (2).
Workload: Annotators validated approximately 100 sentences per hour, for a total workload of roughly 60 hours. They were given a month to annotate and were paid 5,000 Hong Kong Dollars on completion.
Procedure: The annotators were shown TC sentences and the converted SC sentences (we used OpenCC to convert) and were asked to validate and correct any conversion mistakes. In cases of disagreement, we used majority voting between the automatic conversion and the annotators' corrections.
We provide raw agreement and Krippendorff's α in Table 1 for the pooled data and various sub-groups of the dataset. We also report inter-annotator agreement at the character and phrase levels in Table 2.
These agreement values are difficult to interpret, but generally α ≥ 0.8 is considered to be substantial.

B.1.3 Speech Situation
The publication dates and sources are listed in Table 2. The HK and TW literature consists of popular books, many of which have movie and drama adaptations (which we highly recommend). Specifically, the HK literature contains code-mixed characters with Vernacular Cantonese, which is quite unusual in formal publishing practice; these books are often cited as an example of the popularization of written Cantonese in the 1960s (Snow, 2004). We also found code-mixing with English and numerous transliterated named entities, which we use for the qualitative error analysis in Table 4.

B.1.4 Text Characteristics
Although Hong Kong and Taiwan both use Traditional Chinese, they are stylistically different, as the dominant spoken language in HK is Cantonese and in TW is Taiwanese Mandarin. Thus, it is essential to test the performance of our algorithms on both styles. We collected two genres for each style: informal literature and formal news. We found more variation within informal HK-TW literature than within the formal news. We intentionally chose long sentences (average length of 200 characters), especially ones containing more ambiguous characters, to make the dataset more challenging.

C Data Statement for Extrinsic Evaluation
This subsection describes the characteristics of the topic classification datasets in Traditional Chinese. For a short overview, please see Table 5.

C.1 Curation Rationale
We choose two different styles for curating this dataset: formal and informal. The formal text consists of a news dataset scraped from Singtao, one of the popular newswires in Hong Kong. The classes in this dataset are the Financial, Educational, Local, International, and Sports subsections. There are 17,500 unlabelled and 3,900 labelled items in this part. The authors would like to credit I-Tsun Cheng for helpful suggestions in curating this dataset.
The informal text consists of a dataset of social media posts scraped from LIHKG, a Twitter equivalent in Hong Kong. The classes in this dataset are Sports, Opinions, Memes, IT, Financial, and Leisure. There are 21,000 unlabelled and 4,900 labelled items in this part. The authors would like to credit Leland So for helpful suggestions in curating this dataset.

C.2 Language Variety
The texts in the formal subsection are typically written in Hong Kong-style Traditional Chinese (zh-hant-hk). The posts scraped from LIHKG are predominantly in Traditional Chinese (zh-hant-hk) and are often code-mixed with Vernacular Cantonese (zh-yue) and English (en-HK).

C.3 Speaker Demographic
Speakers were not directly approached for inclusion in this dataset and thus could not be asked for demographic information. Our best guess is that LIHKG forum users are typically university students (19-23 years), and the majority of them speak Cantonese as a native language.

C.4 Text Characteristics
The news articles are scraped from 2017-2019 and the LIHKG posts from 2017-2018. Some of the posts on LIHKG are in transliterated Cantonese form, and some are not written in Standard Written Chinese. The news posts are generally quite long and often contain more than

D.1 Heuristic Grid Search of Learning Rate and Batch Size Hyperparameters
We employ the cosine cyclic learning rate scheduler (Smith, 2015), where the limits of the learning rate cycles are found by increasing the learning rate logarithmically and computing the evaluation loss for each learning rate (Smith, 2018). To choose the batch size, we compute the gradient noise scale for each batch size candidate and pick the one that gives the highest gradient noise scale (McCandlish et al., 2018).
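A rough sketch of the learning-rate sweep described here; eval_loss is a hypothetical helper that trains briefly at a given learning rate and returns the evaluation loss, and using one decade below the best rate as the lower cycle limit is our illustrative choice rather than a detail stated in the paper.

```python
import numpy as np

def lr_range_test(eval_loss, lr_min=1e-6, lr_max=1.0, steps=50):
    """Pick cyclic-LR limits by sweeping learning rates logarithmically.

    `eval_loss(lr)` is assumed to run a few training steps at `lr` and
    return the evaluation loss; the learning rate with the lowest loss
    anchors the cycle limits.
    """
    lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), steps)
    losses = np.array([eval_loss(lr) for lr in lrs])
    best = int(losses.argmin())
    # Use one decade below the best learning rate as the lower limit.
    return lrs[best] / 10.0, lrs[best]
```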

D.2 Training of SC and TC Language Model
The

D.3 Training of Convolutional seq2seq
The training dataset is Traditional Chinese Wikipedia heuristically converted with OpenCC. We use 20 layers in both the encoder and the decoder, with an embedding size of 512, implemented in fairseq (Ott et al., 2019). Dropout is 0.1 and we use an adaptive softmax to speed up training. Training took 1 day on an RTX 2080 with FP16 training, a batch size of 128, and 250 epochs.

E.1 Character CNN training
The datasets are described in §4.1.2. The model architecture is a 7-layer CNN with tied weights and residual blocks. The embedding size is 512 and the hidden size is 512. We perform concat pooling in the last layer, where we concatenate the last output with the mean pool and max pool of all representations. Training took 16 hours on an RTX 2080 with FP16 training, a batch size of 256, and 350 epochs.

E.3 MultiFiT training
We found MultiFiT to be highly reproducible compared to other models, as it gives the least variance across seeds and data splits. Hyperparameters are chosen by a heuristic grid search over learning rate and batch size. The datasets are described in §4.1.2. Pretraining the language model takes 1 GPU day per MultiFiT experiment. Fine-tuning the language model takes 3 hours, with a patience of 2 epochs; fine-tuning the classifiers also takes 3 hours with a patience of 2 epochs. All MultiFiT experiments are implemented using FastAI (Howard and Gugger, 2020).

F Alternative texts for figures and Chinese explanations
F.1 Alternative text for Figure 1
The recurring Chinese sentence is split, and we take one of its subsequences; the other subsequence is used in the next iteration. We perform Unigram Viterbi segmentation on this subsequence and obtain the probabilities of the n-best segmentations. The probabilities are normalized and we sample one segmentation using this distribution. This segmentation goes into the model, which consists of cached embeddings, followed by stacked LSTM layers, followed by concat pooling (the last output, mean pooling, and max pooling), which then goes through a linear layer. We cache the top-k embeddings in main memory; for the least frequent embeddings, we track the gradients but do not keep them in the main network (we used gradient accumulation).

F.2 Alternative text for Figure 2
From the given SC sentence, we create possible TC sequences using the mappings. We input these to Viterbi, which recursively calls the LSTM. Using Eq. (6) as the scoring function, Viterbi outputs the mapping sequence. We perform beam search to find the best TC sequence from the mapping sequence, where we use the same TC LSTM again.

F.3 Alternative text for Figure 3
The architecture contains 4 stacked QRNN layers. Each layer has QRNN cells. After every layer we perform a concat pool (taking the last output, max pool and mean pool). We aggregate these pools in the final layer which goes into a linear layer. We highly recommend this for making the training more stable.

F.4 Alternative text for Figure 4
We have plotted the log-log token distribution: frequency rank on the x-axis and frequency on the y-axis.
Character-based tokenization gives a slope of 1.703, BPE-Drop gives 1.31, BPE gives 1.27, word tokenization (Jieba) gives 1.41, unigram sampling gives 1.28, and 1kenize gives the least skewed distribution with a slope of 1.1. Note that these are the magnitudes of negative slopes; the lower the slope, the more efficiently the vocabulary is tokenized.

F.5 Recurring Chinese sentence
Here, we explain the recurring sentence in this paper. In Table 1 we had the SC sentence 维护发展中国家共同利益, which means "safeguarding the common interests of developing countries". It is pronounced wéi hù fā zhǎn zhōng guó jiā gòng tóng lì yì in Mandarin. Its correct TC conversion is 維護發展中國家共同利益, pronounced wai4 wu6 faat3 zin2 zung1 gwok3 gaa1 gung6 tung4 lei6 jik1 in Cantonese (the numerals indicate the tones).