A Transition-based Model for Joint Segmentation, POS-tagging and Normalization

We propose a transition-based model for joint word segmentation, POS tagging and text normalization. Unlike previous methods, the model can be trained on standard text corpora, overcoming the lack of annotated microblog corpora. To evaluate our model, we develop an annotated corpus based on microblogs. Experimental results show that our joint model improves the performance of word segmentation on microblogs, giving a 12.02% error reduction in segmentation accuracy compared to the traditional approach.


Introduction
Microblogs, such as Twitter, SMS and Weibo, have become an important research topic in NLP. Previous work has shown that off-the-shelf NLP tools perform poorly on microblogs (Foster et al., 2011; Gimpel et al., 2011; Han and Baldwin, 2011). One of the major challenges for microblog processing is informal words. For example, "tmrw" is frequently used in tweets for "tomorrow", causing out-of-vocabulary (OOV) problems.
Text normalization has been introduced as a pre-processing step for microblog processing, which transforms informal words into their standard forms. Most work in the literature focuses on English microblog normalization, treating it as a noisy channel problem (Pennell and Liu, 2014; Cook and Stevenson, 2009; Yang and Eisenstein, 2013) or a translation problem (Aw et al., 2006; Contractor et al., 2010; Li and Liu, 2012; Zhang et al., 2014c), and training models based on words.
Due to the lack of annotated corpora, text normalization is more challenging for Chinese. Unlike English, Chinese informal words are more difficult to normalize mechanically for two main reasons. First, Chinese does not have word delimiters. Second, Chinese informal words manifest diversity, including abbreviations, neologisms, unconventional spellings and phonetic substitutions. Intuitively, there is a mutual dependency between Chinese word segmentation and normalization, and therefore the two tasks should be solved jointly. Prior work proposed a joint model for word segmentation and informal word detection; however, text normalization was not included in that joint model. Kaji et al. (2014) proposed a joint model for word segmentation, POS tagging and normalization for Japanese microblogs, which was trained on a partially annotated microblog corpus. Their method requires special annotation for text normalization, which can be expensive.
In this paper, we propose a joint model for Chinese text normalization, word segmentation and POS tagging, which can be trained using standard segmentation and POS tagging annotation, overcoming the lack of an annotated corpus of Chinese microblogs. Our model is based on Zhang and Clark (2010), with an extended set of transition actions to handle joint normalization. In our model, word segmentation and POS tagging are performed on normalized text transformed from the informal text. Assuming that the majority of informal words can be normalized into formal equivalents (Han et al., 2012; Li and Yarowsky, 2008), we look up the standard forms of informal words in an automatically constructed normalization dictionary. To evaluate our model, we developed an annotated corpus of microblog texts. Results show that our model achieves the best performance on all three tasks compared with several baseline systems.

Text Normalization
Text normalization is a relatively new research topic. There are no precise definitions of a text normalization task that are widely accepted by researchers. The task is generally divided into three categories: lexical-level, sentence-level and discourse-level normalization. In this paper we focus on lexical-level normalization, which aims to transform informal words into their standard forms.
Lexical normalization can be regarded as a spelling correction problem. However, research on spelling correction focuses on typographic and cognitive/orthographic errors (Kukich, 1992), while text normalization focuses on lexical variants, such as phonetic substitutions, abbreviations and paraphrases.
Unlike English, for which informal words are detected according to whether they are out of vocabulary, Chinese informal words manifest diversity. Prior work divided informal words into three types: phonetic substitutions, abbreviations and neologisms. Li and Yarowsky (2008) classified them into four types: homophones, abbreviations, transliterations and others. Because of these varied characteristics, they normalize informal words by training a separate model per type, increasing system complexity.
Research shows that most lexical variants have an unambiguous standard form (Han et al., 2012; Li and Yarowsky, 2008). The validity of this assumption is also assessed empirically on our corpus annotation in Section 6.1. Based on this assumption, we look up the standard forms of informal words in a constructed normalization dictionary, avoiding having to model the diversity of informal words.

Transition-based Segmentation
We adapt the segmenter of Zhang and Clark (2007) as our baseline segmenter. Given an input sentence x, the baseline segmenter finds the segmentation y* = argmax_{y ∈ Gen(x)} Score(y), where Gen(x) denotes the set of all possible segmentations for the input sentence and Score(y) is the model score of a candidate segmentation. Zhang and Clark (2007) proposed a graph-based scoring model, with features based on complete words and word sequences. We adapt their method slightly, under a transition-based framework (Zhang and Clark, 2011), which gives us a consistent way of defining all models in this paper. Here a transition model is defined as a quadruple M = (C, T, W, C_t), where C is a state space, T is a set of transitions, each of which is a function C → C, W is an input sentence c_1 ... c_n, and C_t is a set of terminal states. The model scores an output by scoring the corresponding transition sequence. As shown in Figure 1, a state is a tuple ST = (S, Q), where S contains the partially segmented words and Q = (c_i, c_i+1, ..., c_n) is the sequence of input characters that have not been processed. When the character c_i is being processed, the transition system applies one of two actions, defined as follows:
(1) APP(c_i), removing c_i from Q and appending it to the last (partial) word in S.
(2) SEP(c_i), removing c_i from Q, marking the last word in S as complete, and adding c_i as a new partial word.
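The two actions above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the `State` class and `segment` helper are our own names, and the example uses Latin characters for readability.

```python
# A minimal sketch of the character-based transition system described above.
# S holds the partially segmented words; Q holds unprocessed characters.

class State:
    def __init__(self, queue):
        self.words = []            # S: segmented (partial) words
        self.queue = list(queue)   # Q: unprocessed characters

    def app(self):
        """APP(c_i): pop c_i from Q and append it to the last partial word."""
        self.words[-1] += self.queue.pop(0)

    def sep(self):
        """SEP(c_i): pop c_i from Q, finish the last word, start a new one."""
        self.words.append(self.queue.pop(0))

def segment(chars, actions):
    """Replay an action sequence ('A' = APP, 'S' = SEP) over the input."""
    state = State(chars)
    for a in actions:
        state.sep() if a == 'S' else state.app()
    return state.words

# e.g. segmenting the 4-character input "abcd" as "ab | cd"
print(segment("abcd", "SASA"))  # ['ab', 'cd']
```

Note that the first action must always be SEP, since S is initially empty and APP has no partial word to extend.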

Joint Segmentation and Normalization
Our SN model extends the transition-based segmentation model. In addition to the actions APP and SEP, the transition system contains a SEPS action, which substitutes the standard form for an informal word on the top of S if it exists in the normalization dictionary. Figure 2 gives a normalization transition process for the sentence "工作鸭梨大 (How great work pressure is!)", in which "鸭梨 (yālí, pear)" is an informal form of "压力 (pressure)". While processing the character "大 (big)", the following actions can be applied.
Lexical substitution is based on a normalization dictionary whose entries consist of <lexical variant, standard form> pairs. The output is a pair of labeled sequences: the informal labeled sequence and the corresponding formal labeled sequence. To rank candidates, both labeled sequences could be scored. However, lacking annotated corpora of informal text, we use only the score of the formal labeled sequence in our model. The advantage is that we can train our model using a standard corpus alone, overcoming the lack of annotated corpora of informal text.
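The paired output can be sketched as follows. The dictionary entries below are invented English examples for readability; the actual dictionary contains Chinese <variant, standard form> pairs, and only the formal sequence would be scored by the model.

```python
def normalize_pair(words, norm_dict):
    """Return the <informal, formal> word-sequence pair; only the formal
    sequence is scored by the model, as described above."""
    formal = [norm_dict.get(w, w) for w in words]
    return words, formal

informal, formal = normalize_pair(
    ["see", "u", "tmrw"], {"u": "you", "tmrw": "tomorrow"})
print(formal)  # ['see', 'you', 'tomorrow']
```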

Training and Decoding
We apply the global training and beam-search decoding framework of Zhang and Clark (2011). The decoder uses an agenda to keep the N-best states during the incremental process. Before decoding starts, the agenda is initialized with the initial state. When a character is processed, existing states are removed from the agenda and extended with all possible actions, and the N-best newly generated states are put back onto the agenda. After all states have become terminal, the highest-scored state in the agenda is taken as the output.
Algorithm 1 shows pseudocode for the decoder. ADDITEM adds a new item to the agenda, N-BEST returns the N highest-scored items from the agenda, and BEST returns the highest-scored item from the agenda. GETNWORD returns the set of possible standard forms of the last word, looked up in the normalization dictionary. APP appends a character to the last word in a state, SEP adds a character as the start of a new word in a state, and SEPS performs SEP and replaces the last word with a possible standard form.
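The decoder can be sketched compactly as below. This is a simplified rendering of Algorithm 1 under our own assumptions: the scoring function is a placeholder for the trained perceptron model, and the names `decode` and `toy_score` are ours.

```python
import heapq

def decode(chars, norm_dict, score, beam_size=16):
    # a state is (score, informal words, formal words)
    agenda = [(0.0, (), ())]
    for c in chars:
        candidates = []
        for s, words, formal in agenda:
            if words:
                # APP: extend the last partial word
                w = words[:-1] + (words[-1] + c,)
                f = formal[:-1] + (formal[-1] + c,)
                candidates.append((s + score('APP', f), w, f))
                # SEPS: complete the last word and normalize it (GETNWORD)
                for std in norm_dict.get(words[-1], ()):
                    f = formal[:-1] + (std, c)
                    candidates.append((s + score('SEPS', f), words + (c,), f))
            # SEP: start a new word
            f = formal + (c,)
            candidates.append((s + score('SEP', f), words + (c,), f))
        # keep the N best states (N-BEST)
        agenda = heapq.nlargest(beam_size, candidates, key=lambda t: t[0])
    return max(agenda, key=lambda t: t[0])  # BEST

def toy_score(action, formal_seq):
    # placeholder: mildly prefer longer completed words
    return 0.1 * len(formal_seq[-1])

best = decode("seeu", {"u": ("you",)}, toy_score, beam_size=4)
```

In the real system the score is a global linear model over the features of Section 3.4, updated with the averaged perceptron during training.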

Features
In the experiments, we use the segmentation feature templates of Zhang and Clark (2011). These features are effective for segmentation of formal text. However, for text normalization, they carry insufficient information: our experiments show that using Zhang and Clark's features alone, the F-score on normalization is only 0.4207.
Prior work has shown that statistical language model information is important for text normalization (Li and Yarowsky, 2008; Kaji and Kitsuregawa, 2014). We therefore extract language model features using a word-based language model learned from a large quantity of standard text. In particular, 1-gram, 2-gram and 3-gram features are extracted. Each type of n-gram is divided into ten probability ranges. For example, if the probability of the word bigram meaning "high pressure" falls in the 2nd range, the feature is represented as "word-2-gram=2".
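The binning can be sketched as follows. The paper specifies ten probability ranges but not their boundaries, so uniform bins in log10 probability are our assumption, and the helper names are ours.

```python
def lm_bin(logprob, lo=-10.0, hi=0.0, n_bins=10):
    """Map a log10 probability to a range index (1 = most probable)."""
    if logprob >= hi:
        return 1
    if logprob <= lo:
        return n_bins
    frac = (hi - logprob) / (hi - lo)
    return min(n_bins, 1 + int(frac * n_bins))

def ngram_features(ngrams, lm_logprob):
    """Emit features such as 'word-2-gram=2' for each n-gram."""
    return [f"word-{len(g)}-gram={lm_bin(lm_logprob(g))}" for g in ngrams]

# toy language model: one bigram with log10 probability -1.5 (2nd range)
print(ngram_features([("high", "pressure")], lambda g: -1.5))  # ['word-2-gram=2']
```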
In our experiments, language models are trained on the Gigaword corpus using the SRILM toolkit. To train a word-based language model, we segmented the corpus using our re-implementation of Zhang and Clark (2010). Results show that language model information not only improves the performance of text normalization, but also increases the performance of word segmentation.

Extension for Joint Segmentation, Normalization and POS Tagging

Joint Segmentation and POS Tagging
In order to reduce the error propagation of word segmentation, joint models have been applied to several NLP tasks, such as POS tagging (Zhang and Clark, 2010; Kruengkrai et al., 2009) and parsing (Zhang et al., 2014a; Qian and Liu, 2012; Zhang et al., 2014b). We take the joint word segmentation and POS tagging model of Zhang and Clark (2010) as the joint baseline. It extends the transition-based segmenter by adding POS arguments to the original actions. In Figure 1, when the current character c_i is being processed, the transition system for ST operates as follows:
(1) APP(c_i), removing c_i from Q and appending it to the last (partial) word in S, which keeps its POS tag.
(2) SEP(c_i, pos), removing c_i from Q, marking the last word in S as complete, and adding c_i as a new partial word with POS tag "pos".
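The POS-augmented actions can be sketched by carrying a tag alongside each word. As with the earlier sketches, this is illustrative rather than the authors' code, and uses Latin characters and invented tags.

```python
def step(state, action, char, pos=None):
    """state is a list of (word, tag) pairs; APP keeps the last word's tag,
    SEP starts a new word carrying the given POS tag."""
    if action == 'APP':
        w, t = state[-1]
        return state[:-1] + [(w + char, t)]
    if action == 'SEP':
        return state + [(char, pos)]
    raise ValueError(action)

s = []
for action, char, pos in [('SEP', 'a', 'NN'), ('APP', 'b', None), ('SEP', 'c', 'VA')]:
    s = step(s, action, char, pos)
print(s)  # [('ab', 'NN'), ('c', 'VA')]
```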

Joint Segmentation, Normalization and POS Tagging
Our joint model extends the model of Zhang and Clark (2010) by adding a SEPS action, which substitutes the formal word for the last word in S if it exists in the dictionary. Alternatively, it can be regarded as an extension of the joint segmentation and normalization model, adding POS arguments to the original actions.
Using the same example shown in Figure 2, the following actions can be applied for the character "大 (big)":
(1) APP("大 (big)"), appending "大 (big)" to the last word "鸭梨 (yālí, pear)" in the informal labeled sequence, which keeps the same POS tag "NN".
(2) SEP("大 (big)", VA), marking the last word "鸭梨 (yālí, pear)" in the informal labeled sequence as a completed word and adding "大 (big)" as a new partial word with POS tag "VA".
We use the same training and decoding framework for our joint segmentation, normalization and POS tagging model, as described in Section 3.3.

Construction of Normalization Dictionary
Although large-scale normalization dictionaries are difficult to obtain, informal/formal relations can be extracted from large-scale web corpora (Li and Yarowsky, 2008), and informal words are mainly derived through fixed word-formation patterns. In this paper, we adopt two methods to construct a normalization dictionary. The first method is to extract informal/formal pairs from large-scale text. In general, many informal and formal words co-occur in the same texts or in similar contexts, and we can find their relations with text patterns. As shown in Table 1, the first example follows a definition pattern in which the formal word is followed by a Chinese expression meaning "is also referred to as" and then the informal word, while the second example follows the pattern "informal(formal)". This gives us a reliable way to seed and bootstrap a list of informal/formal pairs.
In the experiments, we constructed a normalization dictionary consisting of 32,787 informal/formal word pairs in total. The dictionary is used to informalize the formal training data for the joint segmentation and normalization systems: 25% of the formal words covered by the dictionary are replaced with their informal equivalents.
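The training-data construction above can be sketched as follows. The paper states only the 25% replacement rate; the per-token sampling scheme, the fixed seed, and the function names are our assumptions, and the toy mapping is invented.

```python
import random

def informalize(sentences, formal_to_informal, rate=0.25, seed=0):
    """Replace each dictionary-covered formal word with an informal
    variant with the given probability."""
    rng = random.Random(seed)
    out = []
    for sent in sentences:
        out.append([
            rng.choice(formal_to_informal[w])
            if w in formal_to_informal and rng.random() < rate else w
            for w in sent
        ])
    return out

corpus = [["tomorrow", "is", "great"]]
mapping = {"tomorrow": ["tmrw"], "great": ["gr8"]}
print(informalize(corpus, mapping, rate=1.0))  # [['tmrw', 'is', 'gr8']]
```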

Microblog Corpus Annotation
To evaluate our model, we develop a microblog corpus. Our annotated corpus is collected from Sina Weibo, the largest microblogging platform in China. More than 1,000,000 Chinese posts were crawled using the Sina Weibo API, from which 4,000 posts were randomly selected. Following Wang et al. (2012), we apply rules to preprocess the corpus, treating URLs, emoticons, @usernames and hashtags as pre-segmented words.
As a result, we obtain 2,000 sentences as the source of the corpus. Two human annotators annotated the 2,000 sentences using tools we developed, which can simultaneously annotate word boundaries, POS tags and text normalization. We used the CTB scheme for word segmentation and POS tagging. We divided informal words into three types: phonetic substitutions, abbreviations and paraphrases. In total, we annotated 1,129 informal word pairs in the 2,000 sentences, containing 658 distinct informal words. Table 3 shows the frequency distribution and annotation agreement over the three types of informal words in the corpus. Cohen's kappa is 0.95 for informal word annotation, which shows that it is easy for humans to distinguish informal words, and validates our assumption that an informal word generally has one formal equivalent.

Settings and Measures
Our model is trained on the Chinese Treebank (CTB) 7, a large, word-segmented, POS-tagged and fully bracketed Chinese news corpus. The annotated microblog corpus is randomly divided into two parts: 1,000 sentences for development and 1,000 sentences for testing.
The standard F-measure is used to evaluate the accuracy of word segmentation, POS tagging and text normalization, where F = 2PR/(P+R). In addition, we use recall rates to evaluate the identification accuracy of formal, informal and all words. The recall rate of formal words (N-R) is the percentage of gold-standard formal words that are correctly segmented; the recall rate of informal words (I-R) is the percentage of gold-standard informal words that are correctly segmented; and the recall rate of all words (ALL-R) is the percentage of all gold-standard words that are correctly segmented.
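These measures can be sketched as follows. Comparing words as character spans is the standard convention for segmentation evaluation; the helper names are ours.

```python
def spans(words):
    """Turn a word sequence into (start, end) character spans."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, pred_words):
    """Precision, recall and F = 2PR/(P+R) over word spans."""
    g, p = set(spans(gold_words)), set(spans(pred_words))
    correct = len(g & p)
    prec, rec = correct / len(p), correct / len(g)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def subset_recall(gold_words, pred_words, keep):
    """Recall over gold words selected by keep(), e.g. informal words (I-R)."""
    gold = [s for s, w in zip(spans(gold_words), gold_words) if keep(w)]
    pred = set(spans(pred_words))
    return sum(s in pred for s in gold) / len(gold) if gold else 0.0

print(prf(["ab", "c", "d"], ["ab", "cd"]))  # P = 0.5, R ≈ 0.33, F ≈ 0.4
```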

Joint Segmentation and Normalization
Our development set is used to decide the beam size and the number of training iterations. The best performance on the development set is obtained with a beam size of 16 and 32 training iterations.
Comparison with the pipeline. We investigate the influence of the language model and analyze the results compared to the baseline. Table 4 shows the results on the development and test sets, where SN is the joint model and S;N is the pipeline model. Our SN model performs better on segmentation than the pipeline S;N model, demonstrating the effectiveness of normalization. Table 5 shows the accuracies (i.e., recall rates) of formal and informal word identification on the development set. After normalization, the accuracy of informal word identification improves substantially, and the accuracy of formal word identification also increases. This shows that formal words can be better recognized when informal words are identified correctly, demonstrating that text normalization is effective for both informal and formal words.
The effect of the language model. From Table 4, we observe that performance increases when using language model features; the normalization accuracy in particular improves significantly, indicating that statistical language model knowledge plays an important role in text normalization. With language model features, our SN model improves more in segmentation F-score than the baseline system. Furthermore, the language model features are also helpful for identifying formal words, as shown in Table 5. The identification accuracy of informal words increases for the SN model, while it decreases for the S;N model. Because of their relatively low frequency, informal words receive low language model scores on informal text, resulting in incorrect word segmentations. This illustrates that our joint model is more suitable for microblogs than the pipeline method.

Joint Segmentation, Normalization and POS tagging
We compare the following models on word segmentation, text normalization and POS tagging.
ST Our re-implementation of Zhang and Clark (2010). We use it to investigate how the joint model improves the accuracy of word segmentation and POS tagging in the microblog domain.
S;N;T A pipeline method for segmentation, normalization and POS tagging. The segmentation model does not use POS features; the normalization model uses segmentation information, but no POS features; the POS tagging model takes segmented text as input.
SN;T Another pipeline method that first performs segmentation and normalization jointly, then performs POS tagging. The SN model does not use POS features, and the POS tagging model takes segmented text as input.
SNT Our joint segmentation, normalization and POS tagging model.
Table 6 shows the final results on the test set. Previous work has shown that systems trained on news data give poor word segmentation and POS tagging accuracies in the microblog domain. As shown in Table 6, the F-scores of segmentation and POS tagging are 0.902 and 0.8163, respectively, using the Stanford segmenter and POS tagger.

Results
Comparing ST and SNT, we find that text normalization can enhance word segmentation and POS tagging in the microblog domain. SNT achieves larger improvements over the baseline with language model features, reducing segmentation errors by 12.02% and POS tagging errors by 3.63%.
Another goal of the experiments is to determine whether the three tasks benefit from each other. Comparing SN;T to S;N;T shows that performance increases with joint segmentation and normalization, indicating that segmentation and text normalization benefit from each other. Moreover, our SNT model yields better performance than SN;T, indicating that POS features are effective for segmentation and text normalization; hence the three tasks benefit from each other.
The effect of the normalization dictionary. The dictionary plays an important role in our model, reducing the number of OOV words. Intuitively, performance is higher when the coverage of the dictionary is larger. In the experiments, the coverage of our dictionary on the development and test sets is 45.8% and 48.2%, respectively.
To investigate the effect of the dictionary on our model, we manually construct ten dictionaries from our development data, with coverage between 10% and 100%. Figure 3 shows the F-score curves on the test set for segmentation and POS tagging for both the SNT+lm and ST+lm models with different dictionaries. As the coverage of the dictionaries increases from 10% to 100%, the F-score generally increases. When the coverage is greater than about 20%, the F-score of the joint model is higher than that of the baseline model.

Error Analysis
We found two major categories of errors. Abbreviations are sometimes incorrectly normalized, especially when an informal word maps to more than one formal word. For example, the informal word "美偶" maps to "美国偶像" (American Idol), which consists of two words: "美国" (American) and "偶像" (idol); our model cannot normalize "美偶" in the experiments. Another type of error involves phonetic substitutions of numbers, which are sometimes identified incorrectly. For example, "7456" is identified as a number in the experiments, but it means "气死我了" (I'm so angry). Solving this problem requires more context information.

Results of Lexical Normalization
It is interesting to explore how well the joint model normalizes informal words. We compare our results with two existing text normalization systems on our annotated microblog corpus.
(1) WangDT We re-implement the decision-tree method of Wang et al., which formalizes the task as a classification problem and proposes rule-based and statistical features to model three plausible channels that explain the connection between formal and informal pairs. We use a single decision tree classifier in the experiments.
(2) LYTop1 Li and Yarowsky (2008) formalize the task as a ranking problem and propose a conditional log-linear model for normalization. In the experiments, we select the top-1 candidate as the standard form of an informal word.
We use the same split, with 1,000 sentences for training and 1,000 for testing. The training data are used for both WangDT and LYTop1. We re-segment the corpus using the Stanford tools for the two baselines. WangDT uses a CRF to detect informal words, and LYTop1 uses the informal words detected by our joint model.
Although the comparison is somewhat unfair to the two baselines, since our joint model uses external knowledge in the form of the normalization dictionary, the experiments are still informative. Table 7 shows the normalization results of the different systems. Our model performs best among the three systems; in particular, the precision of our SNT model improves upon the baselines significantly. The main reason is that our model uses global features over whole sentences, while the two baselines use local window features.

Related Work
There has been much work on text normalization. The task is generally treated as a noisy channel problem (Pennell and Liu, 2014; Cook and Stevenson, 2009; Yang and Eisenstein, 2013; Sonmez and Ozgur, 2014) or a translation problem (Aw et al., 2006; Contractor et al., 2010; Li and Liu, 2012; Zhang et al., 2014c). For English, most recent work (Han and Baldwin, 2011; Gouws et al., 2011; Han et al., 2012) uses two-step unsupervised approaches that first detect and then normalize informal words, aiming to produce and use informal/formal word lexicons and mappings.
For Chinese informal text, Wong and Xia (2008) address the problem of informal words in bulletin board system (BBS) chats by employing pattern matching. Xia et al. (2005) also use SVM-based classification to recognize Chinese informal sentences in chats. Both methods have their advantages: the learning-based method does better on recall, while pattern matching performs better on precision. Li and Yarowsky (2008) tackle the problem of identifying informal/formal Chinese word pairs by generating candidates from the Baidu search engine and ranking them with a conditional log-linear model. Zhang et al. (2014c) analyze the phenomenon of mixed text in Chinese microblogs, proposing a two-stage method to normalize mixed texts. However, their models employ pipelined word segmentation, resulting in reduced performance. Prior work proposed a joint model for word segmentation and informal word detection; however, text normalization was treated as a separate task. Our joint model processes word segmentation, POS tagging and normalization simultaneously. Kaji et al. (2014) propose a joint model for word segmentation, POS tagging and normalization for Japanese microblogs; their model is trained on a partially annotated microblog corpus. In contrast, our model can be trained on existing annotated corpora of standard text.
Researchers have recently developed various microblog corpora annotated with rich linguistic information. Gimpel et al. (2011) and Foster et al. (2011) annotate English microblog posts with POS tags. Han and Baldwin (2011) release a microblog corpus annotated with normalized words. Duan et al. (2012) develop a Chinese microblog corpus annotated with segmentation for the SIGHAN bakeoff. Prior work has also released a Chinese microblog corpus for word segmentation and informal word detection. However, there are no microblog corpora annotated with Chinese word segmentation, POS tags and normalized sentences.
Our work is also related to work on word segmentation (Zhang and Clark, 2007; Zhang et al., 2013; Chen et al., 2015) and joint word segmentation and POS tagging (Jiang et al., 2008; Zhang and Clark, 2010). A comprehensive survey is out of the scope of this paper, but interested readers can refer to Pei et al. (2014) for a recent literature review of the field.
To evaluate our model, we develop an annotated microblog corpus with word segmentation, POS tags, and normalization. Furthermore, we train our model by using a standard segmented and POS tagged corpus. We also present a comprehensive evaluation in terms of precision and recall on our microblog test corpus. Such an evaluation has not been conducted in previous work due to the lack of annotated corpora for Chinese microblogs.

Conclusion
We proposed a joint model of word segmentation, POS tagging and normalization, in which the three tasks benefit from each other. The model is trained on standard corpora, so there is no need for annotated microblog training data. The results demonstrate that text normalization improves the performance of word segmentation and POS tagging on microblogs, and that our model benefits from statistical language model information, which cannot be used directly for segmenting and tagging microblogs because of the relatively low frequency of informal words.
In our model, lexical substitution is based on a normalization dictionary, which avoids modeling the diversity of informal words and simplifies the problem for real-world applications. The code of the joint model and the data set are published at https://github.com/qtxcm/JointModelNSP.