Universal Dependencies Parsing for Colloquial Singaporean English

Singlish can be interesting to the ACL community both linguistically, as a major creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. We investigate dependency parsing of Singlish by constructing a dependency treebank under the Universal Dependencies scheme, and then training a neural network model that integrates English syntactic knowledge into a state-of-the-art parser trained on the Singlish treebank. Results show that English knowledge can lead to a 25% relative error reduction, resulting in a parser with 84.47% accuracy. To the best of our knowledge, we are the first to use neural stacking to improve cross-lingual dependency parsing on low-resource languages. We make both our annotation and parser available for further research.


Introduction
Languages evolve temporally and geographically, both in vocabulary and in syntactic structure. When major languages such as English or French are adopted in another culture as the primary language, they often mix with existing languages or dialects in that culture and evolve into a stable language called a creole. Examples of creoles include the French-based Haitian Creole and Colloquial Singaporean English (Singlish) (Mian-Lian and Platt, 1993), an English-based creole. While the majority of natural language processing (NLP) research attention has been focused on the major languages, little work has been done on adapting NLP components to creoles. One notable body of work originated from the featured translation task of the EMNLP 2011 Workshop on Statistical Machine Translation (WMT11), which was to translate Haitian Creole SMS messages sent during the 2010 Haitian earthquake. This work highlights the importance of NLP tools on creoles in crisis situations for emergency relief (Hu et al., 2011; Hewavitharana et al., 2011).
Singlish is one of the major languages in Singapore, with vocabulary and grammar 1 borrowed from a number of languages including Malay, Tamil, and Chinese dialects such as Hokkien, Cantonese and Teochew (Leimgruber, 2009, 2011), and it has been increasingly used in written form on web media. Fluent English speakers unfamiliar with Singlish find the creole hard to comprehend (Harada, 2009). Correspondingly, fundamental English NLP components such as POS taggers and dependency parsers perform poorly on such Singlish texts, as shown in Tables 2 and 4. For example, Seah et al. (2015) adapted the Socher et al. (2013) sentiment analysis engine to the Singlish vocabulary, but failed to adapt the parser. Since dependency parsers are important for tasks such as information extraction (Miwa and Bansal, 2016) and discourse parsing (Li et al., 2015), this hinders the development of such downstream applications for written Singlish and thus makes it crucial to build a dependency parser that performs well natively on Singlish.
To address this issue, we start by investigating the linguistic characteristics of Singlish, and specifically the causes of difficulty in analyzing Singlish with English syntax. We found that, despite inheriting a large portion of its basic vocabulary and grammar from English, Singlish not only imports terms from regional languages and dialects; its lexical semantics and syntax also deviate significantly from English (Leimgruber, 2009, 2011). We categorize the challenges and formalize their interpretation using Universal Dependencies (Nivre et al., 2016), which extends to the creation of a Singlish dependency treebank with 1,200 sentences.
Based on the intricate relationship between Singlish and English, we build a Singlish parser by leveraging knowledge of English syntax as a basis. The overall approach is illustrated in Figure 1. In particular, we train a basic Singlish parser with the best off-the-shelf neural dependency parsing model using biaffine attention (Dozat and Manning, 2017), and improve it with knowledge transfer by adopting neural stacking (Chen et al., 2016; Zhang and Weiss, 2016) to integrate English syntax. Since POS tags are important features for dependency parsing (Chen and Manning, 2014), we train a POS tagger for Singlish following the same idea, integrating English POS knowledge using neural stacking. Results show that English syntactic knowledge brings 51.50% and 25.01% relative error reduction on POS tagging and dependency parsing respectively, resulting in a Singlish dependency parser with 84.47% unlabeled attachment score (UAS) and 77.76% labeled attachment score (LAS).
We make our Singlish dependency treebank, the source code for training a dependency parser, and the trained model for the best-performing parser freely available online 2 . 2 https://github.com/wanghm92/Sing_Par

Related Work

Neural networks have led to significant advances in dependency parsing performance, including transition-based parsing (Chen and Manning, 2014; Zhou et al., 2015; Weiss et al., 2015; Andor et al., 2016) and graph-based parsing (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017). In particular, the biaffine attention method of Dozat and Manning (2017) uses deep bi-directional long short-term memory (bi-LSTM) networks for high-order non-linear feature extraction, producing the highest-performing graph-based English dependency parser. We adopt this model as the basis for our Singlish parser.
Our work belongs to a line of work on transfer learning for parsing, which leverages English resources in Universal Dependencies to improve the parsing accuracy of low-resource languages (Hwa et al., 2005; Cohen and Smith, 2009; Ganchev et al., 2009). Seminal work employed statistical models. McDonald et al. (2011) investigated delexicalized transfer, where word-based features are removed from a statistical model for English, so that POS and dependency label knowledge can be utilized for training a model for a low-resource language. Subsequent work considered syntactic similarities between languages for better feature transfer (Täckström et al., 2012; Naseem et al., 2012; Zhang and Barzilay, 2015).
Recently, a line of work leverages neural network models for multi-lingual parsing (Guo et al., 2015;Duong et al., 2015;Ammar et al., 2016). The basic idea is to map the word embedding spaces between different languages into the same vector space, by using sentence-aligned bilingual data. This gives consistency in tokens, POS and dependency labels thanks to the availability of Universal Dependencies (Nivre et al., 2016). Our work is similar to these methods in using a neural network model for knowledge sharing between different languages. However, ours is different in the use of a neural stacking model, which respects the distributional differences between Singlish and English words. This empirically gives higher accuracies for Singlish.
Neural stacking was previously used for cross-annotation (Chen et al., 2016) and cross-task (Zhang and Weiss, 2016) joint modelling on monolingual treebanks. To the best of our knowledge, we are the first to employ it for cross-lingual feature transfer from resource-rich languages to improve dependency parsing for low-resource languages. Besides these three dimensions of dealing with heterogeneous text data, another popular area of research is domain adaptation, which is commonly associated with cross-lingual problems (Nivre et al., 2007). While this large strand of work is remotely related to ours, we do not describe it in detail.
Unsupervised rule-based approaches also offer a competitive alternative for cross-lingual dependency parsing (Naseem et al., 2010; Gillenwater et al., 2010; Gelling et al., 2012; Søgaard, 2012a,b; Martínez Alonso et al., 2017), and have recently been benchmarked on the Universal Dependencies formalism by exploiting the linguistic constraints in Universal Dependencies to improve robustness against error propagation and domain shift (Martínez Alonso et al., 2017). However, we choose a data-driven supervised approach given its relatively higher parsing accuracy, owing to the availability of large treebanks from the Universal Dependencies project.

Universal Dependencies for Singlish
Since English is the major genesis of Singlish, we choose English as the source of lexical feature transfer to assist Singlish dependency parsing. Universal Dependencies provides a set of multilingual treebanks with cross-lingually consistent, dependency-based, lexicalist annotations, designed to aid the development and evaluation of cross-lingual systems such as multilingual parsers (Nivre et al., 2016). The current version of Universal Dependencies comprises not only major treebanks for 47 languages but also their siblings for domain-specific corpora and dialects. Given this initiative of creating transfer-learning-friendly treebanks, we adopt the Universal Dependencies protocol for constructing the Singlish dependency treebank, both as a new resource for a low-resource language and to facilitate knowledge transfer from English.
On top of the general Universal Dependencies guidelines, English-specific dependency relation definitions, including additional subtypes, are employed as the default standard for annotating the Singlish dependency treebank, unless augmented or redefined when necessary. The latest English UD treebank is based on the English Web Treebank (Bies et al., 2012), comprising web media texts, which potentially smooths the knowledge transfer to our target Singlish texts in similar domains. The statistics of this dataset, from which we obtain English syntactic knowledge, are shown in Table 1; we refer to this corpus as UD-Eng. It uses 47 dependency relations, and we show below how to conform to the same standard while adapting to unique Singlish grammars.

Challenges and Solutions for Annotating Singlish
The deviations of Singlish from English occur at both the lexical and the grammatical levels (Leimgruber, 2009, 2011), which brings challenges for analyzing Singlish with English NLP tools. The former involves vocabulary imported from the first languages of the local people, and the latter can be represented by a set of relatively localized features which collectively form 5 unique grammars of Singlish according to Leimgruber (2011). We find empirically that all these deviations can be accommodated by applying the existing English dependency relation definitions while ensuring consistency with the annotations in other non-English UD treebanks, as explained with examples below.

Imported vocabulary: Singlish borrows a number of words and expressions from its non-English origins (Leimgruber, 2009, 2011), such as "Kiasu", which originates from Hokkien and means "very anxious not to miss an opportunity". 4 These imported terms often constitute out-of-vocabulary (OOV) words with respect to a standard English treebank and cause difficulties when using English-trained tools on Singlish. All borrowed words are annotated based on their usage in Singlish, which mainly inherits the POS from their genesis languages; Appendix A summarizes all borrowed terms in our treebank.

Topic-prominence: Sentences of this type start by establishing their topic, which often serves as the default that the rest of the sentence refers to, and they typically employ an object-subject-verb sentence structure (Leimgruber, 2009, 2011). Three subtypes of topic-prominence are observed in the Singlish dependency treebank, and their annotations are addressed as follows. First, topics framed as clausal arguments at the beginning of the sentence are labeled as "csubj" (clausal subject), as shown by "Drive this car" in (1) of Figure 2, which is consistent with the dependency relations in its Chinese translation.
Second, noun phrases used to modify the predicate with the absence of a preposition is regarded as a "nsubj" (nominal subject). Similarly, this is a common order of words used in Chinese and one example is the "SG" of (2) in Figure 2.
Third, prepositional phrases moved in front are still treated as "nmod" (nominal modifier) of their intended heads, following the exact definition but as a Singlish-specific form of exemplification, as shown by the "Inside tent" of (3) in Figure 2.
The "dislocated" (dislocated elements) relation in UD is also used for preposed elements, but it captures those "that do not fulfill the usual core grammatical relations of a sentence" and is "not for a topic-marked noun that is also the subject of the sentence" (Nivre et al., 2016). In the three scenarios above, the topic words or phrases stand in relatively closer grammatical relations to the predicate, as subjects or modifiers.
Copula deletion: Imported from the corresponding Chinese sentence structure, the copula verb is often optional or even deleted in Singlish, which is one of its diagnostic characteristics (Leimgruber, 2009, 2011). In the UD-Eng standards, predicative "be" is the only verb used as a copula, and it often depends on its complement to avoid a copular head. This is explicitly designed in UD to promote parallelism for the zero-copula phenomenon in languages such as Russian, Japanese, and Arabic. The deleted copula and its "cop" (copula) arcs are simply ignored, as shown by (4) in Figure 2.
NP deletion: Noun-phrase (NP) deletion often results in null subjects or objects. It may be regarded as a branch of topic-prominence but is a distinctive feature of Singlish with a relatively high frequency of usage (Leimgruber, 2011). NP deletion is also common in pronoun-dropping languages such as Spanish and Italian, where the anaphora can be morphologically inferred. For example, in "Vorrei ora entrare brevemente nel merito." 5 from the Italian treebank in UD, "Vorrei" means "I would like to" and depends on the sentence root, "entrare", with the "aux" (auxiliary) relation; the subject "I" is absent but implicitly understood. Similarly, we do not recover such relations, since the deleted NP imposes negligible alteration to the dependency tree, as exemplified by (5) in Figure 2.
Inversion: Inversion in Singlish involves either keeping the subject and verb of interrogative sentences in the same order as in statements, or tag questions in polar interrogatives (Leimgruber, 2011). The former also exists in non-English languages such as Spanish and Italian, where the subject can precede the verb in questions (Lahousse and Lamiroy, 2012). This simply involves a change of word order and thus requires no special treatment. Tag questions, on the other hand, are analyzed in two scenarios. One type takes the form of "isn't it?" or "haven't you?"; these are dependents of the sentence root with the "parataxis" relation. 6 The other type is exemplified by "right?" and its Singlish equivalent "tio boh?" (a transliteration from Hokkien), which are labeled with the "discourse" (discourse element) relation with respect to the sentence root. See example (6) in Figure 2.
Discourse particles: The use of clause-final discourse particles, which originate from Hokkien and Cantonese, is one of the most typical features of Singlish (Leimgruber, 2009, 2011; Lim, 2007). All discourse particles that appear in our treebank are summarized in Table A3 in Appendix A, together with the imported vocabulary. These words express the tone of the sentence; they thus receive the "INTJ" (interjection) POS tag and depend on the root of the sentence or clause with the label "discourse", as shown by the "leh" in (3) of Figure 2. The word "one" is a special instance of this type, its sole purpose being a tone marker in Singlish but not in English, as shown by (7) in Figure 2.

Data Selection and Annotation
Data Source: Singlish is used in written form mainly on social media and local Internet forums. After comparison, we chose the SG Talk Forum 7 as our data source due to its relative abundance of Singlish content. We crawled 84,459 posts using the Scrapy framework 8 from pages dated up to 25th December 2016, retaining sentences of length between 5 and 50, which total 58,310. Sentences are sorted in ascending order of the log likelihood given by an English language model trained with the KenLM toolkit (Heafield et al., 2013) 9 , normalized by sentence length, so that those most different from standard English can be chosen. Among the top 10,000 sentences, 1,977 contain unique Singlish vocabulary as defined by The Coxford Singlish Dictionary 10 , A Dictionary of Singlish and Singapore English 11 , and the Singlish Vocabulary Wikipedia page 12 . The average normalized log likelihood of these 10,000 sentences is -5.81, while the same measure for all sentences in UD-Eng is -4.81; that is, these sentences with Singlish content are 10 times less probable as standard English than the UD-Eng content in the web domain. This contrast indicates the degree of lexical deviation of Singlish from English. We chose 1,200 sentences from the first 10,000. More than 70% of the selected sentences are observed to exhibit the Singlish grammars and imported vocabulary described in section 3.2, so evaluations on this treebank can reflect the performance of POS taggers and parsers on Singlish in general.
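The ranking step above can be sketched as follows; the tiny unigram model here is only an illustrative stand-in for the actual KenLM language model, and all names are our own:

```python
def sentence_logprob(tokens, unigram_logp, oov_logp=-7.0):
    """Toy unigram scorer standing in for a KenLM log10 query."""
    return sum(unigram_logp.get(t, oov_logp) for t in tokens)

def rank_least_english(sentences, unigram_logp):
    """Sort sentences by ascending length-normalized log likelihood,
    so those least probable under the English LM come first."""
    def norm_lp(sentence):
        tokens = sentence.split()
        return sentence_logprob(tokens, unigram_logp) / len(tokens)
    return sorted(sentences, key=norm_lp)
```

Under this scheme, sentences full of out-of-vocabulary Singlish terms receive low normalized scores and rank first for annotation.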
Annotation: The chosen texts are divided by random selection into training, development, and test sets, following the proportions of the UD-Eng divisions, as summarized in Table 1. The sentences are tokenized using the NLTK Tokenizer, 13 and then annotated using the Dependency Viewer. 14 In total, all 17 UD-Eng POS tags and 41 of the 47 UD-Eng dependency labels are present in the Singlish dependency treebank. In addition, 100 randomly selected sentences are double-annotated by one of the co-authors; the inter-annotator agreement is 97.76% accuracy on POS tagging, and 93.44% UAS and 89.63% LAS for dependency parsing. A full summary of the number of occurrences of each POS tag and dependency label is included in Appendix A.
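The agreement figures above are standard attachment scores; a minimal sketch of how they are computed (our own helper, not from the released code):

```python
def attachment_scores(gold, pred):
    """UAS/LAS between two annotations of the same sentences.
    Each sentence is a list of (head_index, dependency_label) pairs,
    one pair per token."""
    total = uas_hits = las_hits = 0
    for g_sent, p_sent in zip(gold, pred):
        for (g_head, g_lab), (p_head, p_lab) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:          # correct head: counts for UAS
                uas_hits += 1
                if g_lab == p_lab:        # correct head and label: LAS
                    las_hits += 1
    return uas_hits / total, las_hits / total
```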

Part-of-Speech Tagging
In order to obtain automatically predicted POS tags as features for a base English dependency parser, we train a POS tagger for UD-Eng using the baseline model of Chen et al. (2016), depicted in Figure 3. Bi-LSTM networks with a CRF output layer (bi-LSTM-CRF) have shown state-of-the-art performance by globally optimizing the tag sequence (Chen et al., 2016). Both the English and Singlish models consist of an input layer, a feature layer, and an output layer.

Base Bi-LSTM-CRF POS Tagger
Input Layer: Each token is represented as a vector by concatenating a word embedding from a lookup table with a weighted average of its character embeddings, given by the attention model of Bahdanau et al. (2014). Following Chen et al. (2016), the input layer produces a dense representation for the current input token by concatenating its word vector with those of its surrounding context tokens in a window of finite size.

Feature Layer: This layer employs a bi-LSTM network to encode the input into a sequence of hidden vectors that embody global contextual information. Following Chen et al. (2016), we adopt bi-LSTMs with peephole connections (Graves and Schmidhuber, 2005).
Output Layer: This is a CRF layer that predicts the POS tags for the input words by maximizing the conditional probability of the tag sequence given the input sentence.
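At inference time, a linear-chain CRF selects the globally best tag sequence, which is conventionally done with the Viterbi algorithm; a minimal sketch over toy emission and transition scores (an illustration of the standard algorithm, not the authors' implementation):

```python
def viterbi(emissions, transitions, tags):
    """Highest-scoring tag sequence under a linear-chain CRF.
    emissions: per-token {tag: score}; transitions: {(prev, tag): score},
    with missing transitions scored 0."""
    # best[t] = (score, path) of the best sequence ending in tag t
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new_best = {}
        for t in tags:
            # best previous tag to transition from
            score, prev = max(
                (best[p][0] + transitions.get((p, t), 0.0), p) for p in tags
            )
            new_best[t] = (score + em[t], best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]
```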

POS Tagger with Neural Stacking
We adopt the deep integration neural stacking structure presented in Chen et al. (2016). As shown in Figure 4, the distributed vector representation of the target word at the input layer of the Singlish Tagger is augmented by concatenating the emission vector produced by the English Tagger with the original word and character-based embeddings, before applying the context-window concatenation described in section 4.
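The augmented input amounts to plain vector concatenation; the sketch below assumes simple list-based vectors, and the names and dimensions are illustrative rather than taken from the released code:

```python
def stacked_token_vector(word_vec, char_vec, eng_emission):
    """Stacked-tagger input for one token: word embedding ++
    character-attention vector ++ English tagger emission vector."""
    return word_vec + char_vec + eng_emission

def windowed_input(token_vecs, i, window=1):
    """Concatenate a token's vector with its neighbours in a context
    window, zero-padding at the sentence boundaries."""
    pad = [0.0] * len(token_vecs[0])
    out = []
    for j in range(i - window, i + window + 1):
        out += token_vecs[j] if 0 <= j < len(token_vecs) else pad
    return out
```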

Results
We use the publicly available source code 15 of Chen et al. (2016) to train a 1-layer bi-LSTM-CRF POS tagger on UD-Eng, using 50-dimensional pre-trained SENNA word embeddings (Collobert et al., 2011). We set the hidden layer size to 300, the initial learning rate for Adagrad (Duchi et al., 2011) to 0.01, the regularization parameter λ to 10 −6 , and the dropout rate to 15%. The tagger gives 94.84% accuracy on the UD-Eng test set after 24 epochs, chosen according to development tests, which is comparable to the state-of-the-art accuracy of 95.17% reported by Plank et al. (2016). We use these settings to perform 10-fold jackknifing of POS tagging on the UD-Eng training set, with an average accuracy of 95.60%. Similarly, we train a POS tagger using the Singlish dependency treebank alone, with word embeddings pre-trained on the Singapore Component of the International Corpus of English (ICE-SIN) (Nihilani, 1992; Ooi, 1997), which consists of both spoken and written texts. However, due to the limited amount of training data, its tagging accuracy is not satisfactory even with a larger dropout rate to avoid over-fitting. In contrast, the neural stacking structure on top of the English base model trained on UD-Eng achieves a POS tagging accuracy of 89.50% 16 , which corresponds to a 51.50% relative error reduction over the baseline Singlish model, as shown in Table 2. We use this tagger for 10-fold jackknifing on the Singlish parser training data, and for tagging the Singlish development and test data.
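The jackknifing procedure itself is simple to sketch; `train_and_tag` below is a hypothetical callback wrapping tagger training and prediction, not an API from the released code:

```python
def jackknife_tags(sentences, train_and_tag, k=10):
    """Tag a training set by k-fold jackknifing: each fold is tagged
    by a model trained on the remaining k-1 folds, so the automatic
    tags seen at parser-training time resemble test-time tags."""
    folds = [sentences[i::k] for i in range(k)]
    tagged = []
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        tagged.extend(train_and_tag(train, held_out))
    return tagged
```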

Dependency Parsing
We adopt the Dozat and Manning (2017) parser 17 as our base model, as displayed in Figure 5, and apply neural stacking to achieve improvements over the baseline parser. Both the base and neural stacking models consist of an input layer, a feature layer, and an output layer.

Base Parser with Biaffine Attention
Input Layer: This layer encodes the current input word by concatenating a pre-trained word embedding with a trainable word embedding and a POS tag embedding from the respective lookup tables.

Feature Layer: The two recurrent vectors produced by the multi-layer bi-LSTM network from each input vector are concatenated and mapped to multiple feature vectors in a lower-dimensional space by a set of parallel multilayer perceptrons (MLPs). Following Dozat and Manning (2017), we adopt Cif-LSTM cells (Greff et al., 2016).

Output Layer: This layer applies a biaffine transformation to the feature vectors to calculate the score of the directed arc between every pair of words. The inferred tree for an input sentence is formed by choosing the head with the highest score for each word, and a cross-entropy loss is calculated to update the model parameters.

16 We empirically find that using ICE-SIN embeddings in the neural stacking model performs better than using English SENNA embeddings. Similar findings hold for the parser; more details are given in section 6.
17 https://github.com/tdozat/Parser
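The biaffine arc-scoring step can be sketched in plain Python; `U` and `b` stand for the biaffine weight matrix and bias vector, and the shapes are purely illustrative:

```python
def biaffine_arc_scores(dep_vecs, head_vecs, U, b):
    """Score every directed arc: s(i, j) = dep_i^T U head_j + b . head_j,
    where j indexes the candidate head of token i."""
    def mat_vec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    return [[dot(d, mat_vec(U, h)) + dot(b, h) for h in head_vecs]
            for d in dep_vecs]

def greedy_heads(scores):
    """Choose the highest-scoring head for each word, as in the base parser."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]
```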

Parser with Neural Stacking
Inspired by the idea of feature-level neural stacking (Chen et al., 2016; Zhang and Weiss, 2016), we concatenate the pre-trained word embedding and the trainable word and tag embeddings with the two recurrent state vectors at the last bi-LSTM layer of the English Parser to form the input vector for each target word. To further preserve the syntactic knowledge captured by the English Parser, the feature vectors from its MLP layer are added to those produced by the Singlish Parser, as illustrated in Figure 6, and the scoring tensor of the Singlish Parser is initialized with the one from the trained English Parser. For training, the loss is back-propagated by reversely traversing all forward paths to all trainable parameters, and the whole model is used collectively for inference.
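Under our reading, the feature-combination step amounts to element-wise addition of the two parsers' MLP outputs; a minimal sketch with list-based vectors (names are our own):

```python
def combine_mlp_features(eng_feats, sin_feats):
    """Add the English parser's MLP feature vectors element-wise to the
    Singlish parser's, preserving source-side syntactic knowledge.
    Both arguments: one feature vector (list of floats) per token."""
    return [[e + s for e, s in zip(ev, sv)]
            for ev, sv in zip(eng_feats, sin_feats)]
```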

Experimental Settings
We train an English parser on UD-Eng with the default model settings of Dozat and Manning (2017). Its accuracy is comparable with that of Ammar et al. (2016); the main difference is caused by our not using fine-grained POS tags. We apply the same settings to a baseline Singlish parser. We attempted to choose a better configuration of the number of bi-LSTM layers and the hidden dimension based on development set performance, but the default settings turn out to perform the best. We thus keep all default hyper-parameters of Dozat and Manning (2017) for training the Singlish parsers. We experimented with different word embeddings, with the raw text sources summarized in Table 3 and further described in section 6.2. When using the neural stacking model, we fix the model configuration of the base English parser and choose the hidden vector size and the number of bi-LSTM layers stacked on top based on development set performance. A 1-layer bi-LSTM with a hidden dimension of 900 performs the best: the bigger hidden layer accommodates the elongated input vector to the stacked bi-LSTM, and the smaller number of recurrent layers avoids over-fitting on the small Singlish dependency treebank, given the deep bi-LSTM English parser network at the bottom. The evaluation of the neural stacking model is further described in section 6.3.

In order to learn the characteristics of distributed lexical semantics for Singlish, we compare the performance of the Singlish dependency parser using several sets of pre-trained word embeddings: GloVe6B, large-scale English word embeddings 18 ; ICE-SIN, Singlish word embeddings trained using GloVe (Pennington et al., 2014) on the ICE-SIN corpus (Nihilani, 1992; Ooi, 1997); and Giga100M, small-scale English word embeddings trained using GloVe (Pennington et al., 2014) with the same settings on a comparable amount of English data randomly selected from the English Gigaword Fifth Edition, for a fair comparison with the ICE-SIN embeddings.
First, the English Giga100M embeddings marginally improve the Singlish parser over both the baseline without pre-trained embeddings and the UD-Eng parser applied directly to Singlish, represented as "ENG-on-SIN" in Table 4. With much more English lexical semantics fed to the Singlish parser via the English GloVe6B embeddings, further enhancement is achieved. Nevertheless, the Singlish ICE-SIN embeddings lead to even greater improvement, with a 13.78% relative error reduction, compared with 7.04% using the English Giga100M embeddings and 9.16% using the English GloVe6B embeddings, despite the large difference in size in the latter case. This demonstrates the distributional differences between Singlish and English tokens, even though the two share a large vocabulary. A more detailed comparison is given in section 6.4.

Knowledge Transfer Using Neural Stacking
We train a parser with neural stacking and Singlish ICE-SIN embeddings, which achieves the best performance among all models, with a UAS of 84.47%, represented as "Stack-ICE-SIN" in Table 4; this corresponds to a 25.01% relative error reduction over the baseline. It demonstrates that knowledge from English can be successfully incorporated to boost the Singlish parser.
To further evaluate the effectiveness of the neural stacking model, we also trained a base model on the combination of UD-Eng and the Singlish treebank, represented as "ENG-plus-SIN" in Table 4, which is still outperformed by the neural stacking model. In addition, we performed 5-fold cross-validation for the base parser with Singlish ICE-SIN embeddings and for the parser using neural stacking, where half of each held-out fold is used as the development set. The average UAS and LAS across the 5 folds, shown in Table 5, and the average relative error reduction of 23.61% suggest that the overall improvement from knowledge transfer using neural stacking remains consistent. This significant improvement is further explained in section 6.4.

Improvements over Grammar Types
To analyze the sources of improvement for Singlish parsing under different model configurations, we conduct error analysis over 5 syntactic categories 19 , comprising 4 of the grammar types described in section 3.2 20 and 1 category for all other cases, including sentences containing imported vocabulary but expressed in basic English syntax. The number of sentences and the results for each group of the test set are shown in Table 6. The neural stacking model yields the biggest improvement in all categories except for a tied UAS on the "NP Deletion" cases, which explains the significant overall improvement.
Comparing the base model with ICE-SIN embeddings against the base parser trained on UD-Eng, which contain syntactic and semantic knowledge of Singlish and English respectively, the former outperforms the latter on all 4 types of Singlish grammar but not on the remaining samples. This suggests that the base English parser mainly contributes to analyzing basic English syntax, while the base Singlish parser models unique Singlish grammars better.
Similar trends are also observed for the base model using the English Giga100M embeddings, but its overall performance is not as good as with the ICE-SIN embeddings, especially on basic English syntax, where it undermines performance to a greater extent. This suggests that only limited English distributed lexical semantic information can be integrated to help model Singlish syntax, due to the differences in distributed lexical semantics.

19 Multiple labels are allowed for one sentence.
20 The "Inversion" type of grammar is not analyzed since there is only 1 such sentence in the test set.

Conclusion
We have investigated dependency parsing for Singlish, an important English-based creole, by annotating a Singlish dependency treebank with 10,986 words and building an enhanced parser that leverages knowledge transferred from a 20-times-bigger English treebank in Universal Dependencies. We demonstrate the effectiveness of using neural stacking for feature transfer by boosting Singlish dependency parsing performance from 79.29% to 84.47% UAS, a 25.01% relative error reduction over the parser trained with all available Singlish resources. We release the annotated Singlish dependency treebank, the trained model, and the source code for the parser with free public access. Possible future work includes expanding the investigation to other regional languages such as Malay and Indonesian.