Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data

In this paper, we propose efficient and less resource-intensive strategies for parsing code-mixed data. These strategies are not constrained by in-domain annotations; rather, they leverage pre-existing monolingual annotated resources for training. We show that these methods produce significantly better results than an informed baseline. Because no evaluation set exists for code-mixed structures, we also present a data set of 450 Hindi and English code-mixed tweets of Hindi multilingual speakers for evaluation.


Introduction
Code-switching or code-mixing is a sociolinguistic phenomenon in which multilingual speakers switch back and forth between two or more common languages or language varieties in a single utterance. The phenomenon is mostly prevalent in spoken language and in informal settings on social media such as news groups, blogs and chat forums. Computational modeling of code-mixed data, particularly from social media, is presumed to be more challenging than that of monolingual data. The main contributing factors are non-adherence to a standard grammar, spelling variations and/or back-transliteration. It has been generally observed that traditional NLP techniques perform poorly when processing code-mixed language data (Solorio and Liu, 2008b; Vyas et al., 2014; Çetinoğlu et al., 2016).
More recently, there has been a surge in studies concerning code-mixed data from social media (Solorio and Liu, 2008a; Vyas et al., 2014; Sharma et al., 2016; Rudra et al., 2016; Joshi et al., 2016, and others). Besides these individual research articles, a series of shared tasks and workshops on preprocessing and shallow syntactic analysis of code-mixed data have also been conducted at multiple venues such as Empirical Methods in NLP (EMNLP 2014 and 2016), International Conference on NLP (ICON 2015 and 2016) and Forum for Information Retrieval Evaluation (FIRE 2015 and 2016). Most of these works attempt to address preprocessing issues, such as language identification and transliteration, that any higher NLP application may face in processing such data.
Due to the paucity of annotated resources in the code-mixed genre, the performance of monolingual parsing models is yet to be evaluated on code-mixed structures. This paper fills this gap by presenting an evaluation set annotated with dependency structures. We also propose different parsing strategies that exploit nothing but the pre-existing annotated monolingual data. We show that, with trivial adaptations, monolingual parsing models can effectively parse code-mixed data.

Parsing Strategies
We explore three different parsing strategies to parse code-mixed data and evaluate their performance on a manually annotated evaluation set. These strategies are distinguished by the way they use pre-existing treebanks for parsing code-mixed data.
• Monolingual: The monolingual method uses two separate models trained from the respective monolingual treebanks of the languages which are present in the code-mixed data. We can use the monolingual models in two different ways. Firstly, we can parse each code-mixed sentence by choosing the monolingual model based on the matrix language of the sentence. A clear disadvantage of this method is that the monolingual parser may not accurately parse those fragments of a sentence which belong to a language unknown to the model. Therefore, we consider this as the baseline method. Secondly, we can linearly interpolate the predictions of both monolingual models at inference time. The interpolation weights are chosen based on the matrix language of each parsing configuration. The interpolated oracle output is defined as:

$$ y = \lambda_m \, f(\phi(c_m)) + (1 - \lambda_m) \, f(\phi(c_s)) \qquad (1) $$

where $f(\cdot)$ is a softmax layer of our neural parsing model, $\phi(c_m)$ and $\phi(c_s)$ are the feature functions of the matrix and subordinate languages respectively, and $\lambda_m$ is the interpolation weight for the matrix language (see §5 for more details on the parsing model). Instead of selecting the matrix language at sentence level, we define the matrix language individually for each parsing configuration. We define the matrix language of a configuration based on the language tags of the top 2 nodes in the stack and buffer belonging to certain syntactic categories such as adposition, auxiliary, particle and verb.
• Multilingual: In the second approach, we train a single model on a combined treebank of the languages represented in the code-mixed data. This method has a clear advantage over the baseline Monolingual method in that it would be aware of the grammars of both languages of the code-mixed data. However, it may not be able to properly connect the fragments of two languages, as the model lacks evidence for such mixed structures in the augmented data. This would particularly happen if the code-mixed languages are typologically diverse. Moreover, training a parsing model on augmented data with more diverse structures will worsen the structural ambiguity problem. However, we can easily circumvent this problem by including the token-level language tag as an additional feature in the parsing model (Ammar et al., 2016).
• Multipass: In the Multipass method, we train two separate models, as in the Monolingual method. However, we apply these models to the code-mixed data differently. Unlike the Monolingual method, we use both models simultaneously for each sentence and pass the input to the models twice. There are two possible ways to accomplish this. We can first parse all the fragments of each language using their respective parsing models one by one, and then parse the root nodes of the parsed fragments with the matrix language parsing model. Alternatively, we can parse the subordinate language first and then parse the roots of the subordinate fragments together with the fragments of the matrix language using the matrix language parser. In both cases, the monolingual parsers would not be affected by the cross-language structures. More importantly, the matrix language parser in the second pass would be unaffected by the internal structure of the subordinate language fragments. There is a caveat, however: we need to identify the code-mixed fragments accurately, which is a non-trivial task. In this paper, we use token-level language information to segment tweets into subordinate or matrix language fragments.
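The linear interpolation used by the second variant of the Monolingual strategy above can be illustrated with a minimal sketch. It assumes that each monolingual model has already produced a softmax distribution over the same set of transitions for the current configuration. The unlabeled four-transition inventory, the raw scores, and the helper names `softmax` and `interpolate` are hypothetical and chosen only for illustration; the weight 0.75 lies in the 0.7 to 0.8 range reported later in the paper.

```python
import math

# Simplified, unlabeled transition inventory (an assumption for illustration).
TRANSITIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "REDUCE"]


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(p_matrix, p_subordinate, lam_m):
    """Mix two transition distributions; lam_m weights the matrix-language model."""
    if not 0.0 <= lam_m <= 1.0:
        raise ValueError("lam_m must lie in [0, 1]")
    return [lam_m * pm + (1.0 - lam_m) * ps
            for pm, ps in zip(p_matrix, p_subordinate)]


# Hypothetical raw scores of the two monolingual models for one configuration.
p_m = softmax([2.0, 0.5, 0.1, -1.0])   # matrix-language model
p_s = softmax([0.2, 1.5, 0.3, 0.0])    # subordinate-language model

mixed = interpolate(p_m, p_s, lam_m=0.75)
best = TRANSITIONS[max(range(len(mixed)), key=mixed.__getitem__)]
print(best)
```

Because both inputs are probability distributions and the weights sum to one, the mixture is again a distribution, so the highest-scoring transition can be chosen greedily as usual.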

Code-mixed Dependency Annotations
To the best of our knowledge, there is no available code-mixed data set that contains dependency annotations. There are, however, a few available code-mixed data sets that provide annotations related to the language of a token, its POS and chunk tags. For an intrinsic evaluation of our parsing models on code-mixed texts, we manually annotated a data set of Hindi-English code-mixed tweets with dependency structures. The code-mixed tweets were sampled from a large set of tweets of Indian language users that we crawled from Twitter using Tweepy, a Twitter API wrapper. We used a language identification system (see §4) to filter Hindi-English code-mixed tweets from the crawled Twitter data. Only those tweets were selected that satisfied a minimum ratio of 30:70 (%) code-mixing. From this data set, we manually selected 450 tweets for annotation. The selected tweets were thoroughly checked for code-mixing ratio. While calculating the code-mixing ratio, we do not consider borrowings from English as an instance of code-mixing. For POS tagging and dependency annotation, we used the Universal Dependencies guidelines (De Marneffe et al., 2014), while language tags are assigned based on the tagset defined in (Solorio et al., 2014; Jamatia et al., 2015). The annotations are split into testing and tuning sets for evaluation and tuning of our models. The tuning set consists of 225 tweets (3,467 tokens) with a mixing ratio of 0.54 and the testing set contains 225 tweets (3,322 tokens) with a mixing ratio of 0.53. Here the mixing ratio is defined over the n sentences in the data set, where H_s and E_s are the number of Hindi words and English words in sentence s respectively.

Preprocessing
The parsing strategies that we discussed above for code-mixed texts rely heavily on language identification of individual tokens. In addition, we need normalization of non-standard word forms prevalent in code-mixed social media content and back-transliteration of Romanized Hindi words. Here we discuss both preprocessing steps in brief.

Language Identification
We model language identification as a classification problem where each token needs to be classified into one of the following tags: 'Hindi' (hi), 'English' (en), 'Acronym' (acro), 'Named Entity' (ne) and 'Universal' (univ). For this task, we use the feedforward neural network architecture of Bhat et al. (2016), proposed for Named Entity extraction in code-mixed data of Indian languages. We train the network with similar feature representations on the data set provided in the ICON 2015 shared task on language identification. The data set contains 728 Facebook comments annotated with the five language tags noted above. We evaluated the predictions of our identification system against the gold language tags in our code-mixed development set and test set. Even though the model is trained on a very small data set, its prediction accuracy is still above 96% for both the development set and the test set. The results are shown in Table 1.

Normalization and Transliteration
We model the problem of both normalization and back-transliteration of (noisy) Romanized Hindi words as a single transliteration problem. Our goal is to learn a mapping from both standard and non-standard Romanized Hindi word forms to their respective standard forms in Devanagari. For this purpose, we use the structured perceptron of Collins (2002), which optimizes a given loss function over the entire observation sequence. For training the model, we use the transliteration pairs (87,520) from the Libindic transliteration project and Brahmi-Net (Kunchukuttan et al., 2015), and augment them with noisy transliteration pairs (63,554) which are synthetically generated by dropping non-initial vowels and replacing consonants based on their phonological proximity.
We use Giza++ (Och and Ney, 2003) to character align the transliteration pairs for training.
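The synthetic noise generation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not list the vowel inventory or the phonological proximity table, so `VOWELS`, `CONSONANT_SWAPS`, and the function names below are hypothetical stand-ins rather than the resources actually used.

```python
# Assumed vowel inventory for Romanized Hindi (hypothetical).
VOWELS = set("aeiou")

# Tiny, made-up table of "phonologically close" consonant replacements.
CONSONANT_SWAPS = {"k": "q", "z": "j", "v": "w", "f": "ph"}


def drop_noninitial_vowels(word):
    """Keep the first character and drop every later vowel."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch.lower() not in VOWELS)


def swap_consonants(word):
    """Return variants with exactly one consonant replaced from the table."""
    variants = set()
    for i, ch in enumerate(word):
        if ch in CONSONANT_SWAPS:
            variants.add(word[:i] + CONSONANT_SWAPS[ch] + word[i + 1:])
    return variants


def noisy_variants(word):
    """All distinct noisy forms of a Romanized word, sorted for stability."""
    variants = {drop_noninitial_vowels(word)} | swap_consonants(word)
    variants.discard(word)  # a variant identical to the input adds nothing
    return sorted(variants)


# Pair each noisy Romanized form with the standard Devanagari form.
pairs = [(noisy, "किताब") for noisy in noisy_variants("kitab")]
print(pairs)
```

Each generated pair maps a noisy Romanized form to the same Devanagari target as its clean source word, so it can be appended directly to the clean transliteration pairs before character alignment.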
At inference time, our transliteration model would predict the most likely word form for each input word. However, the single-best output from the model may not always be the best option considering the overall sentential context. Contracted word forms in social media content are quite often ambiguous and can represent different standard word forms; for example, 'pt' may refer to 'put', 'pit', 'pat', 'pot' and 'pet'. To resolve this ambiguity, we extract n-best transliterations from the transliteration model using beam-search decoding. The best word sequence is then decoded using an exact search over $b^n$ word sequences scored by a tri-gram language model. The language model is trained on monolingual data using the IRSTLM toolkit (Federico et al., 2008) with Kneser-Ney smoothing. For English, we use a similar model for normalization, which we trained on the noisy word forms (390,000) synthetically generated from the English vocabulary.
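The sentence-level decoding step above can be sketched as an exhaustive search over all combinations of per-word candidates, each scored by a trigram language model. The sketch assumes the candidates are already available as one list per input word; the scoring function `toy_logprob`, its table `TOY_LM`, the back-off value, and the example sentence are invented for illustration and do not reproduce the IRSTLM model used in the paper.

```python
import itertools


def sequence_score(words, trigram_logprob):
    """Sum trigram log-probabilities over a sentence padded with boundaries."""
    padded = ["<s>", "<s>", *words, "</s>"]
    return sum(trigram_logprob(padded[i - 2], padded[i - 1], padded[i])
               for i in range(2, len(padded)))


def decode(candidates, trigram_logprob):
    """Exact search: score every combination of candidates, keep the best."""
    return max(itertools.product(*candidates),
               key=lambda seq: sequence_score(seq, trigram_logprob))


# Toy language model (hypothetical values): listed trigrams are likely,
# everything else receives a flat penalty.
TOY_LM = {
    ("<s>", "please", "put"): -1.0,
    ("please", "put", "it"): -1.0,
    ("put", "it", "down"): -1.0,
}


def toy_logprob(w1, w2, w3):
    return TOY_LM.get((w1, w2, w3), -5.0)


# Candidate normalizations for the noisy input "please pt it down".
candidates = [["please"], ["put", "pit", "pot"], ["it"], ["down"]]
best = decode(candidates, toy_logprob)
print(" ".join(best))
```

The number of scored sequences is the product of the candidate list sizes, which grows exponentially with sentence length, so exhaustive search of this kind is only practical for short inputs such as tweets and small candidate lists.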

Experimental Setup
The parsing experiments reported in this paper are conducted using a non-linear neural network-based transition system similar to that of Chen and Manning (2014). The models are trained on the Universal Dependency Treebanks of Hindi and English released under version 1.4 of Universal Dependencies (Nivre et al., 2016).

Parsing Models
Our parsing model is based on the transition-based dependency parsing paradigm (Nivre, 2008). In particular, we use an arc-eager transition system (Nivre, 2003). The arc-eager system defines a set of configurations for a sentence $w_1, \ldots, w_n$, where each configuration C = (S, B, A) consists of a stack S, a buffer B, and a set of dependency arcs A. For each sentence, the parser starts with an initial configuration where S = [ROOT], B = [$w_1, \ldots, w_n$] and A = ∅, and terminates with a configuration C if the buffer is empty and the stack contains the ROOT. The parse trees derived from transition sequences are given by A. To derive the parse tree, the arc-eager system defines four types of transitions (t): 1) Shift, 2) Left-Arc, 3) Right-Arc, and 4) Reduce. Similar to Chen and Manning (2014), we use a non-linear neural network to predict the transitions for the parser configurations. The neural network model is the standard feed-forward neural network with a single layer of hidden units. We use 200 hidden units and the ReLU activation function. The output layer uses the softmax function for probabilistic multi-class classification. The model is trained by minimizing cross-entropy loss with L2 regularization over the entire training data. We also use mini-batch Adagrad for optimization (Duchi et al., 2011) and apply dropout (Hinton et al., 2012).
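The arc-eager transition system described above can be illustrated with a small, unlabeled sketch. It follows the standard textbook definition of the four transitions and is not taken from the authors' code; dependency labels are omitted for brevity, and the names `initial_config`, `step`, `has_head`, and `is_terminal`, as well as the three-word example, are hypothetical.

```python
ROOT = 0  # index of the artificial root node; words are numbered 1..n


def initial_config(n):
    """Initial configuration: S = [ROOT], B = [w_1..w_n], A = empty set."""
    return [ROOT], list(range(1, n + 1)), set()


def has_head(node, arcs):
    """True if some arc (head, dependent) already attaches `node`."""
    return any(dep == node for _, dep in arcs)


def is_terminal(config):
    """Terminal when the buffer is empty and the stack contains ROOT."""
    stack, buffer, _ = config
    return not buffer and ROOT in stack


def step(config, transition):
    """Apply one arc-eager transition and return the new configuration."""
    stack, buffer, arcs = list(config[0]), list(config[1]), set(config[2])
    if transition == "SHIFT":
        if not buffer:
            raise ValueError("SHIFT needs a non-empty buffer")
        stack.append(buffer.pop(0))
    elif transition == "LEFT-ARC":
        # Stack top becomes a dependent of the buffer front and is popped.
        if not buffer or stack[-1] == ROOT or has_head(stack[-1], arcs):
            raise ValueError("LEFT-ARC precondition violated")
        arcs.add((buffer[0], stack.pop()))
    elif transition == "RIGHT-ARC":
        # Buffer front becomes a dependent of the stack top and is pushed.
        if not buffer:
            raise ValueError("RIGHT-ARC needs a non-empty buffer")
        arcs.add((stack[-1], buffer[0]))
        stack.append(buffer.pop(0))
    elif transition == "REDUCE":
        # Only nodes that already have a head may be popped.
        if not has_head(stack[-1], arcs):
            raise ValueError("REDUCE precondition violated")
        stack.pop()
    else:
        raise ValueError(f"unknown transition: {transition}")
    return stack, buffer, arcs


# "She ate fish": 1 = She, 2 = ate, 3 = fish.
config = initial_config(3)
for t in ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]:
    config = step(config, t)
stack, buffer, arcs = config
print(sorted(arcs))
```

In the parser described in the paper, the transition at each step is predicted by the neural network from features of the current configuration; here the sequence is fixed by hand to show how the arc set A is built up.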
From each parser configuration, we extract features related to the top four nodes in the stack, top four nodes in the buffer and leftmost and rightmost children of the top two nodes in the stack and the leftmost child of the top node in the buffer.

POS Models
We train POS tagging models using a neural network architecture similar to the one discussed above. Unlike (Collobert et al., 2011), we do not learn separate transition parameters. Instead, we include the structural features in the input layer of our model with other lexical and non-lexical units. We use second-order structural features, two words to either side of the current word, and the last three characters of the current word.
We trained two POS tagging models: Monolingual and Multilingual. In the Monolingual approach, we divide each code-mixed sentence into contiguous fragments based on the language tags assigned by the language identifier. Words with language tags other than 'hi' and 'en' (such as univ, ne and acro) are merged with the preceding fragment. Each fragment is then individually tagged by the monolingual POS taggers trained on their respective monolingual POS data sets. In the Multilingual approach, we train a single model on combined data sets of the languages in the code-mixed data. We concatenate an additional 1x2 vector in the input layer of the neural network representing the language tag of the current word. Table 2 gives the POS tagging accuracies of the two models.

Word Representations
For both POS tagging and parsing models, we include the lexical features in the input layer of the neural network using pre-trained word representations, while for the non-lexical features, we use randomly initialized embeddings within a range of −0.25 to +0.25. We use Hindi and English monolingual corpora to learn the distributed representation of the lexical units. The English monolingual data contains around 280M sentences, while the Hindi data is comparatively smaller and contains around 40M sentences. The word representations are learned using the Skip-gram model with negative sampling, which is implemented in the word2vec toolkit (Mikolov et al., 2013). For multilingual models, we use the robust projection algorithm of Guo et al. (2015) to induce bilingual representations using the monolingual embedding space of English and a bilingual lexicon of Hindi and English (∼63,000 entries). We extracted the bilingual lexicon from the ILCI and Bojar Hi-En parallel corpora (Jha, 2010; Bojar et al., 2014).
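The fragment splitting used by the Monolingual POS tagging approach described above can be sketched as follows. The sketch assumes token-level language tags are already available. The text does not say what happens to a non-'hi'/'en' token at the very start of a sentence, so attaching it to the first fragment (or returning an untyped fragment when no 'hi'/'en' token exists) is an assumption, as are the function name `split_fragments` and the example tweet.

```python
def split_fragments(tokens, lang_tags):
    """Group tokens into contiguous same-language fragments.

    Tokens tagged other than 'hi' or 'en' (e.g. univ, ne, acro) are merged
    with the preceding fragment.
    """
    fragments = []  # list of (language, [tokens])
    pending = []    # sentence-initial tokens with no preceding fragment
    for token, tag in zip(tokens, lang_tags):
        if tag not in ("hi", "en"):
            if fragments:
                fragments[-1][1].append(token)
            else:
                pending.append(token)  # assumption: attach to first fragment
            continue
        if fragments and fragments[-1][0] == tag:
            fragments[-1][1].append(token)
        else:
            fragments.append((tag, pending + [token]))
            pending = []
    if pending:  # assumption: sentence without any 'hi' or 'en' token
        fragments.append((None, pending))
    return fragments


tokens = ["yaar", "this", "movie", "is", "awesome", "!", "bahut", "achhi", "thi"]
tags = ["hi", "en", "en", "en", "en", "univ", "hi", "hi", "hi"]
for lang, words in split_fragments(tokens, tags):
    print(lang, words)
```

Each resulting fragment carries a single language label, so it can be passed to the tagger of that language; the same token-level segmentation idea underlies the Multipass parsing strategy.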

Experiments and Results
We conducted multiple experiments to measure the effectiveness of the proposed parsing strategies in both gold and predicted settings. In predicted settings, we use the monolingual POS taggers for all the experiments. We used the Monolingual method as the baseline for evaluating other parsing strategies. The baseline model parses each sentence in the evaluation sets using either the Hindi or the English parsing model based on the matrix language of the sentence. For the baseline and the Multipass methods, we use the bilingual embedding space derived from the matrix language embedding space (Hindi or English) to represent lexical nodes in the input layer of our parsing architecture. In the Interpolation method, we use separate monolingual embedding spaces for each model. The interpolation weights are tuned using the development set and the best results are achieved at $\lambda_m$ ranging from 0.7 to 0.8 (see eq. 1). The results of our experiments are reported in Table 3. Table 4 shows the impact of sentential decoding for choosing the best normalized and/or back-transliterated tweets on different parsing strategies (see §4). All of our parsing models produce results that are at least 10 LAS points better than our baseline parsers, which otherwise provide competitive results on Hindi and English evaluation sets (Straka et al., 2016). Our results are not directly comparable to those of Straka et al. (2016) due to different parsing architectures: while we use a simple greedy, projective transition system, Straka et al. (2016) use a search-based swap system. Among all the parsing strategies, the Interpolated methods perform comparatively better on both monolingual and code-mixed evaluation sets. The Interpolation method manipulates the parameters of both languages quite intelligently at each parsing configuration. Despite being quite accurate on code-mixed evaluation sets, the Multilingual model is less accurate in the single-language scenario. The Multilingual model also performs worse for Hindi since its lexical representation is derived from the English embedding space. It is at least 2 LAS points worse than the Interpolated and the Multipass methods. However, unlike the latter methods, the Multilingual models do not have a run-time and computational overhead. In comparison to the Interpolated and Multilingual methods, the Multipass methods are the most affected by errors in language identification. Quite often these errors lead to wrong segmentation of code-mixed fragments, which adversely alters their internal structure.
Despite higher gains over the baseline models, the performance of our models is nowhere near the performance of monolingual parsers on newswire texts. This is due to the inherent complexities of code-mixed social media content (Solorio and Liu, 2008b; Vyas et al., 2014; Çetinoğlu et al., 2016).

Conclusion
In this paper, we have evaluated different strategies for parsing code-mixed data that only leverage monolingual annotated data. We have shown that code-mixed texts can be efficiently parsed by the monolingual parsing models if they are intelligently manipulated. Against an informed monolingual baseline, our parsing strategies are at least 10 LAS points better. Among the different strategies that we proposed, the Multilingual and Interpolation methods are two competitive methods for parsing code-mixed data.
The code of the parsing models is available at the GitHub repository https://github.com/irshadbhat/cm-parser, while the data can be found under the Universal Dependencies of Hindi at https://github.com/UniversalDependencies/UD_Hindi.
Table 2 :
Tagging accuracies for monolingual and multilingual models. LID = Language tag, G = Gold LID, A = Auto LID.

Table 1 :
Language Identification results on code-mixed development set and test set.

Table 4 :
Parsing accuracies with exact search and k-best search (k = 5).CM d|t = Code-mixed development and testing sets.