Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank

Code-mixing (CM) refers to the mixing of various linguistic units (morphemes, words, modifiers, phrases, clauses and sentences) primarily from two or more participating grammatical systems within a sentence [10]. The development of CM NLP systems has gained significant importance in recent times due to an upsurge in the usage of CM data by multilingual speakers. However, this is a challenging task because of the complexities created by the presence of multiple languages together. These complexities are further compounded by the inconsistencies present in raw data from social media and other platforms. In this work, we explore methods to efficiently parse Bengali-English CM, which is widely spoken in India and Bangladesh. We present a neural stack-based dependency parser for Bengali-English CM data that utilizes pre-existing resources for the closely related Hindi-English CM as well as monolingual treebanks for Bengali, Hindi, and English. To address the scarcity of annotated resources for the Bengali-English CM pair, we present a rule-based system that computationally generates a synthetic CM dataset from parallel treebanks of Bengali and English. The generated synthetic CM data does not require the overhead of preprocessing for language identification and normalization, as we get orthographic norms from standard corpora. Next, we project automatic annotations from Bengali and Hindi-English treebanks to Bengali-English using simple heuristics, creating a synthetic CM treebank for Bengali-English (Syn-BE). Incorporating Syn-BE into our neural stacking parser further improves its performance. Our final model achieves an accuracy of 89.63% for POS tagging, and 76.24% UAS and 61.41% LAS for dependency parsing. This model improves parsing results by 1.37% LAS points when compared with a stacking model utilizing pre-existing Hindi-English and other monolingual resources.
We also present a dataset of 500 Bengali-English tweets annotated under the Universal Dependencies scheme, which can be used for evaluating the system as well as providing seed training data.


Introduction
Code-mixing refers to the mixing of various linguistic units (morphemes, words, modifiers, phrases, clauses and sentences) primarily from two participating grammatical systems within a sentence (Bhatia and Ritchie, 2008). This is essentially different from code-switching, which refers to the co-occurrence of speech extracts belonging to two different grammatical systems (Gumperz, 1982). Code-switching can be either inter-sentential or intra-sentential; however, there are strict phrasal boundaries, and within one lexical unit the syntax of only one language is maintained. Since more recent work has not focused on the differences between the two phenomena, we use the two terms interchangeably.
Recently, code-mixing, which was often observed only in speech, has pervaded almost all forms of communication due to the growing popularity and usage of social media platforms by multilingual speakers (Rijhwani et al., 2017). Therefore, there has been considerable effort in building CM NLP systems such as language identification (Nguyen and Dogruoz, 2013; Solorio et al., 2014; Barman et al., 2014; Rijhwani et al., 2017), and normalization and back-transliteration (Dutta et al., 2015). Part-of-speech (POS) and chunk tagging of code-mixed data for various South Asian languages paired with English have been attempted with promising results (Nelakuditi et al., 2016). Ammar et al. (2016) developed a single multilingual parser, trained on a multilingual set of treebanks, that outperformed monolingually trained parsers for several target languages. In the CoNLL 2018 shared task, several participating teams developed multilingual dependency parsers that integrated cross-lingual learning for resource-poor languages and were evaluated on monolingual treebanks belonging to 82 unique languages (Zeman et al., 2018). However, none of these multilingual parsers has been evaluated on code-mixed data or adapted specifically for CM parsing.
Bengali-English code-mixing is found in abundance, as Bengali is widely spoken in India and Bangladesh. It is the second most widely spoken language in India after Hindi (Bhatia, 1982). Because of the inherent structural and semantic similarity between Bengali and Hindi, we also observe a close proximity between Bengali-English and Hindi-English code-mixing. Both language pairs deal with the challenge of mixing typologically diverse languages: SOV word order for Hindi/Bengali and SVO word order for English. A dependency parser for Hindi-English code-mixing has already been presented in prior work. In comparison, Bengali-English code-mixing is relatively unexplored, barring significant work on language identification (Das and Gambäck, 2014) and POS tagging (Jamatia et al., 2015), which serve as preliminary tasks for more advanced parsing applications down the pipeline. The main hindrance to the development of parsing technologies for Bengali-English stems from the lack of annotated resources for the code-mixing of this language pair. In this paper, we utilize the widely available pre-existing resources for monolingual Bengali, Hindi and English as well as for Hindi-English code-mixing, and adapt them for Bengali-English dependency parsing. We also propose a rule-based system to synthetically generate Bengali-English code-mixed data. An attempt has been made to generate code-mixed data for the Spanish-English language pair (Pratapa et al., 2018), but none for the Hindi-English or Bengali-English language pairs, as these pairs pose special challenges due to their different word orders, which commonly violate most code-mixing theories (Sinha and Thakur, 2005). We further present a method to project dependency annotations onto our Bengali-English CM data from a monolingual Bengali treebank and a Hindi-English CM treebank, generating a synthetic treebank for Bengali-English (Syn-BE) that helps improve the accuracy of our dependency parser.
For evaluation purposes, we present a dataset of 500 Bengali-English tweets annotated under the Universal Dependencies scheme.

Data Preparation and Annotation
We prepared a dataset of 500 Bengali-English tweets by crawling Twitter using Tweepy, an API wrapper for Twitter. We identify Bengali-English tweets by running the crawled tweets through a language identification system (Bhat et al., 2018) trained on the dataset provided by ICON 2015. We select only those tweets that satisfy a minimum code-mixing ratio of 30:70 (%). The code-mixing ratio is computed from n, the number of sentences in the dataset, and M_s and E_s, the number of words in the matrix and embedded language in sentence s, respectively. Next, we manually select 500 tweets from the resulting tweets and normalize and/or transliterate each word before annotating them with POS and dependency tags following the Universal Dependencies guidelines (Nivre et al., 2016). The language tags are annotated based on the tag set defined in (Solorio et al., 2014; Jamatia et al., 2015). Figure 1 illustrates the conventions followed by our annotators for unique code-mixed constructions. Verbifying the English verb start by adding the Bengali light verb hobe ("will be") yields the hybrid compound verb start hobe ("will be"). Here, start is POS tagged as 'NOUN' instead of 'VERB' because it functions as a noun in this CM lexical unit and only the light verb hobe ("will be") carries verbal inflection. Also, #BOSS2 is tagged as 'PROPN' instead of 'X' as it is a syntactic token in this context. These annotations are consistent with the annotations for Hindi-English CM. The resulting dataset is split into three sets: 200 tweets for testing, 160 for tuning, and 140 to be used as the training set in our stacking model for dependency parsing. The Bengali-English CM dataset is available at https://github.com/urmig/UD_bn-en.
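To make the 30:70 filter concrete, the following is a minimal sketch under stated assumptions, not the exact definition used for our data: it assumes that the ratio of a sentence is the share of embedded-language words, E_s / (M_s + E_s), that the matrix language is whichever of the two languages contributes more words, and that a dataset-level figure is the mean over its n sentences. The function names and the language labels `bn`, `en` and `univ` are hypothetical.

```python
from collections import Counter

def embedded_share(lang_tags, languages=("bn", "en")):
    """Share of embedded-language words, E_s / (M_s + E_s), for one sentence.

    Assumption: the matrix language is the one with more words in the
    sentence; tags outside `languages` (e.g. 'univ') are ignored.
    """
    counts = Counter(tag for tag in lang_tags if tag in languages)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    matrix_words = max(counts[lang] for lang in languages)  # M_s
    embedded_words = total - matrix_words                   # E_s
    return embedded_words / total

def passes_cm_filter(lang_tags, min_share=0.30):
    """Keep a sentence only if at least `min_share` of its words are embedded."""
    return embedded_share(lang_tags) >= min_share

def mean_embedded_share(sentences):
    """Assumed dataset-level ratio: mean of the per-sentence shares."""
    if not sentences:
        return 0.0
    return sum(embedded_share(s) for s in sentences) / len(sentences)

# Hypothetical language tags for two tweets.
mixed_tweet = ["bn", "bn", "en", "univ", "bn", "en"]
mostly_bengali = ["bn"] * 9 + ["en"]
print(passes_cm_filter(mixed_tweet), passes_cm_filter(mostly_bengali))
```

Under these assumptions the first tweet (2 English words out of 5 language-tagged words) is kept, while the second (1 out of 10) is discarded.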

Code Mixing Data Synthesis
Based on the token-level data distribution in Table 1, we observe that the matrix language in the majority of CM sentences is Bengali. The same is observed for Hindi-English CM data. Under this assumption, we proceed with synthetic data generation by mixing English linguistic elements into the matrix of Bengali sentences. A frequently observed phenomenon in CM data is the replacement of noun phrases in one language by the corresponding noun phrase in the other language (Dey and Fung, 2014). Sinha and Thakur (2005) discussed CM constraints for Hindi-English and concluded that code-mixing for this language pair is not entirely arbitrary. In our code-mixing method, we closely follow the Closed Class Constraint, which states that matrix language elements within the closed class of grammar (possessives, ordinals, determiners, pronouns) are not allowed in code-mixing (Sridhar and Sridhar, 1980; Joshi, 1982).
Example (3) demonstrates an unnatural and uncommon code-mixed construction, and thus we conclude that the two mixing constraints hold true for Bengali-English CM text as well. We extend these constraints to question words, which can fall in the POS categories ADV and PRON, as well as to adpositions (prepositions and postpositions). We note that example (4) results in an acceptable code-mixed sentence because the closed class elements from the matrix language Bengali are retained.
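The constraint and its extensions can be read as a simple per-token test on the Bengali side. The sketch below is illustrative only: the token dictionaries, the flag names (`is_possessive`, `is_ordinal`, `is_question_word`) and the choice of UD tags standing in for the closed class are assumptions made for this example.

```python
# UD tags treated as closed class in this sketch (an assumption):
# determiners, pronouns and adpositions.
CLOSED_CLASS_UPOS = {"DET", "PRON", "ADP"}

def may_be_replaced(token):
    """Return True if a Bengali token may be swapped for its English counterpart.

    `token` is a dict with a 'upos' key and optional boolean flags; the flags
    are hypothetical annotations, not part of any standard format.
    """
    if token["upos"] in CLOSED_CLASS_UPOS:
        return False
    # Possessives, ordinals and question words are retained in Bengali
    # regardless of their POS tag.
    for flag in ("is_possessive", "is_ordinal", "is_question_word"):
        if token.get(flag, False):
            return False
    return True

tokens = [
    {"form": "aapnaar", "upos": "PRON", "is_possessive": True},  # "your"
    {"form": "aatmaviswas", "upos": "NOUN"},                     # "self-confidence"
    {"form": "jonyo", "upos": "ADP"},                            # "for"
    {"form": "kothay", "upos": "ADV", "is_question_word": True}, # "where"
]
print([t["form"] for t in tokens if may_be_replaced(t)])
```

Only the noun survives the test, so only it would be a candidate for replacement by English material.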

The Code-Mixing Process
The pipeline for our code-mixing script is shown in Figure 2. The script takes shallow-parsed English and Bengali parallel corpora as inputs. Consistency across chunks in parallel sentences is imperative for direct replacement of chunks for code-mixing. However, there are various structural differences between the constituency parses obtained for English by the Stanford Parser (Klein and Manning, 2003) and the chunks of the parallel Bengali sentences. The first module of the pipeline therefore harmonizes the English chunks using rules such as the following:
1. Separate the coordinating conjunction and its conjuncts into different chunks as they are treated separately in Bengali.
3. Convert prepositional phrase (PP) to NP by making the head noun of the succeeding NP as the head and separating it from the preceding verb phrase (VP).
4. Split NP at genitives into separate NPs as genitives are considered as separate chunks in Bengali.
The rules are demonstrated by the example below:
(5) (NP Your self-confidence) (ADVP also) (VP increases (PP with (NP teeth))) → (NP Your) (NP self-confidence also) (VP increases) (NP with teeth)
which now consistently maps to the corresponding chunks in the parallel Bengali sentence:
(6) (NP daanter "teeth" jonyo "for") (NP aapnaar "your") (NP aatmaviswas "self-confidence" o "also") (VP baadhe "increases")
Along with harmonizing the chunks, this module marks the head of each chunk in both languages using generalized rules defined by Sharma et al. (2006). For clarity, we have mapped the POS tags from the Penn Treebank POS tagset (Marcus et al., 1993) for English and the Bureau of Indian Standards (BIS) POS tagset (Choudhary and Jha, 2011) for Bengali to the Universal Dependencies tagset (Nivre et al., 2016).
The second module in the pipeline performs rule-based chunk replacement. It takes the chunk-harmonized parallel Bengali and English sentences as inputs and replaces selected Bengali chunks with English ones according to the rules discussed in Section 2.2. First, the chunks, each represented by its head element, are aligned using word alignments obtained from Giza++ (Och and Ney, 2003). Next, we replace the Bengali noun chunks (NP) and adjectival chunks (JJP) with the corresponding English chunks. By keeping the verbal chunks (VP) intact, we ensure that Bengali is retained as the matrix language of the code-mixed sentence. Hybrid compound verbs (see Section 2.1) are a common occurrence in Bengali-English code-mixing, and we can successfully synthesize them by replacing the NP/JJP preceding Bengali light verbs. For example: (JJP porishkaara ("clean")) (VP koruna ("do")) → (JJP clean) (VP koruna ("do")). We also retain Bengali postpositions and drop English prepositions associated with the heads.
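The replacement step can be sketched as follows, using the chunks of examples (5) and (6). This is a simplified illustration rather than our actual script: the chunk dictionaries, the UD tags assigned to individual words, the hand-written chunk alignment (standing in for Giza++ output) and the function name `mix_chunks` are all assumptions, and morphological details such as splitting case markers from replaced nouns are not handled.

```python
REPLACEABLE_CHUNKS = {"NP", "JJP"}        # VP chunks stay in Bengali
CONTENT_UPOS = {"NOUN", "PROPN", "ADJ"}   # open-class words that may be swapped

def mix_chunks(bn_chunks, en_chunks, alignment):
    """Replace aligned Bengali NP/JJP chunks with English content words.

    Each chunk is {'tag': str, 'head': int, 'tokens': [(word, upos), ...]}.
    `alignment` maps a Bengali chunk index to an English chunk index.
    Bengali function words (e.g. postpositions) are kept, and English
    function words (e.g. prepositions) are dropped.
    """
    mixed = []
    for i, bn in enumerate(bn_chunks):
        bengali_words = [word for word, _ in bn["tokens"]]
        head_upos = bn["tokens"][bn["head"]][1]
        if (i not in alignment or bn["tag"] not in REPLACEABLE_CHUNKS
                or head_upos not in CONTENT_UPOS):
            mixed.append(bengali_words)   # unaligned, verbal or closed-class head
            continue
        en = en_chunks[alignment[i]]
        en_content = [w for w, pos in en["tokens"] if pos in CONTENT_UPOS]
        bn_function = [w for w, pos in bn["tokens"] if pos not in CONTENT_UPOS]
        mixed.append(en_content + bn_function if en_content else bengali_words)
    return mixed

bn_chunks = [
    {"tag": "NP", "head": 0, "tokens": [("daanter", "NOUN"), ("jonyo", "ADP")]},
    {"tag": "NP", "head": 0, "tokens": [("aapnaar", "PRON")]},
    {"tag": "NP", "head": 0, "tokens": [("aatmaviswas", "NOUN"), ("o", "PART")]},
    {"tag": "VP", "head": 0, "tokens": [("baadhe", "VERB")]},
]
en_chunks = [
    {"tag": "NP", "head": 0, "tokens": [("Your", "PRON")]},
    {"tag": "NP", "head": 0, "tokens": [("self-confidence", "NOUN"), ("also", "ADV")]},
    {"tag": "VP", "head": 0, "tokens": [("increases", "VERB")]},
    {"tag": "NP", "head": 1, "tokens": [("with", "ADP"), ("teeth", "NOUN")]},
]
alignment = {0: 3, 1: 0, 2: 1, 3: 2}      # hand-written for this example

mixed_sentence = " ".join(w for chunk in mix_chunks(bn_chunks, en_chunks, alignment) for w in chunk)
print(mixed_sentence)
```

On this input the sketch yields "teeth jonyo aapnaar self-confidence o baadhe": the two noun heads are replaced, while the pronoun, the postposition, the particle and the verb remain Bengali.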
Mixing the Bengali sentence (6) with the parallel English sentence (5) will generate:
(7) (NP teeth er "of" jonyo "for") (NP aapnaar "your") (NP self-confidence o "also") (VP baadhe "increases")
This is one of the acceptable combinations of the two sentences to form a CM sentence.
We use the parallel corpora for English, Bengali and Hindi provided by the Indian Languages Corpora Initiative (ILCI) (Jha, 2010), which belong to the health domain. We select a subset of 10,000 parallel sentences from each language and generate code-mixed sentences for both the Bengali-English and Hindi-English language pairs following the constraints in Section 2.2. Thus, we have parallel corpora for code-mixed Bengali-English and Hindi-English along with parallel corpora for Bengali, Hindi and English. We obtain only 5,063 code-mixed sentences with a minimum CM ratio of 30:70 (%). We attribute this to the non-alignment of some heads in many Bengali and Hindi sentences with the heads of the corresponding English sentence. In spite of strictly following these rules, we generated a few erroneous sentences with word repetitions due to inconsistent chunking of multi-word expressions. We mitigate those errors in a post-processing step by removing repeated words at code-mixing points. We detect such repetitions by calculating the cosine similarity between the words, represented by their cross-lingual embeddings (see Section 4). For example: chiniyukta ("sugared") sugared gums → sugared gum
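A minimal sketch of this post-processing step is given below. It assumes that a Bengali word is dropped when the English word immediately following it is a near-synonym in the shared embedding space; the direction of the check, the threshold of 0.8, the function names and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def drop_repeats(tokens, langs, embeddings, threshold=0.8):
    """Remove a Bengali word that is repeated by the next English word."""
    kept = []
    for i, (token, lang) in enumerate(zip(tokens, langs)):
        nxt = i + 1
        is_repeat = (
            lang == "bn"
            and nxt < len(tokens)
            and langs[nxt] == "en"
            and token in embeddings
            and tokens[nxt] in embeddings
            and cosine(embeddings[token], embeddings[tokens[nxt]]) >= threshold
        )
        if not is_repeat:
            kept.append(token)
    return kept

# Toy cross-lingual vectors; real embeddings would have many more dimensions.
toy_embeddings = {
    "chiniyukta": [0.90, 0.10],
    "sugared": [0.88, 0.12],
    "gums": [0.10, 0.95],
}
cleaned = drop_repeats(["chiniyukta", "sugared", "gums"], ["bn", "en", "en"], toy_embeddings)
print(cleaned)
```

Here the Bengali word is removed because its vector is almost parallel to that of the following English word, while the unrelated word is left in place.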

Synthetic Bengali-English Treebank
Cross-lingual annotation projection makes use of parallel data to project annotations from the source language to the target language through automatic word alignment. Hwa et al. (2002) proposed some basic projection heuristics to deal with different kinds of word alignments. Tiedemann (2014) proposed improvements to the annotation scheme by adding heuristics that remove unnecessary dummy nodes introduced in the target treebank to deal with problematic word alignments. We investigate the utility of annotation projection from the Hindi-English CM treebank (HE) and the Bengali monolingual treebank (B) to Bengali-English (BE). HE is created by parsing the Hindi-English CM data generated in Section 2.2 using the neural stacking dependency parser for Hindi-English. B is generated by parsing the parallel Bengali sentences using the same neural stacking dependency parser trained on a monolingual Bengali dependency treebank. The POS tagging and parsing accuracies of these two parsers are given in Table 2.
The basic setup for annotation projection is as follows:
1. Project annotations from B to BE for the matching head word nodes in Bengali and its dependent Bengali nodes.
2. Project annotations from HE to BE for the matching head word nodes in English and its dependent English nodes.
3. For each matching English dependent node in HE and BE with a Hindi head, find the aligned Bengali node in B. If the cosine similarity between the two is above a certain threshold (0.5), project annotations from B to BE.
4. For each matching Bengali dependent node in B and BE with an English head, find the aligned Hindi node in HE. If the cosine similarity between the two is above a certain threshold (0.5), project annotations from HE to BE.
In Figure 3, we demonstrate this with an example where the annotation for the BE tree is generated from both HE (in blue) and B (in red). Since the sentences in BE, HE and B are essentially parallel, we get a one-to-one mapping and do not need to introduce any dummy nodes. We select 3,643 completely annotated trees for our Syn-BE.
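The four projection steps can be sketched as below. This is one possible reading of the heuristics, not our exact implementation: it assumes that token i of BE, HE and B are aligned one-to-one by position, that in steps 3 and 4 the similarity is computed between the Hindi and Bengali words at the position where HE and BE differ, and that a tree counts as completely annotated only if no token is left without a projection. The tuple layout, the Hindi transliterations, the dependency labels and the similarity table are toy values.

```python
def project_annotations(be_langs, b_tree, he_tree, similarity, threshold=0.5):
    """Project (head, deprel) pairs onto a BE sentence from B and HE.

    be_langs: language of each BE token, 'bn' or 'en'.
    b_tree, he_tree: lists of (form, head, deprel); head is a 0-based token
    index, or -1 for the root. Returns one (head, deprel) or None per token.
    """
    projected = []
    for i, lang in enumerate(be_langs):
        # Bengali tokens are shared with B, English tokens with HE.
        own, other = (b_tree, he_tree) if lang == "bn" else (he_tree, b_tree)
        head = own[i][1]
        if head == -1 or be_langs[head] == lang:
            projected.append((own[i][1], own[i][2]))      # steps 1 and 2
            continue
        # Steps 3 and 4: dependent and head are in different languages.
        position = i if lang == "bn" else head
        hindi_word, bengali_word = he_tree[position][0], b_tree[position][0]
        if similarity(hindi_word, bengali_word) >= threshold:
            projected.append((other[i][1], other[i][2]))
        else:
            projected.append(None)
    return projected

be_langs = ["en", "bn", "bn", "en", "bn", "bn"]  # teeth jonyo aapnaar self-confidence o baadhe
b_tree = [("daanter", 5, "obl"), ("jonyo", 0, "case"), ("aapnaar", 3, "nmod"),
          ("aatmaviswas", 5, "nsubj"), ("o", 3, "advmod"), ("baadhe", -1, "root")]
he_tree = [("teeth", 5, "obl"), ("liye", 0, "case"), ("aapka", 3, "nmod"),
           ("self-confidence", 5, "nsubj"), ("bhi", 3, "advmod"), ("badhta", -1, "root")]
toy_similarity = {("badhta", "baadhe"): 0.9, ("liye", "jonyo"): 0.8,
                  ("aapka", "aapnaar"): 0.85, ("bhi", "o"): 0.3}

def similarity(hindi_word, bengali_word):
    return toy_similarity.get((hindi_word, bengali_word), 0.0)

result = project_annotations(be_langs, b_tree, he_tree, similarity)
is_complete = all(item is not None for item in result)
print(result, is_complete)
```

In this toy case one token falls below the threshold, so the tree would not be counted among the completely annotated trees.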

Dependency Parsing
We adapt a neural dependency parser that is based on a transition-based parser (Kiperwasser and Goldberg, 2016) and enhanced with neural stacks to incorporate monolingual syntactic knowledge into the CM model. The model jointly learns POS tagging and parsing by adapting feature-level neural stacks (Zhang and Weiss, 2016; Chen et al., 2016).

Experiments
Our models are trained on the English and Hindi UD-v2 treebanks. Due to the absence of a Bengali UD treebank, we converted the Paninian annotation scheme (Begum et al., 2008) used in the Bengali treebank to UD by slightly modifying the conversion rules for Hindi (Tandon et al., 2016). Characters are represented by 32-dimensional character embeddings, while the words in each language are represented by 64-dimensional word2vec vectors (Mikolov et al., 2013) learned using the skip-gram model. The hidden dimensions and learning hyperparameters are consistent with those of the neural stacking parser we adapt. For our baseline model, we train the neural stacking model for Bengali-English by training the source model on both the Bengali and English treebanks and stacking it on a CM model trained on the 140 Bengali-English CM (Gold-BE) sentences in our training set. Even though the size of the training set is limited, we benefit from the presence of unique CM grammar as well as syntactic information about social media elements. Our bilingual source model serves to transfer both POS tagging and parsing information to the CM model.
In our next experiment, we train the CM stacking model with 1448 Hindi-English CM sentences (Gold-HE) in addition to our 140 Gold-BE sentences. In order to fully capture the Hindi syntactic information in the CM data, we fortify the bilingual source model with the Hindi treebank, resulting in a trilingual source model. We try to reduce the differences between the data representations of Hindi and Bengali by using:
1. Cross-lingual word embeddings for Hindi and Bengali, obtained by projecting the word2vec embeddings for the two languages into the same space with the projection algorithm of Artetxe et al. (2016) and a bilingual lexicon from the ILCI parallel corpora.
2. WX notation to represent words from the two languages, together with a common 32-dimensional character embedding space.
For our final experiment, we augment the trilingual source model from the previous experiment with our synthetic code-mixed Bengali-English treebank (Syn-BE) and stack it on our CM model.

Results
We present our final results in  (1448). Moreover, the significantly lower parser accuracy for Bengali in comparison to Hindi (a difference of 9% LAS points) negatively impacts the performance of the source model (see Table 2).
Our next model, which fortifies the baseline model with Hindi monolingual data and Hindi-English CM data, improves all three measurements significantly because it enables us to utilize the relatively large UD-annotated Hindi-English CM data. UAS and LAS improve by 11.64% and 10.66% points respectively, and POS accuracy improves by approximately 8%. In this model, we slightly modify the word and character embedding representations in order to mitigate the lexical differences between Hindi and Bengali by using cross-lingual embeddings and a common character space. From Table 3, we observe that using cross-lingual embeddings improves tagging accuracy by 0.76%, UAS by approximately 0.6% points and LAS by approximately 0.5% points. Using a common character space through WX notation further improves the accuracy of tagging and parsing by approximately 1.8% and 2.5% points respectively. These significant improvements confirm the inherent similarity between the code-mixing grammars of Hindi and Bengali with English, as both language pairs deal with the mixing of two typologically diverse languages.
Our final model utilizes our Syn-BE CM treebank by augmenting the trilingual source model with it and stacking the result on the CM model trained on our Gold-HE and Gold-BE datasets. We observe an improvement in Bengali-English parser accuracy of 1.82% UAS points and 1.37% LAS points, and in POS tagging accuracy of 2.2%. This improvement is satisfactory considering the errors propagated into our Syn-BE treebank by projecting annotations from automatically parsed Bengali and Hindi-English treebanks. We must also note that the domain of Syn-BE (health) lacks certain social media elements and constructs present in the evaluation set.
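For reference, the two parsing metrics reported above can be computed as in the following sketch, which uses the standard definitions: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. The function name and the toy gold and predicted analyses are illustrative and are not taken from our evaluation scripts.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for aligned lists of (head, deprel) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_labeled = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_labeled / total

# Toy example with four tokens: one wrong head and one wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "case")]
predicted = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "case")]
uas, las = attachment_scores(gold, predicted)
print(uas, las)
```

A token with the right head but the wrong label counts towards UAS only, which is why LAS can never exceed UAS.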

Conclusion
Our neural stacking model utilizing monolingual, gold and synthetic CM resources shows a significant improvement over the baseline model: 10.24% in POS tagging accuracy, 13.76% in UAS and approximately 12% in LAS points. Augmenting the stacking model with the Syn-BE CM treebank improves POS tagging accuracy by 2.2% points and parser accuracy by 1.82% UAS points and 1.37% LAS points. The Syn-BE CM data can also be used to improve other NLP systems, such as machine translation and question answering. There is scope for extending the Syn-BE corpus by including more CM constructions, such as intra-sentential switching and CM sentences with English as the matrix language. Our evaluation dataset of 500 UD-annotated Bengali-English tweets provides a valuable resource for research on code-mixing.