Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text

In this study, the problem of shallow parsing of Hindi-English code-mixed social media text (CSMT) has been addressed. We have annotated the data and developed a language identifier, a normalizer, a part-of-speech tagger and a shallow parser. To the best of our knowledge, we are the first to attempt shallow parsing on CSMT. The pipeline developed has been made available to the research community with the goal of enabling better text analysis of Hindi-English CSMT. The pipeline is accessible online.¹


Introduction
Multilingual speakers tend to exhibit code-mixing and code-switching in their use of language on social media platforms. Code-mixing is the embedding of linguistic units such as phrases, words or morphemes of one language into an utterance of another language, whereas code-switching refers to the co-occurrence of speech extracts belonging to two different grammatical systems (Gumperz, 1982).
Here we use code-mixing to refer to both scenarios.
Hindi-English bilingual speakers produce huge amounts of CSMT. Vyas et al. (2014) noted that the complexity in analyzing CSMT stems from non-adherence to a formal grammar, spelling variations, lack of annotated data, the inherent conversational nature of the text and, of course, code-mixing. Therefore, there is a need to create datasets and Natural Language Processing (NLP) tools for CSMT, as traditional tools are ill-equipped for it. Taking a step in this direction, we describe the shallow parsing pipeline built during this study.
¹ http://bit.ly/csmt-parser-api

Background
Bali et al. (2014) gathered data from Facebook generated by English-Hindi bilingual users which, on analysis, showed a significant amount of code-mixing. Barman et al. (2014) investigated language identification at the word level on Bengali-Hindi-English CSMT. They annotated a corpus with more than 180,000 tokens and achieved an accuracy of 95.76% using statistical models with monolingual dictionaries. Solorio and Liu (2008) experimented with POS tagging for English-Spanish code-switched discourse by using pre-existing taggers for both languages and achieved an accuracy of 93.48%. However, the data used was manually transcribed and thus lacked the problems added by CSMT. Vyas et al. (2014) formalized the problem, reported challenges in processing Hindi-English CSMT and performed initial experiments on POS tagging. Their POS tagger accuracy fell by 14% to 65% without using gold language labels and normalization. Thus, language identification and normalization are critical for POS tagging (Vyas et al., 2014), which in turn is critical further down the pipeline for shallow parsing, as evident in Table 5. Jamatia et al. (2015) also built a POS tagger for Hindi-English CSMT using Random Forests on 2,583 utterances with gold language labels and achieved an accuracy of 79.8%. In the monolin-

Data Preparation
CSMT was obtained from the social media posts shared for Subtask 1 of the FIRE-2014 Shared Task on Transliterated Search. The existing annotation on the FIRE dataset was removed, posts were broken down into sentences, and 858 of those sentences were randomly selected for manual annotation.
Table 1 and Table 2 show the distribution of the dataset at sentence and token level respectively. The language of 63.33% of the tokens in code-mixed sentences is Hindi. Based on the distribution, it is reasonable to assume that Hindi is the matrix language (Azuma, 1993; Myers-Scotton, 1997). The dataset comprises sentences similar to examples 1 and 2. Example 1 shows code-switching, as the language switches from English to Hindi, whereas example 2 shows code-mixing, as some English words are embedded in a Hindi utterance. Spelling variations (sm → some, gov → government), ambiguous words (To: 'So' in Hindi or 'To' in English) and non-adherence to a formal grammar (out-of-place ellipses '...', missing or misplaced punctuation) are some of the challenges evident in analyzing these examples.

Annotation
Annotation was done on the following four layers:
1. Language Identification: Every word was given one of three tags, 'en', 'hi' or 'rest', to mark its language. Words that a bilingual speaker could identify as belonging to either Hindi or English were marked as 'hi' or 'en'. The label 'rest' was given to symbols, emoticons, punctuation, named entities, acronyms, foreign words and words with sub-lexical code-mixing like chapattis (Gloss: chapattibread), which is a Hindi word (chapatti) following English morphology (plural marker -s).
2. Normalization: Words with language tag 'hi' in Roman script were labeled with their standard form in the native script of Hindi, Devanagari. Similarly, words with language tag 'en' were labeled with their standard spelling. Words with language tag 'rest' were kept as they are. This acted as testing data for our Normalization module.
3. Parts-of-Speech (POS): The Universal POS tagset (Petrov et al., 2011) was used to label the POS of each word, as this tagset is applicable to both English and Hindi words. Sub-lexical code-mixed words were annotated based on their context, since POS is a function of a word in a given context. For example, an English verb used as a noun in a Hindi context is labeled as a noun.
4. Chunking: ... et al., 2006). Unlike AnnCorra, only one tag is used for all verb chunks in our tagset. The chunk boundary is marked using BI notation, where the 'B-' prefix indicates the beginning of a chunk and the 'I-' prefix indicates that the word is inside a chunk.
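The BI boundary notation can be illustrated with a small sketch. The chunk labels `NP` and `VG` and the function names below are illustrative assumptions, not the tagset or tooling used in this study.

```python
def chunks_to_bi(chunks):
    """Flatten (label, words) chunks into (word, tag) pairs in BI notation.

    The first word of each chunk gets a 'B-' prefix, the rest get 'I-'.
    """
    tagged = []
    for label, words in chunks:
        for position, word in enumerate(words):
            prefix = "B-" if position == 0 else "I-"
            tagged.append((word, prefix + label))
    return tagged


def bi_to_chunks(tagged):
    """Group (word, tag) pairs back into (label, words) chunks."""
    chunks = []
    for word, tag in tagged:
        prefix, label = tag.split("-", 1)
        # Start a new chunk on 'B-', or defensively on a stray 'I-' tag.
        if prefix == "B" or not chunks or chunks[-1][0] != label:
            chunks.append((label, [word]))
        else:
            chunks[-1][1].append(word)
    return chunks
```

Because every chunk opens with a 'B-' tag, two adjacent chunks with the same label remain distinguishable, which is why a boundary prefix is needed in addition to the label.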
The whole dataset was annotated by eight Hindi-English bilingual speakers. Two other annotators reviewed and cleaned it. To measure inter-annotator agreement, another annotator read the guidelines and annotated 25 sentences (334 tokens) from scratch. The inter-annotator agreement calculated using Cohen's κ (Cohen, 1960) came out to be 0.97, 0.83 and 0.89 for language identification, POS tagging and shallow parsing respectively.
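For reference, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance from their individual label distributions. The sketch below is a generic implementation of that standard formula; the function name and the toy labels in the usage note are illustrative and are not taken from the annotated corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-annotator label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)
```

For instance, with annotator A giving `["hi", "hi", "en", "en"]` and annotator B giving `["hi", "hi", "en", "hi"]`, the observed agreement is 0.75, the chance agreement is 0.5, and κ is 0.5.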

Shallow Parsing Pipeline
Shallow parsing is the task of identifying and segmenting text into syntactically correlated word groups (Abney, 1992; Harris, 1957). Shallow parsing is a viable alternative to full parsing, as shown by Li and Roth (2001). Our shallow parsing pipeline is composed of four main modules, as shown in Figure 1. These modules, in the order of their usage, are Language Identification, Normalization, POS Tagger and Shallow Parser.
Our pipeline takes a raw utterance in Roman script as input, on which each module runs sequentially. Twokenizer² (Owoputi et al., 2013), which performs well on Hindi-English CSMT (Jamatia et al., 2015), was used to tokenize the utterance into words. The Language Identification module assigns each token a language label. Based on the language label assigned, the Normalizer runs the Hindi normalizer or the English/Rest normalizer. The POS tagger uses the output of the normalizer to assign each word a POS tag. Finally, the Shallow Parser assigns a chunk label with boundary.
The functionality and performance of each module are described in greater detail in the following subsections.
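The sequential flow described above can be sketched as follows. This is a minimal illustration and not the released system: every callable passed to `run_pipeline` is a hypothetical stand-in for the corresponding module, and the signatures are assumptions made for the example.

```python
def run_pipeline(utterance, tokenize, identify_language,
                 normalize_hi, normalize_en_rest, tag_pos, chunk):
    """Run the four modules in order on one Roman-script utterance.

    All callables are placeholders (assumed interfaces):
      tokenize(str) -> list of tokens
      identify_language(tokens) -> list of 'hi' / 'en' / 'rest'
      normalize_hi(token), normalize_en_rest(token) -> normalized token
      tag_pos(normalized, langs) -> list of POS tags
      chunk(normalized, pos_tags) -> list of chunk tags
    """
    tokens = tokenize(utterance)
    langs = identify_language(tokens)
    # The language label decides which normalizer handles each token.
    normalized = [
        normalize_hi(tok) if lang == "hi" else normalize_en_rest(tok)
        for tok, lang in zip(tokens, langs)
    ]
    pos_tags = tag_pos(normalized, langs)
    chunk_tags = chunk(normalized, pos_tags)
    return list(zip(tokens, langs, normalized, pos_tags, chunk_tags))
```

Passing the modules in as arguments keeps the sketch independent of any particular model, and makes it easy to substitute gold annotations for a module when measuring how errors propagate.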

Language Identification
While language identification at the document level is a well-established task (McNamee, 2005), identifying language in social media posts has certain challenges associated with it. Spelling errors, phonetic typing, use of transliterated alphabets and abbreviations combined with code-mixing make this problem interesting. Similar to Barman et al. (2014), we performed two experiments treating language identification as a three-class ('hi', 'en', 'rest') classification problem. The feature set comprised:
BNC: normalized frequency of the word in the British National Corpus (BNC)³.
LEXNORM: binary feature indicating presence of the word in the lexical normalization dataset released by Han et al. (2011).
HINDI DICT: binary feature indicating presence of the word in a dictionary of 30,823 transliterated Hindi words as released by Gupta et al. (2012).
NGRAM: word n-grams.
AFFIXES: prefixes and suffixes of the word.
Using these features and introducing a context window of n words, we trained a linear SVM. In another experiment, we modeled language identification as a sequence labeling task using a CRF. The idea behind this was that code-mixed text has some inherent structure which is largely dictated by the matrix language of the text. The CRF approach achieved a higher accuracy, which supports this hypothesis. The results of this module are shown in Table 3.
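As an illustration of how such per-token features could be assembled before being passed to a classifier or sequence labeler, consider the sketch below. The function name, the feature keys, the `<pad>` marker, the lowercasing and the default window and affix lengths are all assumptions for the example; the lookup tables are supplied by the caller and stand in for the BNC frequencies, the lexical normalization vocabulary and the transliterated Hindi dictionary.

```python
def token_features(tokens, i, hindi_dict, lexnorm_vocab, bnc_freq,
                   window=1, max_affix=3):
    """Build a feature dictionary for tokens[i].

    hindi_dict, lexnorm_vocab: sets of lowercase words (assumed inputs).
    bnc_freq: dict mapping a word to its normalized corpus frequency.
    """
    word = tokens[i].lower()
    feats = {
        "word": word,
        "bnc": bnc_freq.get(word, 0.0),        # BNC
        "lexnorm": word in lexnorm_vocab,      # LEXNORM
        "hindi_dict": word in hindi_dict,      # HINDI DICT
    }
    # AFFIXES: prefixes and suffixes up to max_affix characters.
    for k in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix{k}"] = word[:k]
        feats[f"suffix{k}"] = word[-k:]
    # Context window: neighboring words, padded at utterance edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
    return feats
```

Dictionaries of this shape are a common input format for CRF toolkits and can also be vectorized for a linear SVM.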

Normalization
Once the language identification task was complete, the noisy non-standard tokens in the text (such as Hindi words written inconsistently in many ways using the Roman script) had to be converted into standard words. To do this, we built a normalization module that performs language-specific transformations, yielding the correct spelling for a given word. Two language-specific normalizers, one for Hindi and the other for English/Rest, each had two sub-normalizers, as described below. Both sub-normalizers generated normalized candidates which were then ranked, as explained later in this subsection.
1. Noisy Channel Framework: A generative model was trained to produce noisy (unnormalized) tokens from a given normalized word. Using the model's confidence score and the probability of the normalized word in the background corpus, n-best normalizations were chosen. First, we obtained character alignments between noisy Hindi words in Roman script ($H_r$) and normalized Hindi words ($H_w$) using GIZA++ (Och and Ney, 2003) on 30,823 Hindi word pairs of the form ($H_w$, $H_r$) (Gupta et al., 2012). Next, a CRF classifier was trained over these alignments, enabling it to convert a character sequence from Roman to Devanagari using learnt letter transformations. Using this model, noisy $H_r$ words were created for $H_w$ words obtained from a dictionary of 117,789 Hindi words (Biemann et al., 2007). Finally, using the formula below, we computed the most probable $H_w$ for a given $H_r$.
The candidates obtained from these two systems are ranked on the basis of the observed precision of the systems. The top-k candidates from each system are selected if they have a confidence score greater than an empirically determined threshold Λ. A similar approach was used for English text normalization, using the English normalization pairs from Han et al. (2012) and Liu et al. (2012) for the noisy channel framework, and Aspell⁵ as the spell-checker. Words with language tag 'rest' were left unprocessed.
The accuracy of the Hindi Normalizer was 78.25%, and that of the English Normalizer was 69.98%. The overall accuracy of this module is 74.48%; P@n (Precision@n) for n=3 is 77.51% and for n=5 is 81.76%.
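The noisy-channel selection step can be sketched as below: each candidate normalized word is scored by the product of the channel probability p(H_r | H_w) and the prior p(H_w), and the n best are kept. The function name, the table layout and the toy probabilities in the usage note are assumptions for illustration; in the study these quantities come from the trained CRF and the background corpus.

```python
import math


def rank_candidates(noisy, channel_prob, prior_prob, n_best=3):
    """Return the n most probable normalized words for a noisy token.

    channel_prob: dict mapping (noisy, candidate) -> p(noisy | candidate).
    prior_prob:   dict mapping candidate -> p(candidate) in a corpus.
    Both tables are assumed inputs supplied by the caller.
    """
    scored = []
    for candidate, p_prior in prior_prob.items():
        p_channel = channel_prob.get((noisy, candidate), 0.0)
        if p_channel > 0.0 and p_prior > 0.0:
            # Sum of logs equals the log of the product and avoids underflow.
            score = math.log(p_channel) + math.log(p_prior)
            scored.append((score, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:n_best]]
```

With toy values such as `channel_prob = {("kya", "क्या"): 0.6, ("kya", "किया"): 0.3}` and `prior_prob = {"क्या": 0.02, "किया": 0.01}`, the candidate "क्या" is ranked first because its product 0.012 exceeds 0.003.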

Part-Of-Speech Tagging
Part-of-Speech (POS) tagging provides a basic level of syntactic analysis for a given word or sentence. It was modeled as a sequence labeling task using CRF. The feature set comprised:
Baseline: Word-based features (affixes, context and the word itself).
LANG: Language label of the token.
NORM: Normalized lexical features.
TPOS: Output of the Twitter POS tagger (Owoputi et al., 2013).

Pipeline Results
The best-performing model was selected from each module and was used in the pipeline. Table 6 tabulates the step-by-step accuracy of the pipeline, calculated using 10-fold cross-validation.

Conclusion and Future Work
In this study, we have developed a system for Hindi-English CSMT data that can identify the language of the words, normalize them to their standard forms, assign them their POS tags and segment them into chunks. We have released the system.
In the future, we intend to continue creating more annotated code-mixed social media data. We would also like to improve upon the challenging problem of normalizing monolingual Hindi social media sentences. We would further extend our pipeline and build a full parser, which has many applications in NLP.

Figure 1: Schematic Diagram of the Pipeline

Table 6: Pipeline accuracy and error propagation. LI = Language Identification, Norm = Normalizer, POS = POS Tagger, SP = Shallow Parser, L = Label, B = Boundary, C = Combined, P1 = Actual Pipeline, P2 = Gold Pipeline, E = Error Propagation

Shallow Parsing
A chunk comprises two aspects: the chunk boundary and the chunk label. Shallow parsing was modeled as three separate sequence labeling problems (Label, Boundary and Combined), for each of which a CRF model was trained. The feature set comprised:
POS: POS tag of the word.
POS Context: POS tags in the context window of length 5, i.e., the two previous tags, the current tag and the next two tags.
POS LEX: A special feature made up of the concatenation of POS and LEX.
NORMLEX: The word in its normalized form.
The results of this module are shown in Table 5.
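The shallow parser features listed above might be assembled per token roughly as follows. This is only a sketch under stated assumptions: LEX is taken here to be the normalized word, the underscore separator in the concatenated feature is arbitrary, and the function name, feature keys and `<pad>` marker are hypothetical.

```python
def chunk_features(pos_tags, norm_words, i):
    """Build shallow-parsing features for position i of one sentence."""

    def pos_at(j):
        # Pad positions that fall outside the sentence.
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<pad>"

    feats = {
        "pos": pos_tags[i],                           # POS
        "normlex": norm_words[i],                     # NORMLEX
        "pos_lex": pos_tags[i] + "_" + norm_words[i]  # POS LEX (assumed form)
    }
    # POS Context: window of length 5 centered on the current tag.
    for offset in (-2, -1, 1, 2):
        feats[f"pos[{offset:+d}]"] = pos_at(i + offset)
    return feats
```

The same feature dictionaries could feed all three formulations (Label, Boundary and Combined); only the target tags would differ.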

Table 1: Data distribution at sentence level.

Table 2: Data distribution at token level.

1. Gloss: Hey... try for some government job which forms give out...
   Translation: Hey... try for some government job which gives out forms...
2. To tum divya bharti mandir marriage kendra ko donate karna
   Gloss: So you divya bharti temple marriage center to donate do
   Translation: So you donate to divya bharti temple marriage center

² http://www.ark.cs.cmu.edu/TweetNLP/

Table 3: Feature Ablation for Language Identifier

$$\arg\max_{H_{w_i}} p(H_{w_i} \mid H_r) = \arg\max_{H_{w_i}} p(H_r \mid H_{w_i})\, p(H_{w_i})$$

where $p(H_{w_i})$ is the probability of word $H_{w_i}$ in the background corpus.
HPOS: Output of IIIT's Hindi POS tagger⁶.
COMBINED: HPOS for Hindi words and TPOS for English and Rest.
The results of the POS tagger are shown in Table 4.

Table 5: Feature Ablation for Shallow Parser