Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets

While growing code-mixed content on Online Social Networks(OSN) provides a fertile ground for studying various aspects of code-mixing, the lack of automated text analysis tools render such studies challenging. To meet this challenge, a family of tools for analyzing code-mixed data such as language identifiers, parts-of-speech (POS) taggers, chunkers have been developed. Named Entity Recognition (NER) is an important text analysis task which is not only informative by itself, but is also needed for downstream NLP tasks such as semantic role labeling. In this work, we present an exploration of automatic NER of code-mixed data. We compare our method with existing off-the-shelf NER tools for social media content,and find that our systems outperforms the best baseline by 33.18 % (F1 score).


Introduction
Code-switching or code-mixing occurs when "lexical items and grammatical features from two languages appear in one sentence" (Muysken, 2000). 1 It is frequently seen in multilingual communities and is of interest to linguists due to its complex relationship with societal factors (Sridhar and Sridhar, 1980).
With the rise of Web 2.0, the volume of text on online social networks (OSN) such as Twitter, Facebook, Reddit has grown.It is estimated that around 240 Million Indian users, alone, are active on Twitter2 .A significant fraction of these users are bilingual, or even trilingual, and their tweets can be monolingual in English or their vernacular, or code-mixed.Past research has looked at multiple dimensions of this behaviour such as its relationship to emotion expression (Rudra et al., 2016) and identity.Code-mixing or multilingualism of tweets poses a significant problem to both the OSNs' underlying text mining algorithms as well as researchers trying to study online discourse, since most existing tools for analyzing OSN text content caters to monolingual data.For example, Twitter's abuse detection systems fail to flag code-mixed tweets as offensive. 3Recent efforts to build tools for code-mixed content include language identifiers (Solorio and Liu, 2008), partsof-speech(POS) taggers (Vyas et al., 2014), and chunking (Sharma et al., 2016).A natural extension of these set of automated natural language processing (NLP) tools is a Named Entity Recognizer (NER) for code-mixed social media data, which we present in this paper.Additionally, as language tags are an essential feature for NLP tasks, including NER, we also present a neural network based language identifier.
Our main contributions are: 1. Building a token-level language identification system for Hindi-English (Hi-En) code mixed tweets, described in detail in Section 3.
2. Building an NER for En-Hi code-mixed tweets, which we explain in Section 4. We also show, in Section 5, that our NER performs better than existing baselines.

Related Work
In this section we briefly describe the approaches for automatic language identification and extrac-tion of named entities.Language Identification for code-mixed content has been previously explored in Barman et al. (2014).Particularly close to our work is the use of deep-learning approaches for detecting tokenlevel language tags for code-mixed content (Jaech et al., 2016).
We particularly focus on efforts to building NERs for social media content and, NERs for Indian languages and code-mixed corpora.Social Media text, including and especially tweets, have subtle variations from written and spoken text.These include slacker grammatical structure, spelling variations, ad-hoc abbreviations and more.See Ritter et al. (2011) for detailed differences between tweets and traditional textual sources.Monolingual NER for tweets include (Ritter et al., 2011;Li et al., 2012).We build on these approaches to account for code-mixing and use the former as a baseline to test our method against.

Language Identification using Transliteration
We build a token level language identification model (LIDF) for code-mixed tweets using monolingual corpora available for both the languages, supplemented by a small set of annotated codemixed tweets.We hypothesize that words or character sequences of different languages encode different structures.Therefore, we aim to capture this character structure for both languages.Subsequently, given a token, we try to see which language fits better.Our LIDF algorithm comprises of multiple steps, described in Figure 1.Each of the steps have been explained in detail below.

Roman-Devanagari transliteration
We restrict ourselves to instances of En-Hi codemixing where the Hindi component is written in Roman script. 4Therefore, any model trained on Hindi corpora (which will be in Devanagari) is not directly usable.To make use of such a corpus, we first transliterate words written in Roman to Devanagari script.
We train a model T , that takes a token written in Roman characters and generates its Devanagari equivalent (given token school, T generates ).We follow an approach used by (Rosca Figure 1: Different steps of our language identification pipeline.We extract features using language models trained on monolingual corpora, and train a classifier based on these features. and Breuel, 2016) for English-Arabic transliteration and, and train an Attention Sequence to Sequence (Seq2Seq) model (Bahdanau et al., 2014).
Given an input sequence of Roman characters, T generates a sequence of Devanagari characters.We train T using Roman-Devanagari transliteration pairs mined by (Gupta et al., 2012) and(Khapra et al., 2014).After combining the two sets and removing duplicates, we were left with 41,383 unique pairs.
We explore multiple Seq2Seq models, experimenting with different RNN cells, encoder and decoder depths, output activation functions and the number of RNN cells in each layer.We evaluate their performance using the character error rate metric, comparing with LITCM (Bhat et al., 2015) as a baseline.We report the performance of our top five models in Table 1, all of which perform better than the baseline.
Our best model comprises of a 2 layer bidirectional RNN encoder and a 2 layer RNN decoder, each layer comprising of 128 GRU units with ReLU activation.We use the Adagrad optimizer to train T , adding a dropout of 0.5 after each layer and use the early stopping technique to prevent over fitting.Larger and deeper networks perform relatively poorer since they start overfitting quickly.
We observe that Hindi words which have multiple spellings when written in Roman script are all transliterated to the same Hindi token.Therefore, T could also be used for normalization and we hope to investigate this in more detail in future.

Extracting features using monolingual corpora
For both languages, we learn the structure of the language at a character level.We do this by training a model which for a token of length n, learns to predict the last character given the first n − 1 characters as input.More formally, for each token {c 1 , c 2 , ..., c n }, the model learns to predict c n given the input sequence {c 1 , c 2 , c n−1 }.We model this as a sequence classification problem using LSTM RNNs (Hochreiter and Schmidhuber, 1997).We use ELM to refer to the English language model, and HLM to refer to the Hindi language model.
The same architecture is used for both ELM and HLM, as shown in Figure 1.Both comprise of two RNN layers with 128 LSTM cells each, using ReLU activation.The output of the second RNN layer at the last time step is connected to a fully connected (FC) layer with softmax activation.The size of this FC layer is equal to the character vocabulary V of the language.We take a softmax over the FC layer to predict c n .The normalized outputs from the FC layer can be thought of as a probability distribution over V , the i th normalized output being equal to P (V i |c 1 , c 2 , ..., c n−1 ), where V i is the i th character in the vocabulary.We refer to the normalized outputs from the FC layer of ELM and HLM as P ELM and P HLM respectively, and use them as features for our language detection classifier.
For training ELM, we use the News Crawl dataset provided as a part of the WMT 2014 trans-lation task. 5As a preprocessing step, we remove all non-alphabetic characters (such as numbers and punctuations), and convert all uppercase alphabets to lowercase.After preprocessing, we were left with a total of 98,565,179 tokens, 189,267 of which were unique.For HLM, we use the IIT Bombay Hindi corpus (Kunchukuttan et al., 2017).We follow the same preprocessing steps, except for converting to lowercase (since there is no concept of case in Hindi).This yielded 59,494,325 tokens, of which 161,020 were unique.
The input sequence is encoded into a sequence of one hot vectors before feeding it to the network.We use categorical cross-entropy as the loss function, optimizing the model using gradient descent (Adagrad).20% of the unique tokens are held out and used to validate the performance of our model.Once again, we use the early stopping technique and add a dropout of 0.5 after each layer to avoid over-fitting.

Predicting language tag using P ELM and P HLM
Given a word {c 1 , c 2 , ..., c n } we first transliterate it to Devanagari using T , generating {c 1 , c 2 , ..., c k }.Then by passing {c 1 , c 2 , ..., c n−1 } and {c 1 , c 2 , ..., c k−1 } through ELM and HLM respectively, we obtain P ELM and P HLM .Our hypothesis is that we can differentiate P ELM of a Hindi word from the P ELM of an English word, since the character sequence structure of a Hindi word is different from that of English words (which are used to train ELM).Similarly, we can differentiate P HLM of an English word from the P HLM of a Hindi word.We use a set of tweets curated by Sakshi Gupta and Radhika (2016) which are annotated for language at a token level (each token is either English, Hindi or Rest) to train a three class classifier using (i) P ELM (ii) P HLM , (iii) ratio of nonalphabetic characters in W , (iv) ratio of capitalization in W and (v) Binary feature indicating whether W is title-case as features.The last three features help identify the Rest tokens.Our 3-class classifier is a fully connected neural network with 2 hidden layers using ReLU activation.On 5fold cross validation, our model achieves an average F 1 score of 0.934 and an average accuracy of 0.961 across the three classes.This is a slight improvement over the model proposed by Sharma et al. (2016), which had an accuracy of 0.938 on the same dataset (as reported by reported by Sakshi Gupta and Radhika (2016)).

Named Entity Recognition
Named entity recognition typically comprises of two components, (i) entity segmentation and (ii) entity classification.Both these components can either be modeled separately as done by Ritter et al. (2011), or they can be combined and tackled together like the model proposed by Finkel et al. (2005).We adopt the latter approach, modeling both components together as a sequence labeling task.Our hypothesis is that named entities can be identified using features extracted from words surrounding it.We explore models using Conditional Random Fields and LSTM RNNs using handcrafted features described below.

Features
Our hand-crafted features are described below.
• Token based features : The current token T , T after stripping all characters which are not in the Roman alphabet (T clean ), and converting all characters in T clean to lowercase (T norm ) generates three different features.We create T wordhsape by replacing all uppercase letters in T with X, all lowercase letters with x, all numerals with o and leave all other characters as they are.For example, Adam123 becomes Xxxxooo.We also use token length T L as as feature.
• Affixes : Prefixes and suffixes of length 1 to 5 extracted from T , padded with whitespace if needed.These help in identifying phrases that are not entities.For example, an English token ending in ing is highly unlikely to be a named entity.
• Character based features : Binary features indicating (i) whether T is title case, (ii) whether T has an uppercase letter, (iii) whether T has all uppercase characters, (iv) whether T has a non-alphanumeric character and (v) whether T is a hashtag.We also calculate the fraction of characters in T which are ASCII • Language based features : The language which T belongs to, as predicted by the model proposed in section 3.For the LSTM model, we also use P ELM and P HLM generated for T .
• Syntactic features : POS tag for T as predicted by the model trained by Owoputi et al. (2013), POS tag and chunk tag for T and as predicted by the shallow parser trained by Sharma et al. (2016) • Tweet capitalization features : From the tweet that T belongs to, we extract (i) fraction of characters that are uppercase, (ii) fraction of characters that are lowercase, (iii) fraction of tokens that are title case.6

Proposed Models
Our LSTM model (Figure 3) comprises of two bidirectional RNN layers using LSTM cells and ReLU activation.The input at the time step t is F t , i.e. the feature vector for the token at position t in the tweet.F t is generated by concatenating the extracted features for the token at position t.layer F C t .We take a softmax over the output of F C t to predict the label (L t ) for the token at position t.While training, we use the Adagrad optimizer and add a dropout of 0.5 after each layer.Our CRF model is the standard generalized CRF proposed by Lafferty et al. (2001), which allows a flow of information across the sequence in both directions.We add L 1 and L 2 regularization to prevent over-fitting, and do an extensive grid search to come up with optimum values for these constants.

Data collection and annotation
We randomly sampled 50,000 tweets from the code mixed tweet dataset collected by (Patro et al., 2017).On this set, we ran our language detection algorithm and filtered tweets which had at least five Hi tokens.The filtered data also had tweets containing Roman tokens belonging to languages other than English and Hindi (like transliterated Telugu), such tweets were removed during the annotation process.
The final dataset comprised of 2079 tweets (35,374 tokens).13,860 (39.18 %) of the tokens were En, 11,391 (32.2 %) Hi and 10,123 (28.61 %) Rest.Each tweet was annotated at a token level for three classical named entity types P erson, Location and Organisation, using the IOB format.The annotation process was carried out by three linguists proficient in both English and Hindi.The final label was decided based on majority vote and any instance (around 2%) where all three annotators disagreed was resolved through discussion.In all, the annotators identified 2763 entity phrases (3751 tokens) which included 1,644 P erson entities, 744 Location entities and 375 Organisation entities.

Baselines and Results
We compare our proposed models (LSTM and CRF) with two baseline systems, (i) a state of the art English NER model proposed by Lample et al. (2016) and (ii) a state of the art NER model for tweets proposed by Ritter et al. (2011).Named Entities do not have belong to one particular language though the same NE might have different forms in different languages, such as Cologne in English, is Kln in German.For the purpose of this study, we do not assign NEs any language tags and leave the detection and mapping of multiple NE forms as future work.
The results are summarized in Table 2 (segmentation) and Table 3 (segmentation and classification).All metrics are calculated on a phrase level, no partial credit is awarded.An incorrectly identified boundary is penalized as both, a false positive and a false negative.For computing Table 2, we generalize entity tags (B-PER, B-LOC, B-ORG become B-ENT, I-PER, I-LOC, I-ORG become I-ENT).As expected, both these systems fare poorly on our data at both entity segmentation and entity classification.We believe this is due to the high number of out of vocabulary tokens (belonging to Hindi) in the data.

Discussion
In this paper, we present a Named Entity Recognition tool specifically targeting Hindi-English code-mixed content.To build our NER model, we also present a unique semi-supervised language identifier which exploits the character-level differences in languages.We validate the performance of our NER against off-the-shelf NER for Twitter and observe that our model outperforms them.
In future, we plan to explore building other downstream NLP tools such as Semantic Role Labeling or Entity-specific Sentiment Analyzers which make use of NER for code-mixed data.

Figure 2 :
Figure 2: Architecture for ELM and HLM.The figure shows one cell in each LSTM layer unrolled over time.

Figure 3 :
Figure 3: LSTM NER architecture.The figure shows one cell in each LSTM layer unrolled over time

Table 1 :
Performance of different transliteration models.Reported CER is in percentage and was calculated after a 5-fold cross validation.Depth was kept the same for both encoding and decoding layers.

Table 2 :
T , T clean , T norm , T wordhsape and affixes are passed through embedding layers which are randomly initialized and learnt during the training phase.All real-valued features are encoded into one-hot vectors of length 10, using percentile binning.The output (h t ) of the second RNN layer at each time step is passed to a separate fully connected Performance (F 1 scores) of different models on segmentation and classification.N is the total number of entities in the entire dataset.Reported numbers are in percentages.Results of CRF and LSTM are on 5-fold cross validation.

Table 3 :
Performance of different models at entity segmentation.All numbers are in percentages.