Named Entity Recognition for Hindi-English Code-Mixed Social Media Text

Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP) and a sub-task of Information Extraction. The challenge of NER for tweets lies in the limited information available in a single tweet. A significant amount of work has been done on entity extraction, but only for resource-rich languages and domains such as newswire. Entity extraction is, in general, a challenging task for such informal text, and code-mixed text further complicates the process with its unstructured and incomplete information. We propose experiments with different machine learning classification algorithms using word-, character- and lexical-level features. The algorithms we experimented with are Decision Tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF). In this paper, we present a corpus for NER in Hindi-English code-mixed text along with extensive experiments on our machine learning models, which achieved a best F1-score of 0.95 with both CRF and LSTM.


Introduction
Multilingual speakers often switch back and forth between languages when speaking or writing, mostly in informal settings. This language interchange involves complex grammar, and the terms "code-switching" and "code-mixing" are used to describe it Lipski. Code-mixing refers to the use of linguistic units from different languages in a single utterance or sentence, whereas code-switching refers to the co-occurrence of speech extracts belonging to two different grammatical systems Gumperz. As both phenomena are frequently observed on social media platforms in similar contexts, we consider only the code-mixing scenario in this work.
Following are some instances from a Twitter corpus of Hindi-English code-mixed texts, along with their English translations.
T1: "Finally India away series jeetne mein successful ho hi gayi :D"
Translation: "Finally India got success in winning the away series :D"

T2: "This is a big surprise that Rahul Gandhi congress ke naye president hain."
Translation: "This is a big surprise that Rahul Gandhi is the new president of Congress."

However, before delving further into code-mixed data, it is important to first address the complications in social media data itself. First, the shortness of micro-blogs makes them hard to interpret; consequently, ambiguity is a major problem, since semantic annotation methods cannot easily make use of co-reference information. Second, micro-texts exhibit much more language variation, tend to be less grammatical than longer posts, contain unorthodox capitalization, and make frequent use of emoticons, abbreviations and hashtags, which can form an important part of the meaning. Most of the research has, however, focused on resource-rich languages, such as English Sarkar, German Tjong Kim Sang and De Meulder, and French Azpeitia et al.

The structure of the paper is as follows. In Section 2, we review related research in the area of Named Entity Extraction on code-mixed social media texts. In Section 3, we describe the corpus creation and annotation scheme. In Section 4, we discuss the data statistics. In Section 5, we summarize our classification systems, including the pre-processing steps and the construction of the feature vectors. In Section 6, we present the results of experiments conducted using various character-, word-level and lexical features with different machine learning models. In the last section, we conclude the paper, followed by future work and the references.
Background and Related Work

Bali et al. analyzed data from Facebook posts generated by English-Hindi bilingual users and showed that a significant amount of code-mixing was present in the posts. Vyas et al. formalized the problem, created a POS-tag-annotated Hindi-English code-mixed corpus, and reported the challenges and problems in Hindi-English code-mixed text. They also performed experiments on language identification, transliteration, normalization and POS tagging of the dataset. Sharma et al. addressed the problem of shallow parsing of Hindi-English code-mixed social media text and developed a system that can identify the language of the words, normalize them to their standard forms, assign them their POS tags and segment them into chunks. Barman

Corpus and Annotation
The corpus we created for Hindi-English code-mixed tweets contains tweets from the last 8 years on topics such as politics, social events and sports, from an Indian-subcontinent perspective. The tweets were scraped from Twitter using the Twitter Python API, which uses Twitter's advanced search option. The tweets were mined using specific hashtags and retrieved in JSON format, which contains all the information about each tweet, such as timestamp, URL, text, user and replies. Extensive pre-processing (Section 5.4) was carried out to remove noisy and non-useful tweets. Noisy tweets are those which consist only of hashtags or URLs. Tweets in languages other than Hindi or English were also considered noisy and removed from the corpus. Furthermore, all tweets written entirely in English or in Devanagari script were removed as well, keeping only the code-mixed tweets. Further cleaning of the data was done in the annotation phase.
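The code-mixed filter described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the cue-word lists and the helper name are illustrative, not the actual filter used to build the corpus.

```python
import re

# Illustrative (not the actual) cue lists: a few common Romanized Hindi
# function words and English function words, used as language evidence.
HINDI_CUES = {"hai", "mein", "ke", "ki", "ka", "nahi", "kya", "ho", "to", "na"}
ENGLISH_CUES = {"the", "is", "of", "and", "this", "that", "in", "for", "was"}

def is_code_mixed(tweet: str) -> bool:
    """Keep a tweet only if it shows cues from both languages and
    contains no Devanagari characters (Romanized text only)."""
    if re.search(r"[\u0900-\u097F]", tweet):  # Devanagari script -> drop
        return False
    tokens = {t.lower() for t in re.findall(r"[a-zA-Z]+", tweet)}
    return bool(tokens & HINDI_CUES) and bool(tokens & ENGLISH_CUES)
```

A tweet such as T2 above would pass this filter (it contains both English function words and the Hindi cue "ke"), while a purely English or purely Devanagari tweet would not.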

Annotation: Named Entity Tagging
We annotate the tokens with three Named Entity tags, 'Person', 'Organization' and 'Location', which under the BIO standard become six NE tags (B-Tag referring to the beginning of a named entity and I-Tag to the inside of the entity), along with the 'Other' tag for all tokens that do not fall under any of the six NE tags.
The 'Per' tag refers to the 'Person' entity, covering names of persons, Twitter handles and common nicknames of people. 'B-Per' marks the beginning of a person name, and 'I-Per' its continuation when the name or reference is split across multiple continuous tokens. In example T3 we show an instance of the 'Per' tag in a tweet chosen from our corpus.
T3: "modi/B-Per ji/I-Per na/Other kya/Other de/Other rakha/Other hai/Other media/B-Org ko/Other ?/Other"
Translation: "What has modi ji given to media?"

T5: "saare/Other black/Other money/Other to/Other swiss/B-Org bank/I-Org mein/Other the/Other"
Translation: "all of the black money was in the swiss bank"

With these six NE tags and the seventh tag "Other", we annotated 3,638 tweets, which meant tagging 68,506 tokens. The annotated dataset, along with the classification system, is made available online. The distribution of the tags in the dataset is shown in Table 1.
Table 2: Inter Annotator Agreement.

Inter Annotator Agreement
Annotation of the dataset for NE tags was carried out by two human annotators with linguistic backgrounds and proficiency in both Hindi and English. To validate the quality of the annotation, we calculated the inter-annotator agreement (IAA) between the two annotation sets of 3,638 code-mixed tweets (68,506 tokens) using Cohen's Kappa coefficient Hallgren. Table 2 shows the results of the agreement analysis. We find that the agreement is significantly high. Furthermore, the agreement on 'I-Per' and 'I-Org' annotations is relatively lower than on 'I-Loc'; this is because of the presence of uncommon or confusing Organization and Person names with unclear context.
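Cohen's Kappa corrects observed agreement for agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). A minimal self-contained sketch (the function name and the toy sequences in the usage note are illustrative, not our annotation data):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two equal-length annotation sequences."""
    assert len(ann1) == len(ann2) and len(ann1) > 0
    n = len(ann1)
    # Observed agreement: fraction of tokens with identical tags.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected chance agreement from each annotator's tag marginals.
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[t] * c2[t] for t in set(ann1) | set(ann2)) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, two annotators who agree on 3 of 4 toy tags ("Per", "Other", "Other/Loc", "Loc") obtain an observed agreement of 0.75 but a kappa of about 0.64, since some of that agreement is expected by chance.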

Data statistics
Using the Twitter API we retrieved 110,231 tweets. After manually filtering as described in Section 3, we were left with 3,638 code-mixed tweets. This number is close to the size of the dataset provided by FIRE 2016, which introduced a shared task on entity recognition for code-mixed data. Table 1 shows the distribution of the different tags in the corpus. We use the standard CoNLL tags (Loc, Org, Per, Other) in the annotation stage. The Named Entity (NE) tags Person ("Per"), Organization ("Org") and Location ("Loc") are the ones we used to tag our corpus tokens. The 'Person' tag comprises names of famous people, politicians, actresses, sports personalities, news reporters and social media celebrities, as well as their Twitter handles and nicknames when used frequently and known to the annotator (like "Pappu" for Mr. Rahul Gandhi). The 'Organization' tag comprises names of social or political organizations as well as major groups present in India, e.g. Bhartiya Janta Party (BJP), Hindu, Muslim, Twitter, etc. The 'Location' tag comprises names of cities, towns, states and countries of the world; most of the location entities in the corpus are places in the Indian subcontinent. Tokens that do not fall under any of the mentioned tags are assigned the 'Other' tag.

System Architecture
In this section we explain the workings of the different machine learning algorithms we used in our experiments on the annotated dataset.

Decision Tree
The Decision Tree algorithm belongs to the family of supervised learning algorithms and can be used for solving both regression and classification problems. Szarvas et al. present a multilingual named entity recognition system using boosting and the C4.5 decision tree learning algorithm. The decision tree algorithm represents the problem as a tree: each internal node corresponds to an attribute, and each leaf node corresponds to a class label. To predict a class label for a record, we start from the root of the tree, compare the value of the root attribute with the record's attribute, follow the branch corresponding to that value, and jump to the next node, repeating until a leaf is reached. The primary challenge in a decision tree implementation is identifying which attribute to use at the root node and at each level; this is known as attribute selection. Several attribute selection measures exist for this purpose. The popular attribute selection measures are:

• Information gain

• Gini index
Information gain: Using information gain as a criterion, we estimate the information contained in each attribute. By calculating the entropy of each attribute we can calculate its information gain, which is the expected reduction in entropy due to splitting on that attribute:

Entropy(S) = - sum_x p(x) log2 p(x)

Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v)

where p(x) is the probability of class x in the set S, and S_v is the subset of S for which attribute A takes value v. The attribute yielding the lowest resulting entropy (equivalently, the highest information gain) is chosen for the root, and the process is repeated for attribute selection at the other levels.
Gini index: The Gini index measures how often a randomly chosen element would be incorrectly classified; an attribute with a lower Gini index should be preferred. It is calculated as:

Gini = 1 - sum_j (p_j)^2

where p_j is the probability of class j for the feature we are calculating the Gini index of.
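The two selection measures above can be computed in a few lines. This is a minimal sketch with illustrative function names, not the implementation used in our experiments:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_x p(x) log2 p(x) over the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where `groups` partitions `labels` by the values of attribute A."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
```

For a balanced two-class set the entropy is 1 bit and the Gini index is 0.5; a split that separates the classes perfectly has an information gain equal to the full entropy.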

Conditional Random Field (CRF)
For sequence labeling (or general structured prediction) tasks, it is beneficial to consider the correlations between labels in neighborhoods and jointly decode the best chain of labels for a given input sentence. For example, in POS tagging an adjective is more likely to be followed by a noun than a verb, and in NER with the standard BIO2 annotation (Tjong Kim Sang and Veenstra, 1999) I-ORG cannot follow I-PER. Therefore, we model the label sequence jointly using a conditional random field (CRF) (Lafferty et al., 2001), instead of decoding each label independently. Since the CRF scores the tag sequence at the sentence level rather than at individual positions, it generally produces higher tagging accuracy.
Say we are given a sequence of inputs X = (x_1, x_2, ..., x_m), the words of the sentence, and S = (s_1, s_2, ..., s_m), the sequence of output states, i.e. the named entity tags. In a conditional random field we model the conditional probability p(s_1, ..., s_m | x_1, ..., x_m), parameterized by a weight vector w. For the estimation of w, we assume that we have a set of n labelled examples (x^i, s^i), i = 1, ..., n. We define the regularized log-likelihood function L as

L(w) = sum_{i=1..n} log p(s^i | x^i; w) - (λ2 / 2) ||w||_2^2 - λ1 ||w||_1

The terms (λ2 / 2) ||w||_2^2 and λ1 ||w||_1 force the parameter vector to be small in the respective norm. This penalizes model complexity and is known as regularization; the parameters λ2 and λ1 allow us to enforce more or less regularization. The parameter vector w* is then estimated as

w* = argmax_w L(w)

Having estimated w*, we can find the most likely tag sequence s* for a given input sequence x by

s* = argmax_s p(s | x; w*)
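The decoding step s* = argmax_s p(s|x; w*) is computed efficiently with the Viterbi algorithm. Below is a minimal sketch over log-scores; the toy start, transition and emission tables are illustrative assumptions, not learned CRF weights:

```python
def viterbi(obs, states, start, trans, emit):
    """Return the highest-scoring state sequence for `obs`.
    start/trans/emit hold additive log-scores."""
    # best[s] = score of the best path ending in state s at this position
    best = {s: start[s] + emit[s][obs[0]] for s in states}
    back = []  # backpointers, one dict per position after the first
    for o in obs[1:]:
        ptr, nxt = {}, {}
        for s in states:
            # pick the best previous state to transition into s
            prev = max(states, key=lambda p: best[p] + trans[p][s])
            ptr[s] = prev
            nxt[s] = best[prev] + trans[prev][s] + emit[s][o]
        back.append(ptr)
        best = nxt
    # reconstruct the path from the best final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy tables (illustrative): I-Per is only reachable after B-Per.
STATES = ["B-Per", "I-Per", "Other"]
START = {"B-Per": 0.0, "I-Per": -10.0, "Other": 0.0}
TRANS = {"B-Per": {"B-Per": -5.0, "I-Per": 0.0, "Other": -2.0},
         "I-Per": {"B-Per": -5.0, "I-Per": -1.0, "Other": 0.0},
         "Other": {"B-Per": 0.0, "I-Per": -10.0, "Other": 0.0}}
EMIT = {"B-Per": {"modi": 0.0, "ji": -3.0, "kya": -5.0},
        "I-Per": {"modi": -5.0, "ji": 0.0, "kya": -5.0},
        "Other": {"modi": -2.0, "ji": -2.0, "kya": 0.0}}
```

With these toy scores, decoding the fragment "modi ji kya" from example T3 yields the sequence B-Per, I-Per, Other, illustrating how transition scores enforce that I-Tags follow B-Tags.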

LSTMs
Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take an input sequence of vectors (x_1, x_2, ..., x_n) and return another sequence (h_1, h_2, ..., h_n) that represents some information about the sequence at every step of the input. In theory RNNs can learn long-range dependencies, but in practice they fail to do so and tend to be biased towards the most recent inputs in the sequence Bengio et al.. Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. With our data, where tweets are not very long, LSTMs can give us better results, as keeping track of previous context is one of the specialties of LSTM networks. LSTM networks were first introduced by Hochreiter and Schmidhuber and were then refined and popularized by many other authors. They work well on a large variety of problems, especially those involving sequences, and are now widely used. They achieve this using several gates that control the proportion of the input to pass to the memory cell and the proportion of the previous state to forget.
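The gating mechanism can be illustrated as a single LSTM time step. This is a minimal NumPy sketch with assumed packed weight matrices (input, forget, output and candidate transforms stacked into 4h rows), not the network used in our experiments:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4h, d), U: (4h, h), b: (4h,).
    The gates control what enters the memory cell (input gate i),
    what is forgotten from the previous cell (forget gate f),
    and what is exposed as the hidden state (output gate o)."""
    h = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four transforms at once
    i = 1 / (1 + np.exp(-z[:h]))        # input gate, in (0, 1)
    f = 1 / (1 + np.exp(-z[h:2*h]))     # forget gate, in (0, 1)
    o = 1 / (1 + np.exp(-z[2*h:3*h]))   # output gate, in (0, 1)
    g = np.tanh(z[3*h:])                # candidate cell values
    c = f * c_prev + i * g              # update the memory cell
    return o * np.tanh(c), c            # new hidden state, new cell state
```

Because the forget gate f multiplies the previous cell state, the cell can carry context over many steps instead of being overwritten at each position, which is what lets LSTMs retain longer-range dependencies than plain RNNs.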

Pre-processing
This step is carried out to make the data uniform, which benefits our system. The pre-processing consists of the following steps.
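A minimal cleaning sketch is shown below. The exact pre-processing steps of this work are not reproduced here, so the operations (URL removal, whitespace normalization, tokenization) are illustrative assumptions:

```python
import re

def preprocess(tweet: str) -> list:
    """Illustrative tweet-cleaning sketch: strip URLs, collapse
    whitespace, and tokenize on spaces. Hashtag and mention words
    are kept, since they are informative for NER."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove URLs
    tweet = re.sub(r"\s+", " ", tweet).strip()   # normalize whitespace
    return tweet.split(" ")
```

Such a step yields one uniform token sequence per tweet for the feature extraction described next.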

Features
The feature set consists of word-, character- and lexical-level information: character N-grams of size 2 and 3 for suffixes; patterns for punctuation, emoticons, numbers and numbers inside strings; social-media-specific characters like '#' and '@'; and previous-tag information.
1. Char N-Grams: Character N-grams of size 2 and 3 capture suffix patterns of the tokens.
2. Word N-Grams: Bag-of-words features have been widely used for NER tasks in languages other than English Jahangir et al.. Thus we use word N-grams, taking the previous and the next word as features to train our model. These are also called contextual features.
3. Capitalization: It is a general trend when writing any language in Roman script that people start the names of persons, places or things with a capital letter von Däniken and Cieliebak, or capitalize an entire entity name for emphasis. This yields two binary features, one for starting with a capital letter and one for the entire word being capitalized.

4. Mentions and Hashtags: On Twitter, users generally address other people or organizations by their user names, which start with '@', and use '#' before a word to emphasize it or make it notable. Hence the presence of either of these two characters gives a good probability of the word being a named entity.

5. Numbers in String: In social media content, users often express legitimate vocabulary words in alphanumeric form to save typing effort, shorten message length, or express their style. Examples include abbreviated words like 'gr8' ('great'), 'b4' ('before'), etc. By analyzing the corpus we observed that alphanumeric words are generally not NEs. Therefore, this feature serves as a good indicator for recognizing negative examples.
6. Previous Word Tag: As mentioned for the word N-gram feature, context helps in deciding the tag of the current word; hence the previous tag helps in learning the tag of the current word, and I-Tags always come after B-Tags.
7. Common Symbols: It is observed that currency symbols and brackets like '(', '[', etc. are generally followed by numbers or some mention of little importance. Hence they are a good indicator that the words following or preceding them are not NEs.
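The features above can be combined into a per-token feature dictionary. This is a minimal sketch with illustrative feature names, not our actual feature extractor:

```python
import re

def token_features(token: str, prev_tag: str) -> dict:
    """Sketch of a per-token feature vector combining char N-gram,
    capitalization, mention/hashtag, number-in-string and
    previous-tag features (names are illustrative)."""
    return {
        "suffix2": token[-2:],                   # char N-gram, size 2
        "suffix3": token[-3:],                   # char N-gram, size 3
        "starts_capital": token[:1].isupper(),   # capitalization feature 1
        "all_caps": token.isupper() and len(token) > 1,  # feature 2
        "is_mention": token.startswith("@"),     # '@' mention
        "is_hashtag": token.startswith("#"),     # '#' hashtag
        # letter adjacent to a digit, as in 'gr8' or 'b4'
        "digit_in_word": bool(re.search(r"[a-zA-Z]\d|\d[a-zA-Z]", token)),
        "prev_tag": prev_tag,                    # previous word tag
    }
```

For the CRF and Decision Tree models, one such dictionary per token (optionally augmented with the neighboring words as contextual features) forms the input feature vector.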

Experiments
This section presents the experiments we performed with different combinations of features and systems.

Feature and parameter experiments
To determine the effect of each feature and of the parameters of the different models, we performed several experiments, using some subsets of the feature vectors at a time as well as all of them together, while simultaneously varying the model parameters: the criterion ('information gain', 'gini') and the maximum tree depth for the Decision Tree model; the optimization algorithms and loss functions for the LSTM; and the regularization parameters and optimization algorithms for the CRF, such as 'L-BFGS', 'L2 regularization', 'Avg. Perceptron', etc. For all the models mentioned above we validated our classification models with 5-fold cross-validation. Table 3 shows the experimental results for the Decision Tree model with maximum depth 32, which we arrived at after careful empirical tuning. Tables 4 and 5 present the experiments on the CRF model. The c1 and c2 parameters of the CRF model refer to L1 and L2 regularization, respectively; these two regularization parameters restrict our estimation of w*, as mentioned in Section 5.1. When we experimented with 'L2 regularization' or 'Average Perceptron' as the optimization algorithm, there was no significant change in the results, either per class or overall. We arrived at these values of c1 and c2 after careful empirical tuning; Tables 4 and 5 reflect this observation.
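The 5-fold cross-validation used throughout can be sketched as a simple index split; the helper name is illustrative:

```python
def k_fold_indices(n: int, k: int = 5):
    """Yield (train, validation) index lists for k-fold cross-validation:
    the n examples are split into k contiguous folds, and each fold
    serves once as the validation set."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

Each parameter setting is then scored by averaging the metric over the k validation folds, so that no tweet is ever evaluated by a model that saw it during training.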
Next we move to our experiments with the LSTM model. Here we experimented with the optimizer and activation function, along with the number of units and the number of epochs. The best result was obtained using 'softmax' as the activation function, 'adam' as the optimizer and sparse categorical cross-entropy as the loss function. Table 7 shows the statistics of running the LSTM on our dataset with 5-fold cross-validation and a validation split of 0.2, using the character, word and lexical feature set of our tokens. Table 6 shows one prediction instance of our LSTM model.

Results and Discussion
From the above results we can say that our system learns the specific NE types from the structure of the text; from Table 6 we can see that the system performs well, tagging most tokens correctly.
We also observe that our system gets confused by 'Org' names that resemble names of locations: 'America', for example, is tagged as 'B-Org' because our system has seen many 'American' tokens tagged as 'B-Org'.
From the example in Table 11 we can see that our system learns to tag tokens starting with '#' as the beginning of an NE, but most of the time tags them as 'B-Per', which is a problem. Our model needs to learn more generic details about these specific characters.
For the 'Loc' and 'Other' tags our system works well, giving accurate predictions. The presence of confusing 'Location', 'Organization' and 'Person' names in our corpus makes it difficult for our machine learning models to learn the proper tags for these names. For example, 'Hindustan' is labeled 'B-Loc' in our annotation while 'Hindustani' is labeled 'B-Org', as the former is one of the names of the country India and the latter represents its citizens, making it a group representation, a category we used for Organization during annotation. Hence lexically similar words with different tags make the learning phase of our model difficult and lead to some incorrect tagging of tokens, as seen in Table 6.

Conclusion and future work
In this paper, we presented a freely available corpus of Hindi-English code-mixed text, consisting of tweet IDs and the corresponding annotations, along with NER systems for this dataset and experimental analysis and results. We first explained the reasons for selecting features specific to this task, while evaluating our results on different machine learning classification models. The Decision Tree, CRF and LSTM models achieved best individual F1-scores of 0.94, 0.95 and 0.95 respectively, which is encouraging given that little research has been done in this domain.
To make the predictions and the models' results more significant, the size of the corpus needs to be expanded. Our corpus has just 3,638 tweets, and due to the unavailability of Hindi-English code-mixed datasets it is difficult to obtain a large corpus for our system.
Our contributions in this paper include the following:

1. An annotated corpus for Hindi-English code-mixed text, the kind of which is not available elsewhere on the internet.

2. Introduction and treatment of Hindi-English code-mixed data as a research problem.

3. A proposal of suitable features targeted towards this task.

4. Different models that deal with sequential tagging and multi-class classification.

5. Machine learning models developed on our annotated corpus for the NE task.
As part of future work, the corpus can be annotated with part-of-speech tags at the word level, which may yield better results. Moreover, the dataset contains very limited tweets with NE tokens; it can thus be extended to include more tweets containing these specific NE tokens, as well as a larger number of tags on the existing corpus. The annotations and experiments described in this paper can also be carried out in future for code-mixed texts containing more than two languages from multilingual societies.