Uncovering Code-Mixed Challenges: A Framework for Linguistically Driven Question Generation and Neural Based Question Answering

Existing research on question answering (QA) and reading comprehension (RC) is mainly focused on resource-rich languages such as English. In recent times, the rapid growth of multilingual web content has posed several challenges to existing QA systems. Code-mixing is one such challenge that makes the task more complex. In this paper, we propose a linguistically motivated technique for code-mixed question generation (CMQG) and a neural network based architecture for code-mixed question answering (CMQA). For evaluation, we manually create code-mixed questions for the Hindi-English language pair. To show the effectiveness of our neural network based CMQA technique, we utilize two benchmark datasets, SQuAD and MMQA. Experiments show that our proposed model achieves encouraging performance on CMQG and CMQA.


Introduction
Multilingual people often switch back and forth between their native languages and foreign (popular) languages to express themselves on the web. This is very common nowadays, particularly when people express their opinions (or communicate in any way) through various social media platforms. This phenomenon of embedding the morphemes, words, phrases, etc. of one language into another is popularly termed code-mixing (CM) (Myers-Scotton, 1997, 2002). A recent study (Safran, 2015) has uncovered that users frequently use question patterns, namely 'how' (38%), 'why' (24%), 'where' (15%), 'what' (11%), and 'which' (12%), in their queries as opposed to a 'statement query'. Presently, search engines have become intelligent and are capable of providing a precise answer to a natural language query/question. Several virtual assistants such as Siri, Cortana, Alexa, Google Assistant, etc. are also equipped with these facilities. However, these search engines and virtual assistants are efficient only in handling queries written in English. Let us consider the following two representations (English and code-mixed) of the same question.
(i) Q: "Who is the foreign secretary of USA?"
(ii) Q: "USA ke foreign secretary koun hai?" (Trans: "Who is the foreign secretary of USA?")
Search engines are able to provide the exact answer to the first question. Although both questions are the same, the search engine is unable to return the exact answer for the second question, which is code-mixed in nature; it instead returns the most relevant web pages.
* Work carried out during the internship at IIT Patna.
In this paper, we propose a framework for code-mixed question generation (CMQG) as well as code-mixed question answering (CMQA) involving English and Hindi. First, we propose a linguistically motivated technique for generating code-mixed questions. We followed this approach because we did not have access to any labeled data for code-mixed question generation. Thereafter, we propose an effective framework based on a deep neural network for code-mixed question answering (CMQA). In our proposed CMQA technique, we use multiple attention based recurrent units to represent the code-mixed questions and the English passages. Finally, our answer-type focused network (attentive towards the answer-type of the question being asked) extracts the answer for a given code-mixed question. We summarize our contributions as follows: (i) We propose a linguistically motivated unsupervised algorithm for code-mixed question generation (CMQG). (ii) We propose a bilinear attention and answer-type focused neural framework to deal with CMQA. (iii) We create two CMQA datasets to further explore the research on CMQA. In addition to this, we manually create a code-mixed question dataset, and subsequently a code-mixed question classification dataset. (iv) We provide a state-of-the-art setup to extract answers from the English passages for the corresponding code-mixed questions. The source code of our proposed systems and the datasets are made available.

Related Work
Code-mixing refers to the mixing of more than one language in the same sentence. Creating resources and tools capable of handling code-mixed languages is more challenging than traditional language processing activities that are concerned with only one language. In recent times, researchers have started investigating methods for creating tools and resources for various Natural Language Processing (NLP) applications involving code-mixed languages. Some of the applications include language identification (Chittaranjan et al., 2014; Barman et al., 2014), part-of-speech (PoS) tagging (Jamatia et al., 2015; Gupta et al., 2017), question classification (Raghavi et al., 2015), entity extraction (Gupta et al., 2018a, 2016b), sentiment analysis (Rudra et al., 2016; Gupta et al., 2016a), etc. Developing a QA system in a code-mixed scenario is itself novel, in the sense that there have been few significant attempts in this direction, with exceptions such as Chandu et al. (2017). Our literature survey shows that the existing methods of (general) question generation include both rule-based (Heilman and Smith, 2010; Ali et al., 2010) and machine learning (Serban et al., 2016; Wang et al., 2017a) techniques. A joint model of question generation and answering based on a sequence-to-sequence neural network model is proposed in Wang et al. (2017a).
In recent times, there have been several studies on deep learning based reading comprehension/QA (Hermann et al., 2015; Cui et al., 2017; Shen et al., 2017; Wang et al., 2017b; Gupta et al., 2018c; Wang and Jiang, 2016; Berant et al., 2014; Maitra et al., 2018; Cheng et al., 2016; Trischler et al., 2016). To the best of our knowledge, this is the very first attempt to automatically generate code-mixed questions (i.e. question generation), as well as to provide a robust solution by developing an end-to-end neural network model for CMQA.

Code-Mixed Question Generation
We focus on a code-mixed scenario involving two languages, viz. English and Hindi. Due to the scarcity of labeled data, we could not employ any sophisticated machine learning technique for question generation. Rather, we propose an unsupervised algorithm that automatically formulates the code-mixed questions. The algorithm makes use of several NLP components such as a PoS tagger, transliteration and lexical translation. We construct a Hindi-English code-mixed question from a given Hindi question. Let us consider three questions Q 1, Q 2 and Q 3 that are the same but are asked in English, Hindi and code-mixed English-Hindi, respectively. It can be seen that Q 2 and Q 3 are similar and share many false cognates (Moss, 1992) [(Seattle, सिएटल), (naam, नाम), (kya, क्या), (mai, में), (baseball, बेसबॉल)]. The question Q 3 has the direct transliteration of the Hindi words (सिएटल → Seattle), (नाम → naam), (क्या → kya), (में → mai) and (बेसबॉल → baseball). There are some words (e.g. 'team') in Q 3 which are the English lexical translations from Hindi. We perform a thorough study of Hindi sentences and their corresponding code-mixed Hindi-English sentences, and observe the following:
(1) Named entities (NEs) of type person [...] The underlined words in the Hindi sentence have noun (NN) PoS tags; therefore the corresponding words are replaced with their best lexical translations in the respective code-mixed sentence.
(4) The remaining words in the Hindi sentence are transliterated (in English) and a code-mixed EN-HI sentence is formed. For example, the remaining words of the previous Hindi sentence are transliterated (in underlines) and the code-mixed EN-HI sentence is formed.

Code-Mixed (EN-HI): Individuals ko unki creativity ke liye koun se rights diye jate hain?
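The noun-translation and transliteration rules above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the Hindi input sentence, its PoS tags, and both lookup tables are hand-written toy data (hypothetical, not the output of the actual tagger, translation table, or transliteration system), and the named-entity rule is left out.

```python
# Each row: (Hindi token, assumed PoS tag, assumed lexical translation or None,
#            assumed Roman transliteration). All values are hand-written toy data.
ROWS = [
    ("व्यक्तियों", "NN", "individuals", "vyaktiyon"),
    ("को", "PSP", None, "ko"),
    ("उनकी", "PRP", None, "unki"),
    ("रचनात्मकता", "NN", "creativity", "rachnatmakta"),
    ("के", "PSP", None, "ke"),
    ("लिए", "PSP", None, "liye"),
    ("कौन", "WQ", None, "koun"),
    ("से", "PSP", None, "se"),
    ("अधिकार", "NN", "rights", "adhikar"),
    ("दिए", "VM", None, "diye"),
    ("जाते", "VAUX", None, "jate"),
    ("हैं", "VAUX", None, "hain"),
]

# Toy stand-ins for the lexical translation table and the transliteration system.
TRANSLATE = {tok: tr for tok, _, tr, _ in ROWS if tr is not None}
TRANSLITERATE = {tok: rom for tok, _, _, rom in ROWS}


def to_code_mixed(tagged_tokens, translate, transliterate):
    """Replace nouns by their lexical translation; transliterate everything else."""
    words = []
    for token, tag in tagged_tokens:
        if tag.startswith("NN") and token in translate:
            words.append(translate[token])  # noun: use English translation
        else:
            words.append(transliterate.get(token, token))  # rest: Roman script
    if not words:
        return ""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "?"


tagged = [(tok, tag) for tok, tag, _, _ in ROWS]
code_mixed = to_code_mixed(tagged, TRANSLATE, TRANSLITERATE)
print(code_mixed)
```

In the full algorithm the translation of each noun is not a fixed dictionary entry but is chosen among several candidates by the disambiguation step.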
The main challenge of automatic CMQG is to find the lexical translation that best suits the context of the particular question. Let us consider the lexical translation choices $tr_i = \{tr_{(i,1)}, tr_{(i,2)}, \ldots, tr_{(i,l_i)}\}$ for the token $t_i$, where $l_i$ is the number of lexical translations available for $t_i$. The lexical translation disambiguation algorithm selects the most probable lexical translation of token $t_i$ from the set of $l_i$ possible translations. We generate a query by adding the previous token $t_{i-1}$ and the next token $t_{i+1}$ to the token of interest $t_i$. The context within a query provides important clues for choosing the right translation of a given query word. For example, consider the query S = {शहर, पूर्व, स्कॉटलैंड} (Trans: {city, east, Scotland}), where 'पूर्व' is the word of interest for which the most probable lexical translation needs to be identified from the list {BC, East}.
Here, based on the context, we can see that the right translation for the word 'पूर्व' is 'east', since the combinations {city, east} and {east, Scotland} are more likely to co-occur in the corpus than {city, BC} and {BC, Scotland}. We follow the iterative disambiguation algorithm (Monz and Dorr, 2005), which examines pairs of items to gather partial evidence for the likelihood of a translation in a given context. A co-occurrence graph is constructed using the query S and the translation set TR, such that the translation candidates of different query terms are connected by an edge weighted with the Dice coefficient between them. At the same time, it is ensured that there is no edge between different translation candidates of the same query term. We initialize the translation candidates of each query term with equal likelihood. After initialization, the weight of each translation candidate is iteratively updated using the weights of the translation candidates linked to it and the weights of the links connecting them. At the end of each iteration, the weights of the translation candidates of a query term are normalized so that they sum to 1.
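The disambiguation procedure described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the update rule (old weight plus the link-weighted sum of neighbouring candidate weights) is one common reading of Monz and Dorr (2005), and the unigram and bigram counts are invented toy numbers.

```python
def dice(pair_count, count_a, count_b):
    """Dice coefficient 2*f(a,b) / (f(a) + f(b)); 0 when both counts are 0."""
    total = count_a + count_b
    return 2.0 * pair_count / total if total else 0.0


def disambiguate(candidates, unigram, bigram, iterations=10):
    """candidates maps each query term to its list of translation candidates."""

    def link(a, b):
        # Edge weight between candidates of *different* query terms.
        pair = bigram.get((a, b), 0) + bigram.get((b, a), 0)
        return dice(pair, unigram.get(a, 0), unigram.get(b, 0))

    # Equal initial likelihood for the candidates of each query term.
    weights = {s: {t: 1.0 / len(ts) for t in ts} for s, ts in candidates.items()}

    for _ in range(iterations):
        updated = {}
        for s, ts in candidates.items():
            scores = {}
            for t in ts:
                # Evidence from candidates of the other query terms only.
                support = sum(
                    link(t, t2) * weights[s2][t2]
                    for s2, ts2 in candidates.items() if s2 != s
                    for t2 in ts2
                )
                scores[t] = weights[s][t] + support
            norm = sum(scores.values())
            updated[s] = {t: v / norm for t, v in scores.items()}  # sums to 1
        weights = updated

    best = {s: max(w, key=w.get) for s, w in weights.items()}
    return best, weights


# Toy query and invented corpus counts (assumptions for illustration only).
QUERY = ["शहर", "पूर्व", "स्कॉटलैंड"]
candidates = {QUERY[0]: ["city"], QUERY[1]: ["BC", "east"], QUERY[2]: ["Scotland"]}
unigram = {"city": 100, "east": 80, "BC": 40, "Scotland": 50}
bigram = {("city", "east"): 10, ("east", "Scotland"): 12, ("city", "BC"): 1}

best, weights = disambiguate(candidates, unigram, bigram)
print(best[QUERY[1]])
```

With these toy counts the candidate 'east' accumulates more co-occurrence evidence than 'BC', which mirrors the example in the text.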

Proposed Approach for CMQA
We are given a code-mixed question Q with tokens $\{q_1, q_2, \ldots, q_m\}$ and an English passage P with tokens $\{p_1, p_2, \ldots, p_n\}$, where m and n are the number of tokens in the question and the passage, respectively. The task is to identify the answer A. Each model component is described below:

Token and Sequence Encoding
From the given code-mixed question Q and passage P, we first obtain the respective token-level embeddings $\{x^{Q}_{t}\}_{t=1}^{m}$ and $\{x^{P}_{t}\}_{t=1}^{n}$ from the pre-trained word embedding matrix. Due to the code-mixed nature of the input, our model faces the out-of-vocabulary (OOV) word issue. To tackle this, we adopt character-level embeddings to represent each token of the question and passage. These are denoted by $\{c^{Q}_{t}\}_{t=1}^{m}$ and $\{c^{P}_{t}\}_{t=1}^{n}$ for the question and passage, respectively. The character-level embeddings are generated by taking the final hidden states of a bi-directional gated recurrent unit (Bi-GRU) (Chung et al., 2014) applied to the character embeddings of the tokens. The final representations $u^{Q}_{t}$ and $u^{P}_{t}$ of each token of the question and passage, respectively, are obtained by passing the token-level and character-level embeddings, joined by the concatenation operator ⊕, through a Bi-GRU (Eq. 1). In order to encode the token sequence, we apply convolution followed by a Bi-GRU operation. First, the convolution operation is performed on the zero-padded sequence $\bar{u}^{P}$ of the passage sequence $u^{P}$, where $\bar{u}^{P}_{t} \in \mathbb{R}^{d}$. A set of k filters $F \in \mathbb{R}^{k \times l \times d}$ is applied to the sequence, yielding the convolved features $c^{P}_{t}$ at each time step $t = 1, 2, \ldots, n$.
The feature vector $\bar{C}^{P} = [\bar{c}^{P}_{1}, \bar{c}^{P}_{2}, \ldots, \bar{c}^{P}_{n}]$ is generated by applying max pooling on each element $c^{P}_{t}$ of $C^{P}$. This sequence of convolution feature vectors $\bar{C}^{P}$ is passed through a Bi-GRU network. The same convolution operations are also performed over the question sequence $u^{Q}$, and the convolution feature vector $\bar{C}^{Q}$ is obtained. Similar to Eq. 1, we compute the Bi-GRU outputs over $\bar{C}^{Q}$ and $\bar{C}^{P}$. We denote the question and passage representation matrices by $V^{Q} \in \mathbb{R}^{m \times h}$ and $V^{P} \in \mathbb{R}^{n \times h}$, respectively, where h is the number of hidden units of the Bi-GRUs.
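Written out, the encoding steps of this section could take the following form. This is a sketch that is consistent with the definitions above rather than a verified copy of the original formulas: the recurrent argument of the Bi-GRU, the tanh non-linearity, and the alignment of the window of width l are assumptions.

```latex
\begin{align}
u^{Q}_{t} &= \text{Bi-GRU}\big(u^{Q}_{t-1},\; x^{Q}_{t} \oplus c^{Q}_{t}\big), \qquad
u^{P}_{t} = \text{Bi-GRU}\big(u^{P}_{t-1},\; x^{P}_{t} \oplus c^{P}_{t}\big) \\
c^{P}_{t} &= \tanh\big(F \cdot [\bar{u}^{P}_{t}, \bar{u}^{P}_{t+1}, \ldots, \bar{u}^{P}_{t+l-1}]\big) \\
v^{P}_{t} &= \text{Bi-GRU}\big(v^{P}_{t-1},\; \bar{c}^{P}_{t}\big), \qquad
v^{Q}_{t} = \text{Bi-GRU}\big(v^{Q}_{t-1},\; \bar{c}^{Q}_{t}\big)
\end{align}
```

Here the first line combines the word-level and character-level embeddings, the second applies the k filters to l consecutive zero-padded token vectors, and the third produces the rows of $V^{P}$ and $V^{Q}$ from the max-pooled convolution features.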

Question-aware Passage Encoding
When a single passage contains the answers to two or more different questions, the passage encoding obtained from the previous layer (c.f. section 4.1) is not effective enough to provide the answer to each question, because it does not take the question information into account. In this layer, we first compute an attention matrix $M \in \mathbb{R}^{n \times m}$ (Eq. 3), where $M_{i,j}$ is the similarity score between the i-th element of the passage encoding $V^{P}$ and the j-th element of the question encoding $V^{Q}$. The score is based on dist(x, y), the Euclidean distance between x and y.
Thereafter, each element $M_{i,j}$ of the matrix M is normalized with respect to the i-th row (Eq. 4).
Intuitively, this calculates the relevance of a word in the given passage to each word in the question. We compute the question vector $Q \in \mathbb{R}^{n \times h}$ corresponding to all the words in the passage as $Q = M \times V^{Q}$. Each row t of the question vector Q denotes the encoding of the passage word t with respect to all the words in the question. The question-aware passage encoding is computed by the word-level concatenation of the passage encoding $v^{P}_{t}$ and the t-th row $Q_t$ of the question vector. More formally, the question-aware passage encoding of the word at time t is $a_t = v^{P}_{t} \oplus Q_t$. Finally, we apply a Bi-GRU over the sequence $a_1, \ldots, a_n$ to encode the question-aware information over time, and denote the resulting question-aware passage encoding matrix by $S \in \mathbb{R}^{n \times h}$.
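The attention step of this layer can be sketched in plain Python as follows. The exact similarity and normalization formulas are not reproduced here, so two choices are assumptions: the similarity is taken as 1 / (1 + dist), which turns a Euclidean distance into a score that grows as vectors get closer, and the row normalization is a softmax. The final Bi-GRU over the concatenated vectors is omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def attention_matrix(V_P, V_Q):
    """n x m matrix of row-normalized similarities (assumed form 1 / (1 + dist))."""
    raw = [[1.0 / (1.0 + math.dist(p, q)) for q in V_Q] for p in V_P]
    return [softmax(row) for row in raw]


def question_aware_inputs(V_P, V_Q):
    """Return a_t = v_t^P (+) Q_t for every passage position t, with Q = M x V_Q."""
    M = attention_matrix(V_P, V_Q)
    h = len(V_Q[0])
    Q = [[sum(M[i][j] * V_Q[j][d] for j in range(len(V_Q))) for d in range(h)]
         for i in range(len(V_P))]
    return [v_p + q_row for v_p, q_row in zip(V_P, Q)]  # list concat plays the role of (+)


# Toy encodings: n = 3 passage tokens, m = 2 question tokens, h = 2.
V_P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_Q = [[1.0, 0.0], [0.0, 1.0]]
M = attention_matrix(V_P, V_Q)
A = question_aware_inputs(V_P, V_Q)
```

Each row of `A` has length 2h, matching the concatenation of a passage vector with its attended question vector.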

Bilinear Attention on Passage
The question-aware passage encoding accounts for the relevance of the words in a question to the given passage. If the answer spans more than one token (i.e. it is a multi-word answer), it is important to compute the relevance between the constituents of the multi-word span. We calculate the bilinear attention matrix $B \in \mathbb{R}^{n \times n}$ on the question-aware passage encoding $S \in \mathbb{R}^{n \times h}$ as $B = S W_b S^{\top}$, where $W_b \in \mathbb{R}^{h \times h}$ is a bilinear weight matrix. Similar to Eq. 4, normalization is performed on B, and the normalized attention matrix is denoted by $\bar{B}$. The element $\bar{B}_{i,j}$ is the measure of relevance between the i-th and j-th words of the passage. Similar to the question vector Q, we calculate the passage vector $R \in \mathbb{R}^{n \times h}$ as $R = \bar{B} \times S$. The word-wise concatenation of the question-dependent passage encoding vector $s_t$ and the passage vector $r_t$ is performed to obtain $R'_t$ and form the matrix $R' \in \mathbb{R}^{n \times 2h}$. Similar to Wang et al. (2017b), we introduce a gating mechanism to control the impact of $R'$ and denote its output by $G \in \mathbb{R}^{n \times 2h}$. In order to identify the start and end indices of the answer in the passage, we employ two Bi-GRUs with G as input. Similar to Eq. 1, the outputs of the Bi-GRUs are computed as $P_s \in \mathbb{R}^{n \times h}$ and $P_e \in \mathbb{R}^{n \times h}$.
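The normalization, passage vector and gate of this module could be written as follows. This is a hedged sketch: the softmax form of the row normalization and the sigmoid gate with a weight matrix $W_g \in \mathbb{R}^{2h \times 2h}$ (a hypothetical name, in the style of the gate of Wang et al. (2017b)) are assumptions, and $R'$ denotes the $n \times 2h$ matrix formed by the word-wise concatenation.

```latex
\begin{align}
\bar{B}_{i,j} &= \frac{\exp(B_{i,j})}{\sum_{k=1}^{n} \exp(B_{i,k})} \\
R &= \bar{B}\, S, \qquad R'_{t} = s_{t} \oplus r_{t} \\
g_{t} &= \sigma\big(W_{g} R'_{t}\big), \qquad G_{t} = g_{t} \odot R'_{t}
\end{align}
```

The gate $g_t$ lies in $(0, 1)^{2h}$, so each component of $R'_t$ is scaled down according to how relevant it is judged to be before the start and end Bi-GRUs read it.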

Answer-type Focused Answer Extraction
The answer-type of a question provides clues for detecting the correct answer in the passage; for example, the answer-type of a question Q may be 'person'. Even though the network has the capacity to capture this information to a certain degree, it is better if the model takes this information into account in advance while selecting the answer span. Li and Roth (2002) proposed a hierarchical question classification based on the answer-type of a question. Based on the coarse and fine classes of Li and Roth (2002), we train two separate answer-type detection networks on the Text REtrieval Conference (TREC) question classification dataset. First, we translate 5,952 TREC English questions into Hindi and thereafter transform the Hindi questions into code-mixed questions by using our proposed CMQG algorithm. We train the answer-type detection network with the code-mixed questions and their associated labels using the technique discussed in Gupta et al. (2018b). The network learns the encodings of the coarse ($C_{at} \in \mathbb{R}^{h}$) and fine ($F_{at} \in \mathbb{R}^{h}$) classes of answer-types obtained from the answer-type detection network. The attention matrix M calculated in Eq. 3 undergoes max pooling over the columns to capture the most relevant parts of the question.
The max-pooled question representation is concatenated with the answer-type representations, and a feed-forward neural network with tanh activation function is applied to obtain the final output $Q_f \in \mathbb{R}^{h}$. The probability distributions of the beginning of the answer, $A_s$, and the end of the answer, $A_e$, are then computed over the passage tokens. To train the network, we minimize the sum of the negative log probabilities of the ground-truth start and end positions under the predicted probability distributions.
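The training objective stated above, the sum of the negative log probabilities of the true start and end positions, can be illustrated for a single example. In this sketch the per-position scores are assumed to be turned into distributions with a softmax; how the model actually produces those scores is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over passage positions."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def span_loss(start_scores, end_scores, true_start, true_end):
    """-(log P_start[true_start] + log P_end[true_end]) for one example."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    return -(math.log(p_start[true_start]) + math.log(p_end[true_end]))


# Uninformative scores over a 4-token passage versus scores peaked on the truth.
uniform_loss = span_loss([0.0] * 4, [0.0] * 4, true_start=1, true_end=2)
peaked_loss = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
```

A batch loss would average `span_loss` over the examples; the loss shrinks as probability mass concentrates on the correct start and end indices.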

Datasets and Experiments
In this section, we report the datasets and the experimental setups.

Datasets (CMQG)
For the CMQG task, we require the input question to be in Hindi. We use the manually created Hindi questions obtained from the Hindi-English question answering dataset (Gupta et al., 2018b). We generate the code-mixed questions by our proposed approach (c.f. Section 3). In order to evaluate the performance of our proposed CMQG algorithm, we also manually formulate the Hindi-English code-mixed questions. Details of this dataset are shown in Table 1. We compute the complexity of code-mixing using the Code-mixing Index (CMI) score (Gambäck and Das, 2014). We name this code-mixed question dataset 'HinglishQue'. We observe that our HinglishQue dataset has a higher CMI score than the FIRE 2015 (CMI=11.65) and ICON 2015 (CMI=5.73) CM corpora (Soumil Mandal and Das, 2018). This implies that our HinglishQue dataset is more complex and challenging than the other Hindi-English code-mixing (CM) datasets. The CMI score of the system-generated code-mixed questions is 37.22.
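For reference, the utterance-level CMI is commonly defined as 100 × (1 − max_i w_i / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max_i w_i the token count of the dominant language; it is 0 when n = u. The sketch below computes it for one sentence. The per-token language tags are assigned by hand for illustration, and the tag name `univ` for language-independent tokens is a hypothetical label.

```python
from collections import Counter


def cmi(lang_tags, neutral="univ"):
    """Utterance-level Code-Mixing Index from per-token language tags."""
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())  # language-independent tokens
    if n == u:
        return 0.0  # no language-specific tokens at all
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


# "Individuals ko unki creativity ke liye koun se rights diye jate hain"
tags = ["en", "hi", "hi", "en", "hi", "hi", "hi", "hi", "en", "hi", "hi", "hi"]
score = cmi(tags)
```

A corpus-level figure is typically obtained by averaging this value over all utterances; a monolingual utterance scores 0.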

Datasets (CMQA)
(1) CM-SQuAD: We generate the CMQA dataset from a portion of the SQuAD dataset (Rajpurkar et al., 2016): we translate its English questions into Hindi and use our CMQG approach (c.f. Section 3) to transform the Hindi questions into code-mixed questions. We manually verify the questions to ensure their quality. We use the corresponding English passage and answer for each code-mixed question. Detailed statistics of the dataset are shown in Table 2. We randomly split the dataset into training, development and test sets.
(2) CM-MMQA: We experiment with a recently released multilingual QA dataset (Gupta et al., 2018b). It provides Hindi questions along with their English passages. Similar to the CM-SQuAD dataset, we create code-mixed questions and their respective answer pairs. Details of the dataset are shown in Table 2.

Experimental Setup for CMQG
Tokenization and PoS tagging are performed using the publicly available Hindi Shallow Parser. The Polyglot Named Entity Recognizer (NER) (Al-Rfou et al., 2015) is used for named entity recognition. The lexical translation set is obtained from the lexical translation table generated as an intermediate output of Statistical Machine Translation (SMT) training by Moses (Koehn et al., 2007) on a publicly available English-Hindi (EN-HI) parallel corpus (Bojar et al., 2014). We aggregate the output probability p(e|h) and the inverse probability p(h|e) along with their associated words in both English (e) and Hindi (h). We choose a threshold (5) to filter out the least probable translations. The co-occurrence weight (Dice coefficient) is calculated on an available n-gram dataset consisting of 2,86,358 unique bigrams and 3,33,333 unique unigrams. For Devanagari (Hindi) to Roman (English) transliteration, we use the transliteration system based on Ekbal et al. (2006). We evaluate the performance of CMQG in terms of accuracy, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) scores.

Experimental Setup for CMQA
The CMQA datasets contain words both in Roman script and English. For English, we use the fastText (Bojanowski et al., 2016) word embeddings of dimension 300. We use the Hindi sentences from Bojar et al. (2014) and transliterate them into the Roman script. These sentences are used to train word embeddings of dimension 300 with the same word embedding algorithm (Bojanowski et al., 2016). Finally, we align the monolingual vectors of English and Roman words in a unified vector space using a linear transformation matrix learned by the approach discussed in Smith et al. (2017). The other optimal hyper-parameters are set to: character embedding dimension=50, GRU hidden unit size=150, CNN filter size=150, filter size=3, 4, batch size=60, # of epochs=100, initial learning rate=0.001. The optimal values of the hyper-parameters are decided based on the model performance on the development set of the CM-SQuAD dataset. The Adam optimizer (Kingma and Ba, 2014) is used to optimize the weights during training. For the evaluation of CMQA, we adopt the exact match (EM) and F1-score (Rajpurkar et al., 2016).
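The two CMQA metrics can be sketched in the style of the SQuAD evaluation of Rajpurkar et al. (2016): both answers are normalized (lower-casing, removal of punctuation and English articles), EM checks string equality, and F1 is computed over the overlapping tokens. The function names below are illustrative, and the example strings are made up rather than taken from the datasets.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lower-case, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Bharata", "Bharata")
f1 = f1_score("the Bharata clan", "Bharata")
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.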

Baselines (CMQG)
We cast the problem of code-mixed question generation as sequence-to-sequence learning, where the input sequence is a Hindi question and the output sequence is the code-mixed EN-HI question. A seq2seq network with attention (Sutskever et al., 2014; Bahdanau et al., 2014) is trained using the default parameters of Nematus (Sennrich et al., 2017). The pairs of Hindi translated questions and code-mixed questions from the CM-SQuAD dataset (c.f. Section 5.2) are used for training the seq2seq network. We evaluate the network on the manually created CMQG dataset (c.f. Section 5.1).

3) BiDAF (Seo et al., 2016):
This is another state-of-the-art neural model for RC. We trained this model with the same hyperparameters as given in (Seo et al., 2016).

Results and Analysis
We report the evaluation results of our proposed CMQG algorithm on the HinglishQue dataset in Table 4. For evaluation, we employed three annotators who were instructed to assign a label (same or different) depending on whether the system-generated and manually created questions are similar or dissimilar. The agreement among the annotators was calculated using Cohen's Kappa (Cohen, 1960). The results show that our proposed CMQG algorithm performs better than the seq2seq based baseline. One reason for the weaker baseline could be the insufficient number (16,632) of training instances together with the out-of-vocabulary issue (only 62.35% of the words are available in the vocabulary). The performance improvement of our proposed model over the baseline is statistically significant (p < 0.05). In the literature, we find only one study on English-Hindi code-mixed question classification, i.e. Raghavi et al. (2015). They used only 1,000 code-mixed questions and a Support Vector Machine (SVM) to classify the questions into coarse and fine-grained answer-types. They reported accuracies of 63% and 45% for coarse and fine-grained answer-type detection, respectively, under a 5-fold cross-validation setup. In contrast, we manually create 5,535 code-mixed questions and train a CNN model that achieves accuracies of 87.21% and 83.56% for coarse and fine answer-types, respectively, under 5-fold cross-validation.
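As a reminder of how the agreement statistic used in the human evaluation of CMQG works, the sketch below computes Cohen's Kappa for one pair of annotators over the labels 'same' and 'different'. Cohen's Kappa is defined for two raters, so with three annotators it would be computed per pair; how the pairs were combined here is not stated, and the label sequences below are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[lab] * count_b[lab]
                   for lab in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["same", "same", "different", "same"]
annotator_2 = ["same", "different", "different", "same"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.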
Results of CMQA for both datasets are shown in Table 3. The performance of the IR based baseline (Chandu et al., 2017) on both datasets is poor. This may be because the system of Chandu et al. (2017) was mainly developed to answer pure factoid questions based only on named entities denoting person, location and organization. However, the datasets used in this experiment have different types of answers beyond the basic factoid questions. We also perform a cross-domain experiment, where the test data of CM-MMQA is used to evaluate the system trained on CM-SQuAD. The performance improvements of our proposed model over the baselines are statistically significant (p < 0.05). Experiments show that the performance on CM-MMQA is better than on CM-SQuAD. This might be due to the relatively shorter passages in CM-MMQA, from which answers are easier to extract.
We perform an ablation study to observe the effects of the various components of the CMQA model. Results are shown in Table 5. The component convolution refers to the convolution operation performed before the Bi-GRUs in sequence encoding.

Error Analysis
We analyze the errors encountered by our CMQG and CMQA systems. The CMQG algorithm uses several NLP components such as a PoS tagger, NE tagger, translation, transliteration, etc. The errors occurring in these components propagate to the final question generation. We list some of the major causes of errors with examples in Table 6. In (1), the algorithm could not find the correct lexical translation in the lexical table itself and therefore selected an irrelevant word, 'Czechoslovakia', instead of 'occupy'. In (2) and (4), the algorithm picked the words 'died' and 'population' instead of 'death' and 'demographics', respectively. This could be because the words 'died' and 'population' have higher n-gram frequencies than the words 'death' and 'demographics' in the n-gram corpus. In (3), the system generated an incorrect word ('imef') instead of 'IMF'. Here, the Hindi word 'आईएमएफ' is incorrectly tagged as 'Other' instead of 'Organization'. Thereafter, the transliteration system provides an incorrect transliteration ('imef') of the abbreviated Hindi word 'आईएमएफ' (Trans: IMF). We observe that our CMQA system sometimes incorrectly predicts answer words that are very close to some other word in the shared embedding space (c.f. section 5.4) and hence get a high attention score in the bilinear attention module. For example, in the passage '...India was ruled by the Bharata clan and ...', the system predicted the answer 'India' instead of 'Bharata' (the reference answer), because the word 'Bharata' is the transliterated form of भारत, and भारत is the correct translation of the word 'India'.
Our close analysis of the predictions on the CM-SQuAD and CM-MMQA development data reveals that the systems suffer mostly from errors where the answer strings are relatively long. The CM-MMQA dataset has some definitional questions (requiring answers that are at least one sentence long). We evaluate the performance on the CM-MMQA dataset after removing those questions (92), and obtain EM and F1 scores of 40.50% and 53.73%, respectively. These are much higher than the scores obtained when all the questions are considered (28.14%, 46.25%). Due to ambiguity in selecting answers (e.g. between two candidate answers of the location type), the system sometimes predicts incorrectly. We also observed some other types of errors, which were mainly due to context mismatch as well as long-distance dependencies between the answer and the context words.

Conclusion
In this work, we have proposed a linguistically motivated unsupervised algorithm for CMQG and a bilinear attention and answer-type focused neural framework for CMQA. We have evaluated the performance of CMQG on manually created code-mixed questions involving English and Hindi. For CMQA, we have created two CMQA datasets. Experiments show that our proposed models attain state-of-the-art performance. In the future, we would like to extend our work to other code-mixed languages.