Transliteration Better than Translation? Answering Code-mixed Questions over a Knowledge Base

Humans can learn multiple languages. If they know a fact in one language, they can answer a question in another language they understand. They can also answer Code-mix (CM) questions: questions which contain both languages. This behavior is attributed to the unique learning ability of humans. Our task aims to study whether machines can achieve this. We demonstrate how effectively a machine can answer CM questions. In this work, we adopt a two-phase approach to answering questions: candidate generation followed by candidate re-ranking. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. We show experiments on the SimpleQuestions dataset. Our network is trained only on the English questions provided in this dataset and on noisy Hindi translations of these questions, and it can answer English-Hindi CM questions effectively without translating them into English. Back-transliterated CM questions outperform their lexical and sentence-level translated counterparts by 5% and 35% in accuracy respectively, highlighting the efficacy of our approach in a resource-constrained setting.


Introduction
Question Answering (QA) has received significant attention in the Natural Language Processing (NLP) community. There are many variations (open-domain, knowledge bases, reading comprehension) as well as datasets (Hopkins et al., 2017; Rajpurkar et al., 2016; Bordes et al., 2015) for the question answering task. However, many approaches (Lukovnikov et al., 2017; Yin et al., 2016; Fader et al., 2014; Chen et al., 2017a; Hermann et al., 2015) attempted in QA so far have focused on monolingual questions. This is true of methods and techniques as well as resources.
Code-mixing (referred to as CM) refers to the phenomenon of "embedding of linguistic units such as phrases, words and morphemes of one language into an utterance of another language" (Myers-Scotton, 2002). People in multilingual societies commonly use code-mixed sentences in conversations (Grover et al., 2017), to search on the web (Wang and Komlodi, 2016) and to ask questions (Raghavi et al., 2017). However, current Question Answering (QA) systems do not support CM and are designed to work with a single language only. This limitation makes such systems unsuitable for multilingual users who want to interact with them naturally, specifically in scenarios where they do not know the right word in the target language.
CM presents serious challenges for the language processing community (Çetinoğlu et al., 2016; Vyas et al., 2014), including parsing, Machine Translation (MT), automatic speech recognition (ASR), information retrieval (IR) and extraction (IE), and semantic processing. Even for problems such as language identification or part-of-speech tagging, which are considered solved for monolingual text, performance degrades when mixed-language text is present. The lack of language resources such as annotated corpora, part-of-speech taggers and parsers poses a considerable challenge for automated processing and analysis of CM languages. This further amplifies the challenge for CM QA. The CM question answering task is challenging not only because it involves multiple languages with different semantics but also because the word order of CM text differs from that of the source language, making it difficult to extract essential features from the input text.
We base our work on the premise that humans can answer CM questions easily provided they understand the languages used in the question. They require no additional training in the form of CM questions to comprehend a CM question. One way to tackle CM questions is to translate them into a single language and use monolingual QA systems (Lukovnikov et al., 2017; Yin et al., 2016; Fader et al., 2014). However, Machine Translation systems perform poorly on CM sentences, so the only other viable option is lexical translation (word-by-word translation). Lexical translation requires language identification, which Bhat et al. (2018) show to be solved. We show that our model, trained on both English and Hindi, performs better on CM questions directly than on their lexical translations. This removes the need to obtain a large bilingual mapping of words for lexical translation. Such a sizeable bilingual mapping may also be hard to obtain for low-resource languages.
Knowledge Bases (KBs) like Freebase (Google, 2017) and DBpedia 1 contain a vast wealth of information. Information is structured in the form of tuples, i.e. a combination of subject, predicate and object (s, p, o) in these KBs. Such KBs contain information predominately in English, and low resource languages tend to lose out on having a rich information source.
We use bilingual embeddings to fill the gaps due to lack of resources. We also develop a K-Nearest Bilingual Embedding Transformation (KNBET) which exploits bilingual embeddings to outperform lexical translation.
We overcome the challenges discussed above and develop a CM QA system over a KB, named CMQA, using only monolingual data from the individual languages. We demonstrate our system with Hinglish (Matrix language: Hindi, Embedded language: English) CM questions. Our evaluation shows promising results given that no CM data was used to train our model. This suggests that we do not need CM data but can use monolingual data to train a CM QA system. Our results show that our system is much more useful than translating a CM question.
1 http://dbpedia.org/
Our contributions are as follows:
1. We show how we can answer CM questions given an English corpus, noisy Hindi supervision and imperfect bilingual embeddings.
2. We introduce a Triplet-Siamese-Hybrid Convolutional Neural Network (TSHCNN) that jointly learns to rank candidate answers.
3. We provide a test dataset of 250 Hindi-English CM questions to researchers. This dataset is mapped to Freebase tuples and English questions from the SimpleQuestions dataset.
To the best of our knowledge, we are the first to tackle the problem of End-to-End Code-Mixed Question Answering over Knowledge Bases in a resource-constrained setting. Earlier approaches for CM QA (Raghavi et al., 2017) require a bilingual dictionary to translate words to English and an existing Google-like search engine to get answers; we require neither.
The rest of the paper is structured as follows: we survey related work in Section 2 and describe the task in Section 3. We explain our system in Section 4. We describe experiments in Section 5, provide a detailed analysis and discussion in Section 6, and conclude in Section 7.

Question Answering and Knowledge Bases
Question answering is a well-studied problem over knowledge bases (KBs) (Lukovnikov et al., 2017; Yin et al., 2016; Fader et al., 2014) and in the open domain (Chen et al., 2017a; Hermann et al., 2015). Learning-to-rank approaches have also been applied to QA successfully (Agarwal et al., 2012; Bordes et al., 2014). Many earlier works (Ture and Jojic, 2017; Yu et al., 2017; Yin et al., 2016) that tackle SimpleQuestions divide the task into two steps, mention detection and relation prediction, whereas our model does both jointly. The approach of Lukovnikov et al. (2017), who train a neural network in an end-to-end manner, is more similar to ours.

Code-Mixing and Code-Switching
Code-mixing and code-switching have recently gathered much attention from researchers (Bhat et al., 2018; Rijhwani et al., 2017; Raghavi et al., 2015, 2017; Banerjee et al., 2016; Dey and Fung, 2014; Bhat et al., 2017). CM research is mostly confined to developing parsers and other language pipeline primitives (Bhat et al., 2018, 2017). There has been some work in CM sentiment analysis (Joshi et al., 2016). Raghavi et al. (2015) demonstrate question type classification for CM questions, and Raghavi et al. (2017) demonstrate a CM factoid QA system that searches for the lexically translated CM question using Google Search, on a small dataset of 100 CM questions. To the best of our knowledge, there has been no work on building an end-to-end CM QA system over a KB.

Bilingual Embeddings
Recent work has shown that it is possible to obtain bilingual embeddings using only a minimal set of parallel lexicons (Smith et al., 2017; Artetxe et al., 2017; Ammar et al., 2016; Luong et al., 2015; P et al., 2014) or without any parallel lexicons (Zhang et al., 2017; Conneau et al., 2017). Our approach can use these bilingual embeddings, together with a supervised corpus for a resource-rich language, to enable CM applications for resource-poor languages.

Cross-lingual Question Answering
Closely related is the problem of cross-lingual QA. There have been various approaches (Ahn et al., 2004; Lin and Kuo, 2010; Ren et al., 2010; Ture and Boschee, 2016) to cross-lingual QA. Some approaches (Lin and Kuo, 2010) rely on translating the entire question. Others (Ren et al., 2010) have explored using lexical translations for this task. Recently, Ture and Boschee (2016) proposed models that combine different translation settings. There have been some efforts (Pouran Ben Veyseh, 2016; Hakimov et al., 2017; Chen et al., 2017b) to attempt cross-lingual question answering over knowledge bases.

Task Description
The SimpleQuestions task presented by Bordes et al. (2015) can be defined as follows. Let $K = \{(s_i, p_i, o_i)\}$ be a knowledge base represented as a set of tuples, where $s_i$ represents a subject entity, $p_i$ a predicate (also referred to as a relation), and $o_i$ an object entity. The task of SimpleQuestions is then: given a question represented as a sentence, i.e. a sequence of words $q = \{w_1, ..., w_n\}$, find a tuple $\{\hat{s}, \hat{p}, \hat{o}\} \in K$ such that $\hat{o}$ is the correct answer for question $q$. This task can be reformulated as finding the correct subject $\hat{s}$ and predicate $\hat{p}$ that question $q$ refers to and which characterise the set of triples in $K$ that contains the answer to $q$.
For example, the question "Which city in Canada did Ian Tyson originated from?" can be answered by the Freebase subject entity m.041ftf, representing the Canadian artist Ian Tyson, together with the relation fb:music/artist/origin.

Our System: CMQA
In this section, we describe our system which consists of two components: (1) the Candidate Generation module for finding relevant candidates and (2) a Candidate Re-ranking model, for getting the top answer from the list of candidate answers.

Candidate Generation
Any Freebase tuple can be an answer to our question (specifically, the object in a tuple is the answer to the question). We use an efficient (non-deep learning) candidate retrieval system to narrow down our search space and focus on re-ranking only the most relevant candidates. Solr 2 is an open-source implementation of an inverted index search system. We use Solr to index all our Freebase tuples (FB2M) and then query for the top-k relevant candidates given the question as a query. We use BM25 as the scoring metric to rank results. Since we index Freebase tuples which are in English (translating the entire KB would require a very large amount of effort and we restrict ourselves to using only the provided English KB), any non-English word in the query does not contribute to the matching. This is a limiting factor in candidate generation for CM questions.
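The retrieval step can be illustrated with a small, self-contained BM25 ranker. This is only a sketch standing in for Solr: the whitespace tokenizer, the `k1` and `b` values, the flattened tuple strings and the mixed-script query are all assumptions made for illustration, not details taken from the system described above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; Solr's analyzers differ)."""
    return text.lower().split()


def bm25_top_k(query, docs, k=2, k1=1.2, b=0.75):
    """Return indices of the k best-scoring docs for the query under BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter()
    for d in tokenized:
        doc_freq.update(set(d))

    scored = []
    for idx, d in enumerate(tokenized):
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue  # terms absent from the document add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Hypothetical flattened tuples (subject name + predicate), for illustration only.
tuples = [
    "ian tyson music artist origin",
    "ian tyson people person profession",
    "toronto location location containedby",
]
# Hypothetical mixed-script query: the Devanagari tokens match nothing in the English index.
mixed_query = "ian tyson किस शहर से है"
top_mixed = bm25_top_k(mixed_query, tuples)
top_english_only = bm25_top_k("ian tyson", tuples)
```

In this toy setting the mixed-script query retrieves exactly what its English tokens alone would retrieve, which is the limitation noted above for CM questions.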

Candidate Re-ranking
We use Convolutional Neural Networks (CNNs) to learn the semantic representation for input text (Kim, 2014; Hu et al., 2015; Lai et al., 2015; Cho et al., 2014; Johnson and Zhang, 2015). CNNs learn globally word-order-invariant features and at the same time pick up order in short phrases. This ability of CNNs is important since different languages 3 have different word orders.
Retrieving a semantically similar answer to a given question can be modelled as a classification problem with a large number of classes.
Here, each answer is a potential class and the number of questions per class is small (it could be zero, one or more than one; since we match only the subject and predicate, there could be multiple questions having a common subject and predicate combination). An intuitive approach to tackle this problem would be to learn a similarity metric between the question to be classified and the set of answers. Siamese networks have shown promising results in such distance-based learning methods (Bromley et al., 1993; Chopra et al., 2005; Das et al., 2016).
Our Candidate Re-ranking module is inspired by the success of neural models in various image and text tasks (Vo and Hays, 2016; Das et al., 2016). Our network is a Triplet-Siamese Hybrid Convolutional neural network (TSHCNN), see Figure 1. Vo and Hays (2016) show that classification-siamese hybrid and triplet networks work well on image similarity tasks. Our hybrid model can jointly extract and exchange information from the question and tuple inputs.
All convolution layers share weights in TSHCNN. The fully connected layers are also Siamese and share weights. This weight sharing helps project both questions and tuples into a similar semantic space and reduces the required number of parameters to be learned.
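A toy example may help show what a shared convolution followed by max pooling provides. The sketch below is not the TSHCNN architecture: it applies a single hand-set filter of width 2 to toy two-dimensional embeddings, and every vector, weight and sentence in it is hypothetical. It only illustrates that the same filter can be applied to questions and tuples alike, that the pooled feature ignores where a phrase occurs, and that it still reacts to word order inside the phrase.

```python
def conv_max_pool(tokens, embeddings, filt):
    """Slide one filter over the token sequence and max-pool its responses."""
    width = len(filt)
    dim = len(filt[0])
    zero = [0.0] * dim  # unknown words get a zero vector in this toy setup
    vectors = [embeddings.get(t, zero) for t in tokens]
    responses = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        responses.append(sum(
            w * x
            for weights, vec in zip(filt, window)
            for w, x in zip(weights, vec)
        ))
    return max(responses, default=0.0)


# Hypothetical embeddings and a filter that responds to the bigram "ian tyson".
toy_embeddings = {"ian": [1.0, 0.0], "tyson": [0.0, 1.0]}
shared_filter = [[1.0, 0.0], [0.0, 1.0]]

english_q = "where is ian tyson from".split()
mixed_q = "ian tyson kahan se hai".split()   # hypothetical romanised CM question
kb_tuple = "ian tyson music artist origin".split()
swapped = "tyson ian from".split()

feat_english = conv_max_pool(english_q, toy_embeddings, shared_filter)
feat_mixed = conv_max_pool(mixed_q, toy_embeddings, shared_filter)
feat_tuple = conv_max_pool(kb_tuple, toy_embeddings, shared_filter)
feat_swapped = conv_max_pool(swapped, toy_embeddings, shared_filter)
```

Here the phrase produces the same pooled value whether it appears at the start or in the middle of the input, and the same filter weights serve both the question and the tuple, while swapping the two words lowers the response.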

Additional Input: Concatenate question + tuple
Our initial network only had two inputs (question and tuple), one for each corresponding branch. We further modify our network to provide a third input in the form of the concatenation of question and tuple. This additional input helps our network learn much better feature representations. We discuss this in the results section.
As shown in Figure 1, questions and candidate tuples are provided to our system. Our experiments vary the input questions (English and CM variations of questions), but the candidates (tuples or answers) are always in English. Thus our final answer is always in English.

K-Nearest Bilingual Embedding Transformation (KNBET)
The standard approach given bilingual (say English-French) embeddings (Plank, 2017; Da San Martino et al., 2017; Klementiev et al., 2012) has been to use the English word vector for an English word and the French word vector for a French word. The network is trained only on the English corpora, i.e. using English word vectors only. When the input is, say, a French sentence, French word vectors are used. Bilingual embeddings try to project both the English and French word vectors into the same semantic space, but these vectors are not perfectly aligned and might lead to errors in the network's predictions. We propose to take the average of the k nearest English word vectors for the given French word and use it as the embedding for the French word. For k=1, this reduces to a bilingual lexical dictionary using bilingual embeddings (Vulic and Moens, 2015; Madhyastha and España-Bonet, 2017). Since the bilingual embeddings are not perfectly aligned, Smith et al. (2017) show 4 that precision@k increases as k increases (e.g. for Hindi P@1 is 0.39, P@3 is 0.58 and P@10 is 0.63) when we obtain French (or any other language) translations for an English word. Thus, we conduct experiments with varying values of k and report the best results for the optimal k. Our experiments confirm the efficacy of KNBET. Further, we believe KNBET can be used to improve the performance of any multilingual system that uses bilingual embeddings.
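The transformation described above can be sketched in a few lines. The sketch assumes cosine similarity as the nearness measure and uses made-up two-dimensional vectors and a hypothetical romanised Hindi word; none of these choices is stated in the text, so treat them as illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knbet(foreign_vec, english_vecs, k=3):
    """Average the k English vectors nearest to foreign_vec (cosine is an assumption)."""
    ranked = sorted(
        english_vecs.items(),
        key=lambda item: (-cosine(foreign_vec, item[1]), item[0]),
    )
    nearest = [vec for _, vec in ranked[:k]]
    dim = len(foreign_vec)
    return [sum(vec[i] for vec in nearest) / len(nearest) for i in range(dim)]


# Hypothetical, already-aligned toy vectors.
english_vecs = {"city": [1.0, 0.0], "town": [0.8, 0.2], "song": [0.0, 1.0]}
shahar = [0.9, 0.1]  # hypothetical vector for a Hindi word meaning "city"

k1_vec = knbet(shahar, english_vecs, k=1)  # behaves like a dictionary lookup
k2_vec = knbet(shahar, english_vecs, k=2)  # average of "city" and "town"
```

With k=1 the result is simply the single nearest English vector, matching the dictionary-lookup reading given above; larger k smooths over alignment errors by averaging several near neighbours.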

Loss Function
We use the distance-based logistic triplet loss (Vo and Hays, 2016), which gave better results than a contrastive loss (Bordes et al., 2014). Vo and Hays (2016) also report that it exhibits better performance in image similarity tasks. Here, $S_{pos}$ and $S_{neg}$ are the similarity scores obtained by the question+positive tuple and question+negative tuple respectively.
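The loss formula itself is not reproduced in this section. As a hedged sketch only, one common logistic form of a triplet loss over similarity scores is $\log(1 + e^{S_{neg} - S_{pos}})$; whether this matches the exact formulation used here is an assumption, and the function name below is hypothetical.

```python
import math


def logistic_triplet_loss(s_pos, s_neg):
    """Assumed form: log(1 + exp(s_neg - s_pos)), computed in a numerically stable way."""
    x = s_neg - s_pos
    # For x >= 0: x + log(1 + exp(-x)); for x < 0: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


loss_tied = logistic_triplet_loss(0.0, 0.0)      # equals log(2)
loss_good = logistic_triplet_loss(3.0, 1.0)      # positive scored higher: small loss
loss_bad = logistic_triplet_loss(1.0, 3.0)       # negative scored higher: large loss
loss_extreme = logistic_triplet_loss(0.0, 1000.0)  # no overflow for large gaps
```

Under this form the loss shrinks towards zero as the positive tuple's score exceeds the negative tuple's score, and grows roughly linearly when the ordering is wrong.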

Dataset
We use the SimpleQuestions (Bordes et al., 2015) dataset which comprises 75.9k/10.8k/21.7k training/validation/test questions. Each question is associated with an answer, i.e. a tuple (subject, predicate, object) from a Freebase (Google, 2017) subset (FB2M or FB5M). The subject is given as a MID 5 and we obtain its corresponding entity name by processing the Freebase data dumps. We were unable to obtain entity name mappings for some MIDs, and these were removed from our final set. We also obtain Hindi translations for all questions in SimpleQuestions using Google Translate. Note, these translations are not perfect and serve as a noisy input to the network. Also, we only translate the questions, and the answers remain in English. As with previous work, we show results over the 2M-subset of Freebase (FB2M).
We use pre-trained word embeddings 6 provided by Fasttext (Bojanowski et al., 2016) and use alignment matrices 7 provided by Smith et al. (2017) to obtain English-Hindi bilingual embeddings. Smith et al. (2017) use a small set of 5000 words to obtain the alignment matrices. The provided Hindi embeddings are in Devanagari script. We use randomly initialised embeddings between [-0.25, 0.25] for words without embeddings.
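The embedding preparation described above can be sketched as follows. The sketch assumes that an alignment matrix is applied by right-multiplying a row vector, and that out-of-vocabulary vectors are drawn uniformly from [-0.25, 0.25]; the matrix, the words and the helper names are hypothetical and chosen only to keep the example small.

```python
import random


def align(vec, matrix):
    """Map a row vector into the shared space: vec @ matrix (assumed convention)."""
    return [sum(v * m for v, m in zip(vec, column)) for column in zip(*matrix)]


def lookup(word, table, dim, rng):
    """Return the word's vector, creating a random one in [-0.25, 0.25] if missing."""
    if word not in table:
        # Uniform sampling is an assumption; the text only gives the range.
        table[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return table[word]


# Hypothetical 2-d alignment matrix (a 90-degree rotation) and toy vocabulary.
toy_matrix = [[0.0, 1.0], [-1.0, 0.0]]
aligned = align([1.0, 0.0], toy_matrix)

rng = random.Random(0)
table = {"shahar": aligned}
oov_first = lookup("kahan", table, dim=2, rng=rng)
oov_again = lookup("kahan", table, dim=2, rng=rng)  # cached, so identical
```

Caching the random vector keeps an unseen word's representation stable across lookups.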
We have prepared a dataset of Hindi-English CM questions for a smaller set of 250 tuples obtained from the test split of the SimpleQuestions dataset. We gathered these questions from Hindi-English speakers, who were shown a tuple and asked to form a natural language CM question. For every tuple we obtained CM questions from 5 different annotators and picked one at random for the final test set, to ensure multiple variations. Each CM question is in Roman script, and annotators anglicise (or transliterate) Hindi words (Devanagari script) to Roman script to the best of their ability. This introduces variations in spelling and poses a challenge for the network and also for back-transliteration.
5 A unique ID referring to an entity in Freebase.
6 https://fasttext.cc/
7 https://goo.gl/Lwgu1D

Generating negative samples
We generate 10 negative samples for each training sample. We follow Bordes et al. (2014) to generate 5 negative samples. These candidates are samples picked at random and then corrupted following Bordes et al. (2014). We further use 5 more negative samples obtained by querying the Solr index. This gives us negative samples which are very similar to the actual answer and further the discriminatory ability of our network. This second policy is unique, and our experiments show that it gives us better performance.
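The two sampling policies can be sketched as below. This is a simplification under stated assumptions: the first policy here just draws random tuples whose subject and predicate differ from the gold pair, and leaves out the corruption step of Bordes et al. (2014); the second takes the top retrieved candidates that are not the gold answer. The function name, the toy KB and the retrieved list are hypothetical.

```python
import random


def make_negatives(gold, kb_tuples, retrieved, n_random=5, n_retrieved=5, seed=0):
    """Combine random negatives with hard negatives taken from retrieval results."""
    rng = random.Random(seed)

    def is_negative(t):
        # A tuple counts as negative if its (subject, predicate) differs from gold.
        return (t[0], t[1]) != (gold[0], gold[1])

    hard = [t for t in retrieved if is_negative(t)][:n_retrieved]
    pool = [t for t in kb_tuples if is_negative(t) and t not in hard]
    easy = rng.sample(pool, min(n_random, len(pool)))
    return easy + hard


# Hypothetical toy KB of (subject, predicate, object) tuples.
kb = [("s%d" % i, "p%d" % (i % 3), "o%d" % i) for i in range(20)]
gold = kb[0]
# Hypothetical retrieval output; it may contain the gold tuple itself.
retrieved = [kb[0], kb[3], kb[6], kb[9], kb[12], kb[15], kb[18]]
negatives = make_negatives(gold, kb, retrieved)
```

The retrieved negatives share surface words with the question, so they are harder to separate from the gold tuple than random ones.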

Evaluation and Baselines
We report results using the standard evaluation criteria (Bordes et al., 2015), in terms of path-level accuracy, which is the percentage of questions for which the top-ranked candidate fact is correct. A prediction is correct if the system correctly retrieved the subject and the relationship.
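As a small illustration of the metric, the sketch below counts a prediction as correct when its subject and predicate both match the gold tuple, and reports the percentage. The function name and the example predictions are hypothetical.

```python
def path_level_accuracy(predictions, gold):
    """Percentage of questions whose top (subject, predicate) matches the gold pair."""
    if not gold:
        return 0.0
    correct = sum(
        1 for pred, ref in zip(predictions, gold)
        if pred[0] == ref[0] and pred[1] == ref[1]
    )
    return 100.0 * correct / len(gold)


# Hypothetical gold pairs and top-ranked predictions for four questions.
gold_pairs = [
    ("m.041ftf", "fb:music/artist/origin"),
    ("m.a", "fb:p1"),
    ("m.b", "fb:p2"),
    ("m.c", "fb:p3"),
]
predicted_pairs = [
    ("m.041ftf", "fb:music/artist/origin"),  # correct
    ("m.a", "fb:p9"),                        # right subject, wrong predicate
    ("m.b", "fb:p2"),                        # correct
    ("m.z", "fb:p3"),                        # wrong subject
]
accuracy = path_level_accuracy(predicted_pairs, gold_pairs)
```

Because only the subject and predicate are compared, the object of the predicted tuple does not enter the calculation.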
Since there is no earlier work on CM QA over KBs, we compare the different ways a CM question can be answered using our QA system. We also report results for the English questions in SimpleQuestions with our model trained only on English. This serves as a benchmark for our model against other work on SimpleQuestions (Ture and Jojic, 2017; Yu et al., 2017; Yin et al., 2016; Lukovnikov et al., 2017; Golub and He, 2016; Bordes et al., 2015).
Network parameters and decisions are presented in Table 1. We train our model until the validation loss has not improved for 3 epochs. We report the results for the epoch with the best validation loss. We use K = 200 for the initial candidate generation step.
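The stopping rule can be sketched as a loop over per-epoch validation losses. The loss values and the function name are made up; only the patience of 3 epochs and the choice of the best-loss epoch come from the text.

```python
def train_with_patience(val_losses, patience=3):
    """Return (best_epoch, stopped_at) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stopped_at = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        stopped_at = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, stopped_at


# Hypothetical validation losses; the last value is never reached.
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
best_epoch, stopped_at = train_with_patience(losses)
```

In this example training stops after three epochs without improvement, and the model from epoch 2 (zero-indexed) would be the one reported.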

Quantitative Analysis
In Table 2, we present end-to-end results using our CMQA system. It shows competi-
In Table 3, we report candidate generation results. We obtain candidates for each CM question variation using the question itself as a query. Further, cm-tl has words in Devanagari script which do not contribute to the search similarity scores when searching over an English corpus. Thus we use the candidates obtained for the lexical translation of cm-tl questions as candidates for cm-tl. This variation, with candidates of cm-lt and questions of cm-tl, is termed cm-lt-tl. Additionally, we show results using the candidates obtained for the English question as the candidates for all three CM question variations (cm-mt, cm-lt and cm-tl). This ensures a fair comparison of all three CM question variations using TSHCNN.
In Table 4, we show results on the CM questions. Our model TSHCNN, trained on both English and Hindi questions, gives the best scores. It is better by 3-8% for various CM question variations. In principle, training only on English and using bilingual embeddings should offer performance that matches training on both English and Hindi. However, this does not happen since the bilingual embeddings are not perfect (see subsection KNBET). We do an ablation study of the various components and describe them in more detail below.

Monolingual vs Bilingual Embeddings
Results clearly show that improvements are obtained when we use bilingual embeddings. There is an improvement of 17% for cm-tl questions when the network is trained on English and Hindi using bilingual embeddings versus using monolingual embeddings. This is because bilingual embeddings project words with similar semantics more closely. This difference is much more pronounced when we train the network only on Hindi questions. The tuples were still in English, and the misaligned semantic space (when using monolingual embeddings) for English and Hindi made it difficult for the Siamese network to learn anything meaningful. We can also observe an improvement of 11% for cm-lt questions (when trained on English and Hindi questions and using bilingual embeddings). We attribute this to the fact that CM questions have a different word order than English questions. Moreover, with the use of bilingual embeddings, our network can project both Hindi and English questions into the same semantic space, which in turn helps CM questions. The effect of monolingual embeddings is visible when we train only on Hindi. We notice accuracies for all CM question variations drop significantly.

Examples
Example 1:
CA: (have wheels will travel, book written work subjects, family)
English Question: what is the have wheels will travel book about?
Predicted Answer: (have wheels will travel, book written work subjects, adolescence)

K-Nearest Bilingual Embedding Transformation (KNBET)
With k = 3, the results obtained with KNBET are higher by 16% for cm-tl trained only on English compared to no KNBET. This demonstrates that our transformation increases the effectiveness of bilingual embeddings. We attribute this to the fact that our transformation reduces the errors that bilingual embeddings may otherwise contain due to imperfect alignment.

Training on Hindi Questions
Training with Hindi questions helps the network learn the different word orders that are present in Hindi questions. This improves scores for cm-tl questions when trained only on Hindi. Further, joint training on both English and Hindi questions gives us the best results.

SCNS: Using Solr Candidates as Negative Samples
We ran experiments using 10 negative samples generated as per Bordes et al. (2014). However, the scores obtained when using a combination of both negative sample generation policies, corrupted tuples and Solr candidates, were 12.7% higher. This is a significant improvement in scores.

CQT: Additional Input, Concatenate question + tuple 10
We obtain an improvement of 34%-62% in our scores when we provide additional input in the form of the concatenated question and tuple. One plausible explanation for this improvement is that the network has 50% more features. To verify this, we added more filters to our convolution layer such that the total number of features equalled that obtained when the additional input was provided. However, the improvement in results was only marginal. Another, more likely explanation is that the max pooling layer picks out the dominant features from this additional input, and these features increase the discriminatory ability of our network.
10 We made sure that the experiments with no CQT had the same number of features as those with CQT.

EC: English candidates
We perform experiments wherein we use the same set of candidates obtained for English questions as the candidates for all CM question variations (cm-mt, cm-lt and cm-tl). Results show that cm-tl questions give the highest scores on a network trained on both English and Hindi questions using bilingual embeddings. This result shows that lexical translation might not be the best strategy to tackle CM questions. Further, more techniques should be devised to handle the CM question in its original form rather than translating it at the sentence or lexical level.

Qualitative Analysis
In Table 5, some examples are shown to depict how transliterated CM questions fare better than their translated counterparts. Example 1 shows that machine translation fails to translate the CM question correctly. The predicted answer is hence incorrect. Example 2 highlights the limitations of lexical translation. Lexically translated questions lose their intended meaning if a word has multiple possible translations, which results in an incorrect prediction.

Conclusion
This paper proposes techniques for Code-Mixed Question Answering over a Knowledge Base in the absence of direct supervision of CM questions for training neural models. We use only monolingual data 11 and bilingual embeddings to achieve promising results. Our TSHCNN model shows impressive results for English QA. It outperforms many other complicated architectures that use Bi-LSTMs and attention mechanisms. We also introduce two techniques which significantly enhance results. KNBET reduces the errors that may exist in bilingual embeddings and could be used by any system working with bilingual embeddings. Additionally, negative samples obtained through Solr are useful for the network to learn to differentiate between fine-grained inputs. Despite imperfect bilingual embeddings, our model shows impressive results for CM QA. Our experiments highlight the need for a CM QA system, since CM questions in their original form outperform translated CM questions.
11 The language identification system uses CM data. We could instead use a rule-based system using no CM data without much loss in performance.