Cross-Lingual Training for Automatic Question Generation

Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain, making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) unsupervised pretraining of language models in both the primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. Our proposed framework clearly outperforms a number of baseline models, including a fully-supervised transformer-based model trained on the QG datasets in the primary language. We also create and release a new question answering dataset for Hindi, HiQuAD, consisting of 6555 sentence-question pairs.


Introduction
Automatic question generation from text is an important yet challenging problem, especially when there is limited training data (i.e., pairs of sentences and corresponding questions). Standard sequence to sequence models for automatic question generation have been shown to perform reasonably well for languages like English, for which hundreds of thousands of training instances are available. However, training sets of this size are not available for most languages. Manually curating a dataset of comparable size for a new language would be tedious and expensive. Thus, it would be desirable to leverage existing question answering datasets to help build QG models for a new language. This is the overarching idea that motivates this work. In this paper, we present a cross-lingual model for leveraging a large question answering dataset in a secondary language (such as English) to train models for QG in a primary language (such as Hindi) with a significantly smaller question answering dataset.
We chose Hindi to be one of our primary languages.
There is no established dataset available for Hindi that can be used to build question answering or question generation systems, making it an appropriate choice as a primary language.
We create a new question answering dataset for Hindi (named HiQuAD), available at https://www.cse.iitb.ac.in/~ganesh/HiQuAD/clqg/. Figure 1 shows two examples of sentence-question pairs from HiQuAD along with the questions predicted by our best model. We also experimented with Chinese as a primary language. This choice was informed by our desire to use a language that is very different from Hindi. We use the same secondary language, English, with both choices of our primary language.
Drawing inspiration from recent work on unsupervised neural machine translation (Artetxe et al., 2018; Yang et al., 2018), we propose a cross-lingual model to leverage resources available in a secondary language while learning to automatically generate questions in a primary language. We first train models for alignment between the primary and secondary languages in an unsupervised manner using monolingual text in both languages. We then use the relatively larger QG dataset in the secondary language to improve QG in the primary language. Our main contributions can be summarized as follows: • We present a cross-lingual model that effectively exploits resources in a secondary language to improve QG for a primary language.
• We demonstrate the value of cross-lingual training for QG using two primary languages, Hindi and Chinese.
• We create a new question answering dataset for Hindi, HiQuAD.

Related Work
Prior work in QG from text can be classified into two broad categories.
Rule-based: Rule-based approaches (Heilman, 2011) mainly rely on manually curated rules for transforming a declarative sentence into an interrogative sentence. The quality of the questions generated using rule-based systems highly depends on the quality of the handcrafted rules. Manually curating a large number of rules for a new language is a tedious and challenging task. More recently, Zheng et al. (2018) propose a template-based technique to construct questions from Chinese text, where they rank generated questions using a neural model and select the top-ranked question as the final output.
Neural Network Based: Neural network based approaches do not rely on hand-crafted rules, but instead use an encoder-decoder architecture which can be trained in an end-to-end fashion to automatically generate questions from text. Several neural network based approaches (Du et al., 2017; Kumar et al., 2018a,b) have been proposed for automatic question generation from text. Du et al. (2017) propose a sequence to sequence model for automatic question generation from English text. Kumar et al. (2018a) use a rich set of linguistic features and encode pivotal answers predicted using a pointer network based model to automatically generate a question for the encoded answer. All existing models optimize a cross-entropy based loss function, which suffers from exposure bias (Ranzato et al., 2016). Further, existing methods do not directly address the problem of handling important rare words and word repetition in QG. Kumar et al. (2018b) propose a reinforcement learning based framework which addresses the problems of exposure bias, word repetition and rare words. Tang et al. (2017) and Wang et al. (2017) propose joint models to address QG and question answering together.

All prior work on QG assumed access to a sufficiently large number of training instances for a language. We relax this assumption in our work, as we only have access to a small question answering dataset in the primary language. We show how we can improve QG performance in the primary language by leveraging a larger question answering dataset in a secondary language. (Similar in spirit, cross-lingual transfer learning approaches have recently been proposed for other NLP tasks such as machine translation (Schuster et al., 2019; Lample and Conneau, 2019).)

Our Approach

Figure 2: Schematic diagram of our cross-lingual QG system. W_Epri and W_Esec refer to parameters of the encoder layers specific to the primary and secondary languages; W_Dpri and W_Dsec are the weights of the corresponding decoder layers. W_Eshared and W_Dshared refer to weights of the encoder and decoder layers shared across both languages, respectively. Weights updated in each training phase are explicitly listed.

Algorithm 1 (outline): (1) Denoising autoencoding: train the autoencoder to generate sentence x_p from a noisy sentence x̃_p in the primary language, and similarly x_s from x̃_s in the secondary language. (2) Back-translation: generate sentences x̂_p and x̂_s in the primary and secondary languages from x_s and x_p respectively, using the current translation model.

In Algorithm 1, we outline our training procedure, and Figure 2 illustrates the overall architecture of our QG system. Our cross-lingual QG model consists of two encoders and two decoders specific to each language. We also enforce shared layers in both the encoder and the decoder, whose weights are updated using data in both languages. (This weight sharing is discussed in more detail in Section 3.3.) For the encoder and decoder layers, we use the Transformer (Vaswani et al., 2017) model, which has shown great success compared to recurrent neural network-based models in neural machine translation.
Encoders and decoders consist of a stack of four identical layers, of which two layers are independently trained and two are trained in a shared manner. Each layer of the transformer consists of a multi-headed self-attention model followed by a position-wise fully connected feed-forward network.
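The division of labor among the private and shared layers across the training phases can be summarized in a small sketch. This is purely illustrative: the parameter-group names follow Figure 2, but the function and its phase labels are our own, not the released implementation.

```python
# Which parameter groups are updated in one training step, per Figure 2.
PARAMS = {
    "W_Epri", "W_Dpri",        # primary-language encoder/decoder layers
    "W_Esec", "W_Dsec",        # secondary-language encoder/decoder layers
    "W_Eshared", "W_Dshared",  # encoder/decoder layers shared across languages
}

def updated_weights(phase: str, language: str) -> set:
    """Return the parameter groups updated in one training step.

    phase    : "autoencode", "backtranslate" or "qg"
    language : "primary" or "secondary"; for back-translation, this is the
               language of the original sentence being reconstructed.
    """
    shared = {"W_Eshared", "W_Dshared"}
    if phase in ("autoencode", "qg"):
        # Encode and decode within the same language.
        if language == "primary":
            return {"W_Epri", "W_Dpri"} | shared
        return {"W_Esec", "W_Dsec"} | shared
    if phase == "backtranslate":
        # Reconstruct a secondary sentence from its primary translation:
        # primary-language encoder plus secondary-language decoder
        # (and vice versa for the other direction).
        if language == "secondary":
            return {"W_Epri", "W_Dsec"} | shared
        return {"W_Esec", "W_Dpri"} | shared
    raise ValueError(f"unknown phase: {phase}")
```

The shared groups appear in every phase, which is what lets supervised QG training in one language influence the other.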

Unsupervised Pretraining
We use monolingual corpora available in the primary (Hindi/Chinese) and secondary (English) languages for unsupervised pretraining. Similar to Artetxe et al. (2018), we use denoising autoencoders along with back-translation (described in Section 3.1.1) for pretraining the language models in both the primary and secondary languages. Specifically, we first train the models to reconstruct their inputs, which exposes the model to the grammar and vocabulary specific to each language while enforcing a shared latent space with the help of the shared encoder and decoder layers. To prevent the model from simply learning to copy every word, we randomly permute the word order in the input sentences so that the model learns meaningful structure in the language. If x_p denotes the true input sentence to be generated from the sentence with permuted word order x̃_p in the primary language, then during each pass of the autoencoder training we update the weights W_Epri, W_Eshared, W_Dshared and W_Dpri. For the secondary language, we analogously update W_Esec, W_Dsec and the weights in the shared layers, as shown in Figure 2.
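As a concrete illustration, the noising step could look like the following sketch. The bounded local permutation (each token moves at most k positions) and the optional word-drop probability are assumptions in the spirit of related unsupervised NMT work (Lample et al., 2018); the text above only states that word order is randomly permuted.

```python
import random

def add_noise(tokens, k=3, drop_prob=0.1, seed=None):
    """Produce a noisy sentence x~ from a clean sentence x for denoising
    autoencoder training: locally permute word order (each token moves at
    most k positions) and optionally drop words. A sketch, not the exact
    noising scheme used in the paper."""
    rng = random.Random(seed)
    # Jitter each position by a random offset in [0, k], then sort by the
    # jittered key: tokens move only a bounded distance from their origin.
    keys = [i + rng.uniform(0, k) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]
    # Independently drop each token with probability drop_prob.
    return [tok for tok in shuffled if rng.random() > drop_prob]
```

The autoencoder is then trained to map `add_noise(x)` back to `x`, so copying the input verbatim is never a winning strategy.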

Back translation
In addition to denoising autoencoders, we utilize back-translation (Sennrich et al., 2016a). This further aids in enforcing the shared latent space assumption by generating a pseudo-parallel corpus (Imankulova et al., 2017). Back-translation has been demonstrated to be very important for unsupervised NMT (Yang et al., 2018; Lample et al., 2018). Given a sentence x_s in the secondary language, we generate a translated sentence x̂_p in the primary language. We then use the translated sentence x̂_p to generate the original x_s back, while updating the weights W_Epri, W_Eshared, W_Dshared and W_Dsec, as shown in Figure 2. Note that we utilize denoising autoencoding and back-translation for both languages in each step of training.
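The data flow of one back-translation round can be sketched as follows. The dictionary-based "translator" is a toy stand-in for the current NMT model; only the construction of pseudo-parallel pairs reflects the procedure described above.

```python
# Toy secondary->primary "model": a word-for-word lexicon standing in for
# the current encoder-decoder translation model (hypothetical, for
# illustration only).
sec_to_pri = {"what": "kya", "is": "hai", "this": "yah"}

def translate(sentence, lexicon):
    """Word-by-word 'translation' standing in for the current NMT model."""
    return [lexicon.get(tok, tok) for tok in sentence]

def back_translation_pairs(secondary_corpus):
    """Build pseudo-parallel (primary, secondary) training pairs: the
    model is later trained to reconstruct each original secondary sentence
    x_s from its machine-translated primary counterpart x^_p."""
    pairs = []
    for x_s in secondary_corpus:
        x_p_hat = translate(x_s, sec_to_pri)   # current-model translation
        pairs.append((x_p_hat, x_s))           # (source, target) for training
    return pairs
```

Because the target side of each pair is genuine text, the reconstruction loss pushes the translated source and the original target toward the same latent representation.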

Supervised Question Generation
We formulate the QG problem as a sequence to sequence modeling task, where the input is a sentence and the output is a semantically consistent, syntactically correct and relevant question in the same language that corresponds to the sentence. Each encoder receives a sentence x (from the corresponding language) as input and the decoder generates a question ŷ such that ŷ = argmax_y P(y|x), where P(y|x) = ∏_{t=1}^{|y|} P(y_t | x, y_{<t}), i.e., the probability of each subword y_t is predicted conditioned on all previously generated subwords y_{<t} and the input sentence x. We initialize the encoder and decoder weights using unsupervised pretraining and fine-tune these weights further during supervised QG model training. Specifically, in each step of training, we update the weights W_Esec, W_Eshared, W_Dshared and W_Dsec using QG data in the secondary language, and W_Epri, W_Eshared, W_Dshared and W_Dpri using QG data in the primary language.
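The factorized decoding objective can be made concrete with a small sketch, where `cond_prob` is a toy stand-in for the decoder's softmax over subwords; the real system approximates the argmax by searching over Transformer outputs rather than exhaustively.

```python
import math

def sequence_log_prob(cond_prob, x, y):
    """log P(y|x) = sum over t of log P(y_t | x, y_<t), mirroring the
    chain-rule factorization of the QG objective. `cond_prob(x, prefix,
    token)` stands in for the decoder's softmax."""
    return sum(math.log(cond_prob(x, y[:t], y[t])) for t in range(len(y)))

def greedy_decode(cond_prob, x, vocab, max_len=5, eos="</s>"):
    """Approximate argmax_y P(y|x) by picking the locally best subword at
    each step (a sketch; a real system would typically use beam search)."""
    y = []
    for _ in range(max_len):
        best = max(vocab, key=lambda tok: cond_prob(x, y, tok))
        y.append(best)
        if best == eos:
            break
    return y
```

Greedy decoding maximizes each factor independently, which is why it can miss the globally most probable question; beam search trades compute for a better approximation of the same argmax.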

More Architectural Details
We make three important design choices: 1. Use of positional masks: Shen et al. (2018) point out that transformers are not capable of capturing, within the attention, information about the order of the sequence. Following Shen et al. (2018), we enable our encoders to use directional self-attention so that temporal information is preserved. We use positional encodings which are essentially sine and cosine functions of different frequencies. More formally, the positional encoding (PE) is defined as:

PE(pos, 2i) = sin(pos / m^(2i/d_model))
PE(pos, 2i+1) = cos(pos / m^(2i/d_model))

where m is a hyper-parameter, pos is the position, d_model is the dimensionality of the transformer and i is the dimension. Following Vaswani et al. (2017), we set m to 10000 in all our experiments. Directional self-attention uses positional masks to inject temporal order information. Based on Shen et al. (2018), we define a forward positional mask (M_f) and a backward positional mask (M_b), which process the sequence in the forward and backward directions, respectively.
2. Weight sharing: Based on the assumption that sentences and questions in the two languages are similar in some latent space, in order to obtain a shared language-independent representation, we share the last few layers of the encoder and the first few layers of the decoder (Yang et al., 2018). Unlike Artetxe et al. (2018) and Lample et al. (2018), we do not share the encoder completely across the two languages, thus allowing the encoder layers private to each language to capture language-specific information. We found this to be useful in our experiments.
3. Subword embeddings: We represent data using BPE (Byte Pair Encoding) (Gage, 1994) embeddings. We use BPE embeddings for both unsupervised pretraining as well as the supervised QG training phase. This allows for more fine-grained control over input embeddings compared to word-level embeddings (Sennrich et al., 2016b). This also has the advantage of maintaining a relatively smaller vocabulary size. 2
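Design choices 1 and 3 above can be illustrated concretely. The sketch below implements the sinusoidal positional encoding and the forward positional mask (assuming the standard Vaswani et al. (2017) formulation with m = 10000 and the convention of adding negative infinity to blocked attention logits), plus a single BPE merge step in the spirit of Gage (1994); none of this is the released implementation.

```python
import math
from collections import Counter

def positional_encoding(pos, d_model, m=10000):
    """Sinusoidal PE for one position: sine on even dimensions, cosine on
    odd ones, with wavelength m**(2i/d_model)."""
    pe = []
    for i in range(d_model):
        angle = pos / (m ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def forward_mask(n):
    """Forward positional mask M_f over n positions: position j may attend
    to position i only if i < j (0.0 = allowed; -inf blocks the logit)."""
    neg_inf = float("-inf")
    return [[0.0 if i < j else neg_inf for i in range(n)] for j in range(n)]

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs over a {space-separated word: freq} dict
    and return the most frequent pair (one BPE learning step)."""
    pairs = Counter()
    for word, freq in corpus.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    """Apply one BPE merge, fusing every occurrence of the chosen pair."""
    a, b = pair
    return {w.replace(f"{a} {b}", a + b): f for w, f in corpus.items()}
```

A real BPE vocabulary is built by repeating `most_frequent_pair` / `merge_pair` for tens of thousands of iterations over the training corpus.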

Experimental Setup
We first describe all the datasets we used in our experiments, starting with a detailed description of our new Hindi question answering dataset, "HiQuAD". We will then describe various implementation-specific details relevant to training our models. We conclude this section with a description of our evaluation methods.

HiQuAD
HiQuAD (Hindi Question Answering dataset) is a new question answering dataset in Hindi that we developed for this work. This dataset contains 6555 question-answer pairs from 1334 paragraphs in a series of books called Dharampal Books. 3 Similar to SQuAD (Rajpurkar et al., 2016), an English question answering dataset that we describe further in Section 4.1.2, HiQuAD also consists of a paragraph, a list of questions answerable from the paragraph and answers to those questions. To construct sentence-question pairs, for a given question, we identified the first word of the answer in the paragraph and extracted the corresponding sentence to be paired along with the question. We curated a total of 6555 sentence-question pairs.
We tokenize the sentence-question pairs and remove any extra white spaces. For our experiments, we randomly split the HiQuAD dataset into train, development and test sets.
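The pairing heuristic described above can be sketched as follows; the sentence splitter and matching rule here are simplified assumptions for illustration, not the exact preprocessing script.

```python
import re

def sentence_for_answer(paragraph, answer):
    """Pair a question with the sentence containing its answer: locate the
    answer's first word in the paragraph and return the enclosing sentence.
    A sketch of the HiQuAD construction heuristic."""
    first_word = answer.split()[0]
    # Naive sentence split on the Hindi danda or a period.
    sentences = [s.strip() for s in re.split(r"[।.]", paragraph) if s.strip()]
    for sent in sentences:
        if first_word in sent.split():
            return sent  # first sentence whose tokens contain the word
    return None
```

Each returned sentence is then paired with the question to form one training instance.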

Other Datasets
We briefly describe all the remaining datasets used in our experiments. (The relevant primary or secondary language is mentioned in parentheses alongside the name of each dataset.) IITB Hindi Monolingual Corpus (Primary language: Hindi): We extracted 93,000 sentences from the IITB Hindi monolingual corpus, 4 where each sentence has between 4 and 25 tokens. These sentences were used for unsupervised pretraining.

IITB Parallel Corpus (Primary language: Hindi): We selected 100,000 English-Hindi sentence pairs from the IITB parallel corpus (Kunchukuttan et al., 2018) where the number of tokens in the sentence was greater than 10 for both languages. We used this dataset to further fine-tune the weights of the encoder and decoder layers after unsupervised pretraining.
DuReader (He et al., 2018) Chinese Dataset: (Primary language: Chinese) This dataset consists of question-answer pairs along with the question type. We preprocessed and used "DESCRIPTION" type questions for our experiments, resulting in a total of 8000 instances. From this subset, we created a 6000/1000/1000 split to construct train, development and test sets for our experiments. We also preprocessed and randomly extracted 100,000 descriptions to be used as a Chinese monolingual corpus for the unsupervised pretraining stage.
News Commentary Dataset: (Primary language: Chinese) This is a parallel corpus of news commentaries provided by WMT. 5 It contains roughly 91000 English sentences along with their Chinese translations. We preprocessed this dataset and used this parallel data for fine-tuning the weights of the encoder and decoder layers after unsupervised pretraining.
SQuAD Dataset: (Secondary language: English) This is a very popular English question answering dataset (Rajpurkar et al., 2016). We used the train split of the pre-processed QG data released by Du et al. (2017) for supervised QG training. This dataset consists of 70,484 sentence-question pairs in English.

Implementation Details
We implemented our model in TensorFlow. 6 We used 300 hidden units for each layer of the transformer with the number of attention heads set to 6. We set the size of BPE embeddings to 300. Our best model uses two independent encoder and decoder layers for both languages, and two shared encoder and decoder layers each. We used a residual dropout set to 0.2 to prevent overfitting. During both the unsupervised pretraining and supervised QG training stages, we used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e−5 and batch size of 64.
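For reference, the hyperparameters reported in this section can be collected into one config sketch (the key names are ours, not from the released code):

```python
# Hyperparameters reported in the Implementation Details section,
# gathered into a single dict for convenience.
CONFIG = {
    "hidden_size": 300,          # transformer layer width
    "attention_heads": 6,
    "bpe_embedding_dim": 300,
    "private_layers": 2,         # per-language encoder/decoder layers
    "shared_layers": 2,          # layers shared across both languages
    "residual_dropout": 0.2,
    "optimizer": "adam",
    "learning_rate": 1e-5,
    "batch_size": 64,
}
```

The two private plus two shared layers together give the four-layer encoder and decoder stacks described in Section 3.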

Unsupervised Pretraining
For Hindi as the primary language, we use 93000 Hindi sentences from the IITB Hindi Monolingual Corpus and around 70000 English sentences from the preprocessed SQuAD dataset for unsupervised pretraining. We pretrain the denoising autoencoders over 15 epochs. For Chinese, we use 100000 Chinese sentences from the DuReader dataset for this stage of training.

Supervised Question Generation Training
We used 73000 sentence-question pairs from SQuAD and 4000 sentence-question pairs from HiQuAD (described in Section 4.1.1) to train the supervised QG model in Hindi. We used 6000 Chinese sentence-question pairs from the DuReader dataset to train the supervised QG model in Chinese. We initialize all the weights, including the BPE embeddings, from the pretraining phase and fine-tune them until convergence.

Evaluation Methods
We evaluate our systems and report results using the widely used BLEU (Papineni et al., 2002), ROUGE-L and METEOR metrics. We also performed a human evaluation study to assess the quality of the generated questions. Following Kumar et al. (2018a), we measure the quality of questions in terms of syntactic correctness, semantic correctness and relevance. Syntactic correctness measures the grammatical correctness of a generated question, semantic correctness measures its naturalness, and relevance measures how relevant the question is to the text and its answerability from the sentence.
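To make the automatic metrics concrete, here is a minimal sentence-level BLEU sketch (clipped n-gram precision with a brevity penalty and no smoothing). The reported results use the standard evaluation scripts, not this toy.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```

ROUGE-L (longest common subsequence) and METEOR (stem/synonym matching with a fragmentation penalty) complement BLEU by rewarding overlap that exact n-gram matching misses.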

Results
We present our automatic evaluation results in Table 2, where the primary language is Hindi or Chinese and the secondary language in either setting is English. We do not report on Chinese as a secondary language owing to the relatively poor quality of the Chinese dataset. Here are all the models we compare and evaluate: • Transformer: We train a transformer model (Vaswani et al., 2017) using the QG dataset in the primary language. This serves as a natural baseline for comparison. 7 This model consists of a two-layer encoder and a two-layer decoder.
• Transformer+pretraining: The above-mentioned Transformer model undergoes an additional step of pretraining, in which the encoder and decoder layers are pretrained using monolingual data from the primary language. This model helps further demonstrate the value of cross-lingual training.
• CLQG: This is our main cross-lingual question generation model (described in Section 3) where the encoder and decoder layers are initialized in an unsupervised pretraining phase using primary and secondary language monolingual corpora, followed by a joint supervised QG training using QG datasets in the primary and secondary languages.
• CLQG+parallel: The CLQG model undergoes further training using a parallel corpus (with primary language as source and secondary language as target). After unsupervised pretraining, the encoder and decoder weights are fine-tuned using the parallel corpus. This fine-tuning further refines the language models for both languages and helps enforce the shared latent space across both languages. We observe in Table 2 that CLQG+parallel outperforms all the other models for Hindi. For Chinese, parallel fine-tuning does not give significant improvements over CLQG; this could be attributed to the parallel corpus being smaller in size (when compared to Hindi) and domain-specific (i.e. the news domain).

Discussion and Analysis
We closely inspect our cross-lingual training paradigm using (i) a human evaluation study in Section 6.1, (ii) detailed error analysis in Section 6.2, and (iii) ablation studies in Section 6.3. All the models analyzed in this section used Hindi as the primary language. 8

Table 3: Human evaluation results as well as inter-rater agreement (column "Kappa") for each model on the Hindi test set. The scores are between 0-100, 0 being the worst and 100 being the best. Best results for each metric (column) are in bold. The three evaluation criteria are: (1) syntactic correctness (Syntax), (2) semantic correctness (Semantics), and (3) relevance to the paragraph (Relevance).

Human evaluation
We conduct a human evaluation study comparing the questions generated by the Transformer and CLQG+parallel models. We randomly selected a subset of 100 sentences from the Hindi test set and generated questions using both models. We presented these sentence-question pairs for each model to three language experts and asked for a binary response on three quality parameters, namely syntactic correctness, semantic correctness and relevance. The responses from all the experts for each parameter were averaged for each model to get the final numbers shown in Table 3. Although we perform comparably to the baseline model on syntactic correctness scores, we obtain significantly higher agreement across annotators using our cross-lingual model. Our cross-lingual model performs significantly better than the Transformer model on "Relevance" at the cost of agreement. On semantic correctness, we perform significantly better both in terms of the score and the agreement statistics.

8 Figure 5 shows two examples of correctly generated Chinese questions.

Error Analysis
Correct examples: Figure 3 shows several examples where our model is able to generate semantically and syntactically correct questions. Figure 3b shows that our model can generate questions identical to the human-generated questions. Fig. 3c demonstrates that our model can generate new questions which clearly differ from the human-generated questions but are syntactically correct, semantically correct and relevant to the text. Fig. 3a shows a question which differs from the human-generated question in only a single word, without any loss in quality.
Incorrect examples: We also present a couple of examples in Fig. 4 where our model is unable to generate good questions, and analyze possible reasons for the same. For instance, Fig. 4b shows a question which is syntactically correct and relevant to the main subject, but is not consistent with the given sentence.

Ablation Studies
We performed two experiments to better understand the role of each component in our model towards automatic QG from Hindi text.

Importance of unsupervised pretraining
We construct a model which does not employ any unsupervised or supervised pretraining but uses the same network architecture. This helps in studying the importance of pretraining in our model. We present our results in Table 4. We observe that our shared architecture does not directly benefit from the English QG dataset with simple weight sharing. Unsupervised pretraining (with back-translation) helps the shared encoder and decoder layers capture higher-level language-independent information, giving an improvement of approximately 7 BLEU-4 points. Additionally, using parallel data to fine-tune after unsupervised pretraining aids this process further, improving BLEU-4 scores by around 3 points.

Importance of secondary language resources
To demonstrate the improvement in Hindi QG from the relatively larger English SQuAD dataset, we show results of using only HiQuAD during the main task in Table 5; unsupervised and supervised pretraining are still employed. We obtain modest performance improvements on the standard evaluation metrics (except ROUGE-L) by using English SQuAD data in the main task. These improvements (albeit small) demonstrate that our proposed cross-lingual framework is a step in the right direction towards leveraging information from a secondary language.
How many sentence-question pairs are needed in the primary language?
To gain more insight into how much data is required to be able to generate questions of high quality, Fig. 6 presents a plot of BLEU scores when the number of Hindi sentence-question pairs is varied. Here, both unsupervised and supervised pretraining are employed but the English SQuAD dataset is not used. After significant jumps in BLEU-4 performance using the first 2000 sentences, we see a smaller but steady improvement in performance with the next set of 2000 sentences.

Conclusion
Neural models for automatic question generation using the standard sequence to sequence paradigm have been shown to perform reasonably well for languages such as English, which have a large number of training instances. However, large training sets are not available for most languages.
To address this problem, we present a cross-lingual model that leverages a large QG dataset in a secondary language (along with monolingual and parallel data) to improve QG performance in a primary language with a limited number of QG training pairs. In future work, we will explore the use of cross-lingual embeddings to further improve performance on this task.