Learning with Limited Data for Multilingual Reading Comprehension

This paper studies the problem of supporting question answering in a new language with limited training resources. In the extreme scenario where no such resource exists, one can (1) transfer labels from another language, using a translator, or (2) generate labels from unlabeled data, using an automatic labeling function. However, both approaches inevitably introduce noise into the training data, due to translation or generation errors, which calls for a judicious use of data with varying confidence. To address this challenge, we propose a weakly-supervised framework that quantifies the noise in automatically generated labels, in order to deemphasize or fix noisy data during training. On a reading comprehension task, we demonstrate the effectiveness of our model on low-resource languages with varying similarity to English, namely Korean and French.


Introduction
Reading comprehension question answering (RCQA) is a well-known NLP task of answering questions based on text understanding. For RCQA, a widely used resource is SQuAD (Rajpurkar et al., 2016), with 100K human-created QA pairs, followed by NarrativeQA (Kočiskỳ et al., 2018), SQuAD 2.0 (Rajpurkar et al., 2018), and CoQA (Reddy et al., 2018). However, as these datasets support only English, supporting other languages requires either annotation efforts on a comparable scale (Lim et al., 2018), or modeling efforts to overcome the limitations of training resources in terms of quantity or quality. Our work pursues the latter goal.
To illustrate, consider the extreme scenario of bootstrapping an RCQA model for a new language with no labeled resource. We can overcome the limitation in quantity by generating alternative low-quality resources. First, neural machine translation (NMT) can convert existing English annotations into a target language (Faruqui and Kumar, 2015; Ture and Boschee, 2016). For example, a (passage p, question q, answer a) triple in SQuAD can be translated into (p', q', a') in the target language (Asai et al., 2018). Second, an automatic labeling function (Du and Cardie, 2018; Serban et al., 2016) can be adopted to generate synthetic labels in the target language. For example, Du and Cardie (2018) leverage a Question Generator (QG) and an Answer Extractor (AE) as labeling functions for RCQA, to create a virtually unlimited amount of training resources from unlabeled corpora. However, overcoming the quantity limitation with the above methods leads to quality problems: some of the following assumptions may cease to hold after generation errors are introduced. (1) Answerability: the semantics of the translated passage or generated question may shift, so that a question becomes unanswerable after generation. (2) Answer alignment: the answer span in the target language may become incorrect. Our goal is to estimate, and even improve, the quality of resources, to overcome the limited quality of training data.

* The first two authors contributed equally to this work. † Correspondence to seungwonh@yonsei.ac.kr
To this end, we exploit noisy data from the above two sources, translator and labeling function, treated as weak labels, and pursue robust learning that overcomes the noise. Our key contribution is the Refinery network, which predicts the confidence, i.e., the quality of a (p, q, a) instance, as a score in the range of 0 to 1. While most existing weak-supervision approaches focus on generating positive examples, we also leverage synthetic negative examples for training Refinery, optimizing it to distinguish positive examples from negative ones.
As shown in Figure 1 (an overview of our approach), the training procedure of QA and Refinery is iterative: the QA model is used for confidence estimation, and Refinery is used to determine which noisy instances should be deemphasized in training, or whether to modify wrong labels, in the direction of improving the QA model. As our framework does not re-train or access the data generation models, our approach is generator-agnostic, and can extend to generation models not discussed here. For experiments on RCQA in a new language, our model is evaluated on three human-annotated sets. Our experiments show that our method with Refinery augments training data and outperforms all state-of-the-art baselines. This observation generalizes over similar and distant language pairs (English with French and Korean, respectively), and over settings with and without significant topic overlap: to validate this claim, we evaluate on widely adopted human-generated evaluation sets over Korean and French Wikipedia documents. In addition, our reported gains are orthogonal to the choice of pre-trained model, such that we can easily leverage the strength of the latest BERT model (Devlin et al., 2018) and contribute additional gains. Our implementation is available at: https://github.com/lkj0509/multilingual MRC.

Preliminaries
The RCQA problem is defined based on the SQuAD dataset (Rajpurkar et al., 2016) in English, which consists of 23,215 passages and 100K+ question-answer pairs requiring an understanding of the corresponding passages to answer correctly. Let $p$ be a passage with $m$ words, i.e., $p = \{w^p_1, w^p_2, \ldots, w^p_m\}$, and $q$ be a question with $n$ words, i.e., $q = \{w^q_1, w^q_2, \ldots, w^q_n\}$. Then, given a pair of passage $p$ and question $q$, the objective of RCQA is to estimate a consecutive answer span $a = \{w^p_i, w^p_{i+1}, \ldots, w^p_j\}$, where $a \subseteq p$. For evaluation, the estimated answer span $a$ is compared with the ground-truth answer span $a^*$ in terms of word-level F1 score.
The majority of RCQA models (Seo et al., 2016; Yu et al., 2018; Devlin et al., 2018) consist of (a) encoding the passage and question into fixed-size vectors, then (b) decoding to predict the probability of each position in the passage being the start or end of an answer span. As a base QA model, we start from the BiDAF model (Seo et al., 2016), a widely known open-source implementation. BiDAF uses bi-directional LSTM networks with an attention mechanism that aligns the question with the passage and vice versa. More specifically, the probabilities of the start and end positions are modeled as:

$$P^s = \mathrm{softmax}(h_1^\top [M_0; M_1]), \quad P^e = \mathrm{softmax}(h_2^\top [M_0; M_2]) \quad (1)$$

where $h_1, h_2$ are trainable weights, and $M_0, M_1, M_2$ are the hidden states at each LSTM layer representing the passage words with respect to the given question. The probability of the $s$-to-$e$ span of the passage is defined as:

$$P(a_{s:e}) = P^s_s \cdot P^e_e \quad (2)$$

At test time, the model selects the answer span with the highest probability (Eq. (2)) during post-processing.
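The span-selection post-processing can be sketched as follows. This is an illustrative implementation rather than the authors' released code, and it assumes the start/end probabilities are given as plain lists; the `max_len` cap is a common practical constraint we add for illustration.

```python
def best_span(p_start, p_end, max_len=30):
    """Select the answer span (s, e), with s <= e, maximizing
    P_start[s] * P_end[e] as in Eq. (2)."""
    best_score, best_arg = -1.0, (0, 0)
    for s, ps in enumerate(p_start):
        # Only consider end positions at or after the start position.
        for e in range(s, min(len(p_end), s + max_len)):
            score = ps * p_end[e]
            if score > best_score:
                best_score, best_arg = score, (s, e)
    return best_arg, best_score
```

For instance, with start probabilities peaking at position 1 and end probabilities peaking at position 2, the selected span is (1, 2).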

Data Generation for New Language
This section introduces how we generate weak supervision, via machine translation and synthetic label generation.

Method I: Neural Machine Translation
SQuAD is a set of $(p, q, a)$ triples, where the answer $a$ to question $q$ can be found as a consecutive substring of the passage $p$ (i.e., $a \subseteq p$). Due to this property, translating the triple $(p, q, a)$ is non-trivial: when $p$ and $a$ are translated into the target language as $p_t$ and $a_t$, $a_t \subseteq p_t$ may no longer hold. Therefore, we need to find the answer span $a_t$ in the target language by matching it with $a_s$ in the source language $s$. We first considered a high-precision baseline, where $a_s$ and $p_s$ are translated independently, and $a_t$ is accepted only when the translation of $a_s$ is found exactly and consecutively in that of $p_s$. However, this suffers from low recall: we found only 53.6% of spans among all translated Q-A pairs with this high-precision method.
To complement it, we propose a perfect-recall alignment that finds the 46.4% of spans that cannot be extracted by the above method. Our method is extensible to any language with an existing open-source NMT model, leveraging only its attention scores (Bahdanau et al., 2014).
(1) One-to-one alignment: A widely adopted simplifying assumption for machine translation is that each target word is aligned to one source-language word (Brown et al., 1993). Based on the attention scores in NMT, each word in $a_s$ can be aligned with the target-language word with the highest score. After this 1-to-1 matching, we select the longest consecutive sub-sequence of matched target words as the answer span $a_t$.
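The one-to-one step can be sketched as below. This is an illustrative reconstruction, not the released implementation: `attn` is assumed to be the decoder's attention matrix of shape (target length, source length), and the answer is given as a set of source word indices.

```python
import numpy as np

def align_answer_one_to_one(attn, src_answer_idx):
    """Map a source-language answer span to the target sentence via NMT
    attention, under the one-to-one alignment assumption.

    Returns the (start, end) indices of the longest consecutive run of
    target words whose argmax-aligned source word lies in the answer span,
    or None if no target word aligns to the answer."""
    # For each target word, the source word it attends to most.
    argmax_src = attn.argmax(axis=1)
    aligned = [t for t in range(attn.shape[0]) if argmax_src[t] in src_answer_idx]
    # Longest consecutive sub-sequence of aligned target positions.
    best, cur = (None, 0), []
    for t in aligned:
        if cur and t == cur[-1] + 1:
            cur.append(t)
        else:
            cur = [t]
        if len(cur) > best[1]:
            best = ((cur[0], cur[-1]), len(cur))
    return best[0]
```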
(2) Span-to-span alignment: The one-to-one assumption is simplistic, treating values in the alignment matrix as binary and excluding the possibility that a word is aligned with multiple target words. Instead, we align span-to-span, calculating a "soft" score between $a_s$ and $a_t$ and changing the span boundary dynamically. To illustrate, assume $a_t = \{i, \ldots, j\}$ is the answer span found by the one-to-one alignment. Before finalizing this answer, we may ask whether changing the boundaries $i$ and $j$ improves the match. Formally, when $S_{m,n}$ denotes the $(m, n)$-th element of the alignment matrix, we evaluate the match score between $a_s$ and $a_t$ by averaging all pairwise combinations over the two spans: $\bar{S} = \mathrm{average}(S_{m,n})$ for $\forall m \in a_s, \forall n \in a_t$. If changing the target boundary to $a_t' = \{i', \ldots, j'\}$ improves this score, we make the modification. Specifically, we consider updating the end position to $j'$ within the range $j \pm N$, for a pre-defined window size $N$, and enqueue a possible update $j'$ if the average score between $a_s$ and $a_t'$ is higher than that between $a_s$ and $a_t$. Similarly, we compare the scores of start positions in the range $i \pm N$ to enqueue a possible update $i'$. After this, the highest-scoring update pair $(i', j')$ in the queue (Hwang and Chang, 2005) is chosen as the alignment.
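A simplified sketch of the boundary refinement is shown below. It is our approximation of the procedure: rather than enqueuing all candidate pairs, it greedily picks the best start and then the best end within the window, using the average pairwise attention score; the candidate-queue details of the paper are not reproduced.

```python
import numpy as np

def refine_span(attn, src_span, tgt_span, window=2):
    """Span-to-span refinement sketch: starting from the one-to-one span
    (i, j), move each boundary within +/- window to maximize the average
    attention score between the source answer span and the target span.

    attn: (tgt_len, src_len) attention matrix; spans are inclusive (start, end).
    """
    tgt_len = attn.shape[0]
    m0, m1 = src_span

    def score(i, j):
        # Average of all pairwise attention scores between the two spans.
        return attn[i:j + 1, m0:m1 + 1].mean()

    i, j = tgt_span
    # Candidate starts must not pass the current end; candidate ends must
    # not precede the chosen start.
    best_i = max(range(max(0, i - window), min(j, i + window) + 1),
                 key=lambda ii: score(ii, j))
    best_j = max(range(max(best_i, j - window), min(tgt_len - 1, j + window) + 1),
                 key=lambda jj: score(best_i, jj))
    return (best_i, best_j)
```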

Method II: Synthetic Label Generation
This section introduces an automatic labeler that generates question-answer pairs from unlabeled text crawled in the target language. Training data collected via NMT is inherently skewed toward domains relevant to English-speaking regions (e.g., the United States), which causes a domain gap with test data in the new language. An automatic labeler can compensate by collecting unlabeled text from the uncovered domains and generating synthetic labels.
For generating synthetic training data, we leverage Question Generator (QG) and Answer Extractor (AE) as labeler, following (Du and Cardie, 2018). Given a passage in a target language, our goal is to generate question-answer pairs, related to the given passage. In (Du and Cardie, 2018), BiLSTM-CRF model (Huang et al., 2015) is used for AE, classifying whether the word belongs to the answer span. For QG, sequence-to-sequence model (Bahdanau et al., 2014) is adopted, by setting a passage and answer candidates as input and question words as output. To train QG and AE, we use weak labels obtained from Method I. At inference time of QG and AE, we insert human-written passages, crawled from web documents in a target language (e.g., Wikipedia), and obtain pairs of question and answer as the output of QG and AE, respectively.
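The labeling pipeline above can be sketched as follows. The `answer_extractor` and `question_generator` callables are stand-ins for the trained AE and QG models (their interfaces here are our assumption): AE returns candidate answer spans for a passage, and QG returns a question for a (passage, answer) pair.

```python
def synthesize_labels(passages, answer_extractor, question_generator):
    """Run the (already trained) AE and QG over unlabeled target-language
    passages to produce synthetic (passage, question, answer) triples."""
    triples = []
    for p in passages:
        for a in answer_extractor(p):          # candidate answer spans
            q = question_generator(p, a)       # question conditioned on (p, a)
            triples.append((p, q, a))
    return triples
```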

Weakly-supervised QA model
In this section, we describe how we score the confidence of noisy training data obtained from the above two generation methods. We propose a new Refinery network that not only estimates the confidence of noisy input from multiple generators but also improves the labels and the QA model. Existing work (Bach et al., 2011) estimates translation error, which can be used as a confidence proxy. However, it requires large amounts of human-corrected labels, and cannot be applied to other generators such as QG. Our goal is confidence estimation regardless of whether data is generated by NMT or QG (generator-agnostic).

Refinery Network
Our Refinery aims to score the quality of the generated (p, q, a) labels, depending on whether an example is positive or negative. It is known that RCQA models trained solely on positive examples fail on unanswerable cases, as they assume a correct answer is guaranteed to exist in the passage (Rajpurkar et al., 2018). A simple remedy is to manually augment the training resources with negative examples, to teach the model to distinguish them from positive ones. Our key contribution is the automatic collection of both positive and negative training examples for training confidence prediction, which we discuss in detail below.
Our research question is then: how do we obtain positive and negative examples? For positive examples, we treat weak labels from the generation methods as "pseudo-positive", since they would be positive if the translation/generation were perfect. For negative examples, we generate synthetic noise, similar to translation/generation errors, by adopting a naive but effective method of changing words or sentences in a positive example (Levy et al., 2017; Zhang et al., 2019; Yang et al., 2019).
Specifically, we modify pseudo-positive labels to generate synthetic negative examples that are unanswerable or have wrong answers. For unanswerable cases, (1) we replace the question with a semantic neighbor derived from other documents (e.g., "who first discovered america?" into "who first discovered canada?"); for this, we use the average of word embeddings and cosine distance. (2) We modify the passage by removing the sentence containing the answer span, so that the modified passage is a subset of the original but unanswerable. Lastly, for wrong answers, (3) we perturb the answer span into a random span in the same passage, which remains answerable but carries an incorrect label. We generate negative examples with methods (1)-(3) in equal proportion (33.3% each). As we show in Section 5, these negative examples are empirically effective for training Refinery.
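The three perturbations can be sketched as below. This is an illustrative reconstruction: `question_sim` stands in for the cosine similarity over averaged word embeddings, and the period-based sentence splitting is a deliberate simplification.

```python
import random

def make_negative(example, kind, corpus_questions=(), question_sim=None, rng=random):
    """Synthesize a negative example from a pseudo-positive
    (passage, question, answer) triple. `kind` selects one of the three
    perturbation methods described in the text."""
    p, q, a = example["passage"], example["question"], example["answer"]
    if kind == "swap_question":
        # (1) Unanswerable: swap in the most similar question from other docs.
        q = max((c for c in corpus_questions if c != q),
                key=lambda c: question_sim(q, c))
    elif kind == "drop_sentence":
        # (2) Unanswerable: drop the sentence containing the answer span
        #     (naive period-based splitting, for illustration only).
        p = ". ".join(s for s in p.split(". ") if a not in s)
        a = None
    elif kind == "wrong_span":
        # (3) Wrong answer: perturb the answer into a random span.
        words = p.split()
        i = rng.randrange(len(words))
        a = " ".join(words[i:i + rng.randint(1, 3)])
    return {"passage": p, "question": q, "answer": a}
```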
The training objective is that the confidence score (CS) of a pseudo-positive example should be greater than that of its negative counterpart by at least a margin $\delta$. To treat pseudo-positive examples unequally, we set the margin $\delta$ dynamically, deriving it from the F1 score of the QA model: we assign a relatively larger margin to a higher-confidence instance, and a smaller one otherwise. The magnitude of the margin depends on the F1 score of the answer predicted by the QA model. Considering the margin $\delta$, the final objective function is:

$$\mathcal{L}_{R} = \sum_i \max\left(0, \; \delta_i - CS^+_i + CS^-_i\right), \quad \delta_i = \alpha \cdot \mathrm{F1}_i \quad (3)$$

where $\alpha$ is a hyper-parameter, $\mathrm{F1}_i$ is the F1 score of the QA model's prediction on the $i$-th instance, $CS^+_i$ indicates the confidence of a positive example, and $CS^-_i$ is that of a negative example.

Now, to obtain the confidence score, we use BiDAF (Seo et al., 2016) as the QA model, as shown in Figure 2 (Left). First, to model the answerability of $(p, q)$, we extract features of $(p, q)$ from the QA model by concatenating the two hidden states $M_1$ and $M_2$ from BiDAF, as mentioned in Section 2.1. These representations are then aggregated by an LSTM and an attention layer:

$$H = \mathrm{LSTM}(W_1 [M_1; M_2] + b_1), \quad \beta_t = \mathrm{softmax}_t\!\left(v_1^\top \sigma(W_2 H_t + b_2)\right), \quad R_{pq} = \textstyle\sum_t \beta_t H_t \quad (4)$$

where $W_1, W_2, v_1, b_1, b_2$ are trainable weights and $\sigma$ is the tanh function. The vector $R_{pq}$, containing information from both passage and question, is used for the confidence score. Second, to model the confidence of the answer label, we represent the answer span in the label by concatenating the two hidden states $M^1_s$ and $M^2_e$ of the $s$-th (start) and $e$-th (end) words of the answer. The final confidence score CS is obtained from $R_{pq}$ and this answer feature:

$$CS = \mathrm{sigmoid}\!\left(v_2^\top \left[R_{pq}; M^1_s; M^2_e\right] + b_3\right) \quad (5)$$

where $v_2$ and $b_3$ are trainable weights, and $[\,;\,]$ indicates concatenation.

An extension of the above model is to replace BiDAF with BERT (Devlin et al., 2018), a model pre-trained on a masked language modeling task. Due to the commonality of BiDAF and BERT, both taking question and passage as input and generating passage representations for answer span prediction, we can apply BERT to our approach, as shown in Figure 2 (Center), by modifying the Refinery module that models the confidence score.
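Numerically, the dynamic-margin ranking objective can be sketched as below. The specific margin form, α times the QA model's F1 on each instance, is our assumption; the paper only states that the margin grows with F1.

```python
def refinery_loss(cs_pos, cs_neg, f1_pos, alpha=0.1):
    """Margin ranking loss sketch: each pseudo-positive confidence should
    exceed its negative counterpart by a margin proportional (assumed) to
    the QA model's F1 on that instance."""
    total = 0.0
    for cp, cn, f1 in zip(cs_pos, cs_neg, f1_pos):
        delta = alpha * f1                     # dynamic per-instance margin
        total += max(0.0, delta - cp + cn)     # hinge on the ranking violation
    return total / len(cs_pos)
```

A confidently separated pair (positive 0.9 vs. negative 0.2) incurs zero loss, while an inverted pair is penalized.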
Meanwhile, architectural differences between the two, such as BiDAF using bi-attention layers and BERT using self-attention layers, require minor changes: in BERT, as the output of the CLS token represents the input sequence of passage and question, we use the output vector of CLS as $R_{pq}$. For the answer-span representation, we use the hidden states of BERT instead of $M$ in Eq. (5). We show the effectiveness of our approach on both the BiDAF and BERT models in Section 5.

Improving QA and Weak Labels
During the iterative procedure over QA and Refinery, our approach improves both the QA model and the weak labels. Using confidence, Refinery performs two functions: (1) down-weighting low-quality instances, and (2) modifying low-quality answers into higher-quality ones.
The objective of the original QA model minimizes the sum of negative log probabilities:

$$\mathcal{L}_{QA} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log P^s_{s_i} + \log P^e_{e_i} \right] \quad (6)$$

where $N$ is the number of examples in the training set, and $s_i$ and $e_i$ are the start and end indices of the $i$-th example, respectively. For down-weighting noise, we combine this QA loss with the confidence score through a weighting function $f(CS(p, q))$:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} f(CS(p_i, q_i)) \left[ \log P^s_{s_i} + \log P^e_{e_i} \right] \quad (7)$$

However, in the initial steps of training, the confidence score is not reliable and may cause unstable training. To avoid this, we apply an annealing technique that adjusts the contribution of the confidence score: we set the initial function $f$ to a uniform constant, and then gradually increase the contribution of instance weighting as learning proceeds:

$$f_t(CS) = \frac{c + \lambda t \cdot CS}{c + \lambda t} \quad (8)$$

where $t$ is the current iteration step, and $c$ and $\lambda$ are hyper-parameters.

Besides down-weighting noisy instances, we can change some answers in weak labels to higher-quality ones, while unanswerable passage-question pairs are difficult to modify. To improve such labels, we compare the confidence scores of the answer $a^*$ predicted by the QA model and the answer $a$ in the weak label. If the confidence increase is greater than a threshold $\gamma$, we change the answer $a$ to the model-generated answer $a^*$ at the end of each epoch. Finally, we train the QA and Refinery modules jointly with Eqs. (3) and (7), updated in turns, until convergence. Algorithm 1 describes this procedure in detail.
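One plausible form of the annealed weighting (our assumption, chosen only to match the stated behavior of starting uniform and gradually approaching the confidence score, with hyper-parameters c and λ) is an interpolation:

```python
def anneal_weight(cs, step, c=10.0, lam=1 / 4000):
    """Annealed instance weight sketch: at step 0 the weight is a uniform
    constant (1.0); as training proceeds, it approaches the confidence
    score cs. The exact schedule is an assumption, not the paper's formula."""
    w = lam * step / (c + lam * step)   # 0 at step 0, -> 1 as step grows
    return (1 - w) * 1.0 + w * cs
```

Early in training all instances contribute equally, so the unreliable confidence estimates cannot destabilize the QA loss; later, low-confidence instances are progressively down-weighted.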

Experiment
In this section, we address the following research questions:
• RQ1: Does our proposed approach outperform existing approaches? Does it generalize across languages with varying distances and topics?
• RQ2: Does our model remain effective in noisier environments?
• RQ3: Does our confidence score effectively distinguish positive examples from negative ones?
For machine translation, we use pre-trained, open-sourced NMT models for French and Korean. The hyper-parameters α, c, λ, and γ are set to 1/10, 10, 1/4000, and 0.5, respectively, tuned on the dev set. For the QG and AE implementations, we follow the settings of the NQG model. For the BiDAF model, we use the Natural Language Toolkit (NLTK) and KoNLPy for the French/Korean tokenizers, and FastText (Mikolov et al., 2017) for word embeddings.

Evaluation of Full QA Model
We compare our QA model with the following baselines:
Base1&2: Asai et al. (2018) translate (p, q) in the target language into English, then test on a pre-trained English QA model; the answer from the model is translated back into the target language. We present their reported F1 scores for French, with and without ELMo (Peters et al., 2018), and leave unreported results blank.
Base3: A semi-supervised method that trains a weak QA model on a small human-labeled set, and evaluates translated QA pairs by the output probabilities (Eq. (2)) of the weak model. For training QA, the translated data above a threshold is used together with the human-labeled data. We present their reported F1 scores and leave unreported results blank.
Ours: Our proposed models are trained on BiDAF and BERT, with and without Refinery (for the ablation purposes described below). Table 1 shows the comparison on En-Kr (distant) and En-Fr (close) language pairs. Similarly, we contrast diverse topic-similarity scenarios: History (distant) and Wiki (close). Compared with Base1&2 (test on source), our model outperforms both, even when Base1 is boosted by ELMo (Peters et al., 2018). Compared with Base3, semi-supervision using 2K human labels outperforms our BiDAF model without Refinery on the Wiki Kr set. However, ours with Refinery outperforms Base3, which means that large and noisy labels refined by our method have quality comparable to small but strong labels.
To show that the effectiveness of our approach is orthogonal to the QA model, we construct two QA models, BiDAF and BERT, as in Table 1. Compared with No weighting, our full model on BiDAF improves F1 by 2.7%, 3.3%, and 2.6%, and on BERT by 2.3%, 2.6%, and 2.7%, on Wiki Fr, Wiki Kr, and History Kr, respectively. The BERT model with our Refinery performs best among all models, adding the power of pre-trained representations.

Ablation Study
As shown in Table 2, we conduct an ablation study on both the BiDAF and BERT models, examining the effect of removing each component. In (A), we replace our confidence score (CS) with the probabilities in Eq. (2), which is similar to the use of confidence in the semi-supervised baseline. We normalize the probabilities within each mini-batch by a softmax function, then apply them to down-weighting. With the replaced confidence, QA performance decreases significantly, suggesting the importance of CS. In (B), we remove the dynamic margin in Eq. (3) by setting the margin to a constant value instead (i.e., δ = 1). Using a static margin lowers the performance of both QA models, suggesting that QA feedback (the dynamic margin) is effective for refining noise. In (C), we remove answer modification (Section 4.2), while preserving the down-weighting module. Compared with this variant, our full model with answer modification improves performance on both QA models. We can also observe the effect of down-weighting alone, as the model without answer modification still outperforms No weighting in Table 1.

For RQ2, we design scenarios with noisier training data to demonstrate robustness. When training QA, we replace positive examples with negative examples, perturbing inputs to fool the model. Figure 3(a) shows the robustness of our BiDAF model over varying noise ratios: as noise increases, the F1 score of the base BiDAF model drops sharply, while that of our full model degrades gracefully.

Evaluation of Confidence Score
This section addresses RQ3, directly examining whether our confidence score effectively distinguishes positive examples from negative ones. As baselines, we use (A) and (B) from the ablation study above: (A) Base1 uses the probability in Eq. (2) as the confidence score; (B) Base2 is Refinery without the dynamic margin (δ = 1). In this experiment, we use the BiDAF model on Wiki Kr and History Kr, and generate negative examples from the positive test set at a 1:1 ratio. As a quantitative evaluation, we measure the following metrics for detecting negative examples, comparing the confidence of positive and negative examples:
• Area under the Receiver Operating Characteristic Curve (AUROC).
• Area under the Precision-Recall curve (AUPR).

As shown in Table 3, our Refinery outperforms the other baselines, with statistical significance at p < 0.01 and by a large margin on all measures, showing that Refinery is effective in distinguishing positive-negative pairs. We also compare our model with and without the dynamic margin from QA feedback, showing that this strategy improves performance.
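For reference, AUROC in this detection setting can be computed with a rank-based, stdlib-only sketch (equivalent to the normalized Mann-Whitney U statistic); this is an illustration of the metric, not the authors' evaluation code.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a randomly chosen positive example receives a
    higher confidence score than a randomly chosen negative one
    (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect detector scores 1.0, a random one 0.5.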

To show the performance of QA and Refinery over iterations, we track their performance on the dev set as iterations continue, in Figure 3(b). In early iterations, Refinery cannot contribute to QA, because the F1 score of all instances is zero. After training stabilizes, the performance of Refinery increases rapidly. Table 4 compares examples scored by the Baseline and our model, where the first two are high-quality data and the third and fourth are not. In the first example, both models give high scores; for the second, the Baseline underestimates the quality while Refinery scores it correctly. The third example is unanswerable for the (p, q) pair, since the words about the subject disappeared during translation; both models assign it low scores. The last example also contains a translation error, where the given question was mistranslated from 'when' to 'how'. In this case, our Refinery network correctly assigns a low confidence score, as the question is no longer answerable after the bad translation, while the Baseline gives a high confidence.

Related Work
Multilingual Task: NMT has played an important role in addressing multilinguality. For example, a straightforward solution for RCQA is to translate (p, q) in the target language to English, apply English-based models, and then translate the answer back (Asai et al., 2018). Beyond question answering, such methods have been successful in sentiment classification (Zhou et al., 2016), relation extraction (Faruqui and Kumar, 2015), and causal commonsense reasoning (Yeo et al., 2018). However, these approaches depend on the quality of translation.
RCQA for Resource-Poor Languages: To overcome the lack of RCQA resources in other languages, prior work proposed a semi-supervised strategy to remove noisy instances: a small seed set of QA pairs is hand-annotated to train a weak QA model, which then indirectly judges the quality of machine-translated SQuAD pairs by treating the output probability as a confidence score, and translated pairs below a fixed threshold are removed. Meanwhile, we eliminate the need for human annotation, by leveraging positive and negative sets built from translated and generated data.
Combining models of QG and QA: Several works (Wang et al., 2017; Tang et al., 2017) propose frameworks jointly optimizing QG and QA models as duals. These approaches share a commonality with ours in generating training data with another model, but with a crucial difference: existing work can use a reliable human-annotated training set, which provides a strong signal for judging the quality of generated data. For example, in (Tang et al.), QG generates questions for a given answer sentence, and the generated question is compared with the original question in the training set. Meanwhile, we do not require human annotation, and instead leverage data automatically generated by NMT and QG models, judiciously guided by our Refinery network.

Conclusion
This paper studies zero-resource training-data generation for supporting RCQA in a new target language. Given the limited resources, we generate RCQA data using existing NMT and QG techniques. To exploit the noisy generated data, we proposed an integrated QA model with Refinery, which controls the generated data via confidence scores. Our results show that our strategies using Refinery and generated data enhance the performance of existing QA models, and that Refinery is effective in distinguishing positive from negative pairs.