Learning to Actively Learn Neural Machine Translation

Traditional active learning (AL) methods for machine translation (MT) rely on heuristics. However, these heuristics are limited when the characteristics of the MT problem change due to, e.g., the language pair or the amount of the initial bitext. In this paper, we present a framework to learn sentence selection strategies for neural MT. We train the AL query strategy using a high-resource language-pair based on AL simulations, and then transfer it to the low-resource language-pair of interest. The learned query strategy capitalises on the shared characteristics between the language pairs to make effective use of the AL budget. Our experiments on three language-pairs confirm that our method is more effective than strong heuristic-based methods in various conditions, including cold-start and warm-start as well as small and extremely small data conditions.


Introduction
Parallel training bitext plays a key role in the quality of neural machine translation (NMT). Learning high-quality NMT models in bilingually low-resource scenarios is one of the key challenges, as NMT's quality degrades severely in such settings (Koehn and Knowles, 2017).
Recently, the importance of learning NMT models in scarce parallel bitext scenarios has gained attention. Unsupervised approaches try to learn NMT models without the need for parallel bitext (Artetxe et al., 2017; Lample et al., 2017a). Dual learning/backtranslation tries to start off from a small amount of bilingual text, and leverage monolingual text in the source and target languages (Sennrich et al., 2015a). Zero/few-shot approaches attempt to transfer NMT models learned in rich bilingual settings to low-resource settings (Johnson et al., 2016; Gu et al., 2018).
In this paper, we approach this problem from the active learning (AL) perspective. Assuming the availability of an annotation budget and a pool of monolingual source text as well as a small training bitext, the goal is to select the most useful source sentences and query their translations from an oracle, up to the annotation budget. The queried sentences need to be selected carefully to get the most value for the budget, i.e. the highest improvements in the translation quality of the retrained model. The AL approach is orthogonal to the aforementioned approaches to bilingually low-resource NMT, and can potentially be combined with them.
We present a framework to learn the sentence selection policy most suitable and effective for the NMT task at hand. This is in contrast to the majority of work in AL-MT, where hard-coded heuristics are used for query selection (Bloodgood and Callison-Burch, 2010). More concretely, we learn the query policy based on a high-resource language-pair sharing similar characteristics with the low-resource language-pair of interest. Once trained, the policy is applied to the language-pair of interest, capitalising on the learned signals for effective query selection. We make use of imitation learning (IL) to train the query policy. Previous work has shown that the IL approach leads to more effective policy learning, compared to reinforcement learning (RL) (Fang et al., 2017). Our proposed method effectively trains AL policies for the batch queries needed for NMT, as opposed to previous work on single query selection.
We conduct experiments on three language pairs: Finnish-English, German-English, and Czech-English. Simulating low-resource scenarios, we consider various settings, including cold-start and warm-start as well as small and extremely small data conditions. The experiments show the effectiveness and superiority of our query policy compared to strong baselines.

Learning to Actively Learn MT
Active learning is an iterative process: Firstly, a model is built using some initially available data. Then, the most worthwhile data points are selected from the unlabelled set for annotation by the oracle. The underlying model is then re-trained using the expanded labeled data. This process is then repeated until the budget is exhausted. The main challenge is how to identify and select the most beneficial unlabelled data points during the AL iterations.
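The iterative process above can be expressed as a short loop. The following is an illustrative sketch only, not the paper's implementation: `train`, `score`, and `oracle` are placeholder callables of ours standing in for model retraining, the query strategy, and the human annotator.

```python
def active_learning_loop(labelled, pool, budget, batch_size, train, score, oracle):
    """Generic pool-based active learning: retrain, select the most
    worthwhile batch, query the oracle, and repeat until the budget runs out."""
    model = train(labelled)
    while budget > 0 and pool:
        k = min(batch_size, budget, len(pool))
        # Rank the unlabelled pool by the query strategy's score; take the top k.
        batch = sorted(pool, key=lambda x: score(model, x), reverse=True)[:k]
        for x in batch:
            pool.remove(x)
            labelled.append((x, oracle(x)))  # oracle reveals the label
        budget -= k
        model = train(labelled)              # retrain on the expanded data
    return model
```

In the MT setting of this paper, `score` is the learned query policy, `oracle` is the human translator, and `train` retrains the underlying NMT model.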
The AL strategy can be learned by attempting to actively learn on tasks sampled from a distribution over tasks (Bachman et al., 2017). We simulate the AL scenario on instances of a low-resource MT problem created using the bitext of the resource-rich language pair, where the translations of some part of the bitext are kept hidden. This allows us to have an automatic oracle to reveal the translations of the queried sentences, resulting in an efficient way to quickly evaluate an AL strategy. Once the AL strategy is learned on simulations, it is then applied to real AL scenarios. The more related the low-resource language-pair in the real scenario is to those used to train the AL strategy, the more effective the learned AL strategy will be.
We are interested in training a translation model m_φ which maps an input sentence from a source language x ∈ X to its translation y ∈ Y_x in a target language, where Y_x is the set of candidate translations for the input x and φ is the parameter vector of the translation model. Let D = {(x, y)} be a support set of parallel corpus, which is randomly partitioned into parallel bitext D^lab, monolingual text D^unl, and evaluation D^evl datasets. Repeated random partitioning creates multiple instances of the AL problem.

Hierarchical MDP Formulation
A crucial difference of our setting from previous work (Fang et al., 2017) is that the AL agent receives the reward from the oracle only after taking a sequence of actions, i.e. the selection of an AL batch, which may correspond to multiple training minibatches for the underlying NMT model. This fulfils the requirements for effective training of NMT, as minibatch updates are more effective than updates on single sentence pairs. Furthermore, it is presumably more efficient and practical to query the translation of an untranslated batch from a human translator, rather than one sentence in each AL round. At each time step t of an AL problem, the algorithm interacts with the oracle and queries the labels of a batch selected from the pool D^unl_t to form b_t. As the result of this sequence of actions to select the sentences in b_t, the AL algorithm receives a reward BLEU(m_φ, D^evl), which is the BLEU score on D^evl of the NMT model retrained using the batch b_t. Formally, this results in a hierarchical Markov decision process (HMDP) for batch sentence selection in AL. A state s_t := ⟨D^lab_t, D^unl_t, b_t, φ_t⟩ of the HMDP at time step t consists of the bitext D^lab_t, the monotext D^unl_t, the current text batch b_t, and the parameters of the currently trained NMT model φ_t. The high-level MDP consists of a goal set G := {retrain, halt_HI}, where setting a goal g_t ∈ G corresponds to either halting the AL process, or passing execution to the low-level MDP to collect a new batch of bitext b_t, re-training the underlying NMT model to get the updated parameters φ_{t+1}, receiving the reward R_HI(s_t, a_t, s_{t+1}) := BLEU(m_{φ_{t+1}}, D^evl), and updating the new state to s_{t+1} := ⟨D^lab_{t+1}, D^unl_{t+1}, b_{t+1}, φ_{t+1}⟩. The halt_HI goal is set once the full AL annotation budget is exhausted; otherwise the retrain goal is set in the next time step.
The low-level MDP consists of primitive actions a_t ∈ D^unl_t ∪ {halt_LO}, corresponding to either selecting one of the monolingual sentences in D^unl_t, or halting the low-level policy and passing execution back to the high-level MDP. The halt action is performed once the maximum amount of source text has been chosen for the current AL round, at which point the oracle is asked for the translations of the source sentences in the monolingual batch, which is then replaced by the resulting bitext. The sentence selection action, on the other hand, forms the next state by adding the chosen monolingual sentence to the batch and removing it from the pool of monolingual sentences. The underlying NMT model is not trained as a result of taking an action in the low-level policy, and the reward function is constant zero.
A trajectory in our HMDP consists of σ := (s_1, g_1, τ_1, r_1, s_2, ..., s_H, g_H, r_H, s_{H+1}), which is the concatenation of the interleaved high-level trajectory τ_HI := (s_1, g_1, r_1, s_2, ..., s_{H+1}) and low-level trajectories τ_h := (s_1, a_1, s_2, a_2, ..., s_T, a_T, s_{T+1}). Clearly, the intermediate goals set by the top-level MDP in σ are retrain, and only the last goal g_H is halt_HI, where H is determined by checking whether the total AL budget B_HI is exhausted. Likewise, the intermediate actions in τ_h are sentence selections, and only the last action a_T is halt_LO, where T is determined by checking whether the per-round AL budget B_LO is exhausted.
We aim to find the optimal AL policy prescribing which data point needs to be queried in a given state to get the most benefit. The optimal policy is found by maximising the expected long-term reward, where the expectation is over the choice of the synthesised AL problems and other sources of randomness, i.e. the partitioning of D into D^lab, D^unl, and D^evl. Following Bachman et al. (2017), we maximise the sum of the rewards after each AL round to encourage anytime behaviour, i.e. the model should perform well after each batch query.

Deep Imitation Learning for AL-NMT
The question remains of how to train the policy network to maximise the reward, i.e. the generalisation performance of the underlying NMT model. As the policy for the high-level MDP is fixed, we only need to learn the optimal policy for the low-level MDP. We formulate learning the AL policy as an imitation learning problem. More concretely, the policy is trained using an algorithmic expert, which can generate reasonable AL trajectories (batches) for each AL state in the high-level MDP. The algorithmic expert's trajectories, i.e. sequences of AL states paired with the expert's actions in the low-level MDP, are then used to train the policy network. As such, the policy network is a classifier, conditioned on a context summarising both global and local histories, to choose the best sentence (action) among the candidates. After the AL policy is trained based on AL simulations, it is transferred to the real AL scenario.
For simplicity of presentation, the training algorithms are presented using a fixed number of AL iterations for the high-level and low-level MDPs. This corresponds to AL with the sentence-based budget. However, extending them for AL with token-based budget is straightforward, and we experiment with both versions in §5.
Policy Network's Architecture The policy scoring network is a fully-connected network with two hidden layers (see Figure 1). The input involves the representation for three elements: (i) global context which includes all previous AL batches, (ii) local context which summarises the previous sentences selected for the current AL batch, and (iii) the candidate sentence paired with its translation generated by the currently trained NMT model.
For each source sentence x paired with its translation y, we denote the representation by rep(x, y). We construct it by simply concatenating the representations of the source and target sentences, each of which is built by summing the embeddings of its words. We found this simple method to work well, compared to more complicated methods, e.g. taking the last hidden state of the decoder in the underlying NMT model. The global context (c_global) and local context (c_local) are constructed by summing the representations of the previously selected batches and sentence pairs, respectively.
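A minimal sketch of this sentence-pair representation, assuming word embeddings are stored in plain word-to-vector dictionaries (`sum_embed` and `rep` are our illustrative names, not the paper's code):

```python
def sum_embed(words, emb, dim):
    """Bag-of-words sentence vector: the sum of the word embeddings.
    Out-of-vocabulary words contribute a zero vector."""
    v = [0.0] * dim
    for w in words:
        for i, e in enumerate(emb.get(w, [0.0] * dim)):
            v[i] += e
    return v

def rep(src_words, tgt_words, src_emb, tgt_emb, dim):
    """rep(x, y): concatenation of the summed source and target embeddings."""
    return sum_embed(src_words, src_emb, dim) + sum_embed(tgt_words, tgt_emb, dim)
```

The global and local contexts would then be running sums of such `rep` vectors over previously selected batches and sentence pairs.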

IL-based Training Algorithm
The IL-based training method is presented in Algorithm 1. The policy network is initialised randomly, and trained based on T simulated AL problems (lines 3-20), by partitioning the available large bilingual corpus into three sets: (i) D^lab as the growing training bitext, (ii) D^unl as the pool of untranslated sentences, where we pretend the translations are not given, and (iii) D^evl as the evaluation set used by our algorithmic expert.
For each simulated AL problem, Algorithm 1 executes T_HI iterations (lines 7-19) to collect AL batches for training the underlying NMT model and the policy network. An AL batch is obtained either from the policy network (line 15) or from the algorithmic expert (lines 10-13), depending on tossing a coin (line 9). The latter also includes adding the selected batch, the candidate batches, and the relevant state information to the replay memory M, based on which the policy will be retrained. The selected batch is then used to retrain the underlying NMT model, update the training bilingual corpus and the pool of monotext, and update the global context vector (lines 16-19). The mixture of the policy network and the algorithmic expert in batch collection on simulated AL problems is inspired by Dataset Aggregation, DAGGER (Ross and Bagnell, 2014). This makes sure that the state-action pairs collected in the replay memory include situations encountered beyond executing only the algorithmic expert. This informs the trained AL policy how to act reasonably in the new situations encountered at test time, where only the network policy is in charge and the expert does not exist.

Algorithm 1 Learning AL-NMT Policy
Input: parallel corpus D, width I_width of the constructed search lattices, coin parameter α, number of sampled AL batches K
Output: policy π
 1: M ← ∅  ▷ replay memory
 2: initialise π with a random policy
 3: for T training iterations do
      ...
 9:   if coinToss(α) = Head then
10:     ...  ▷ lines 10-13: batch from the algorithmic expert; add states to M
      else
15:     ...  ▷ batch from the policy network
16:     ...  ▷ lines 16-19: retrain the NMT model; update D^lab, D^unl, c_global
20:   π ← updatePolicy(π, M, φ)
21: return π

Algorithm 2 samplePath (selecting an AL batch)
      ...
 4:   if coinToss(β) = Head then
 5:     x_t ← π_0(S[t])  ▷ perturbation policy
 6:   else
 7:     x_t ← arg max_{x ∈ S[t]} π(c_global, c_local, x)
 8:   y_t ← oracle(x_t)  ▷ getting the gold translation
 9:   c_local ← c_local ⊕ rep(x_t, y_t)
      ...
Algorithmic Expert At a given AL state, the algorithmic expert selects a reasonable batch from the pool D^unl via

b* = arg max_{b ∈ B} BLEU(m^b_φ, D^evl),   (1)

where m^b_φ denotes the underlying NMT model φ further retrained by incorporating the batch b, and B denotes the possible batches from D^unl. However, the number of possible batches is exponential in the size of D^unl, hence the above optimisation procedure would be very slow even for a moderately-sized pool.
We construct a search lattice S from which the candidate batches in B are sampled (see Figure 2). The search lattice is constructed by sampling a fixed number I_width of candidate sentences from D^unl for each position in a batch, whose size is T_LO. A candidate AL batch is then selected using Algorithm 2. It executes a mixture of the current AL policy π and a perturbation policy π_0 (e.g. random sentence selection or any other heuristic) in the low-level MDP to sample a batch. After several such batches are sampled to form B, the best one is selected according to eqn 1.
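The mixture sampling of Algorithm 2 can be sketched roughly as follows. This is an illustrative sketch under our own naming (`sample_path`, `policy_score`, `perturb`); in the paper the policy score comes from the scoring network conditioned on the global and local contexts.

```python
import random

def sample_path(lattice, policy_score, perturb, beta, oracle, rng=random):
    """Sample one candidate AL batch from the search lattice.
    lattice[t] holds the I_width candidate sentences for batch position t."""
    batch, local_ctx = [], []
    for slot in lattice:
        if rng.random() < beta:
            x = perturb(slot)                # perturbation policy pi_0
        else:
            # greedy choice under the current policy, given the local context
            x = max(slot, key=lambda s: policy_score(local_ctx, s))
        y = oracle(x)                        # reveal the gold translation
        batch.append((x, y))
        local_ctx.append((x, y))             # extend the local context
    return batch
```

Sampling several such paths and keeping the best under the expert's BLEU criterion yields the candidate set B of the previous paragraph.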
We have carefully designed the search space to be able to incorporate the current policy's recommended batch, and sampled deviations from it, in B. This is inspired by the LOLS (Locally Optimal Learning to Search) algorithm (Chang et al., 2015): invest effort in the neighbourhood of the current policy and improve it. Moreover, having to deal with only I_width sentences at each selection stage makes the policy-based batch formation algorithm fast and efficient.
Re-training the Policy Network To train our policy network, we turn sentence preference scores into probabilities over the candidate batches. More specifically, let (c_global, B, b) be a training tuple in the replay memory. We define the probability of the correct action/batch as

P(b | c_global, B) = exp(score(b)) / Σ_{b′ ∈ B} exp(score(b′)), with score(b) = Σ_t π(c_global, c_local<t, x_t).

That is, the preference score for a batch is the sum of its sentences' preference scores, where c_local<t denotes the local context up to sentence t in the batch.
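The conversion from per-sentence preference scores to batch probabilities is a standard softmax over summed scores; a minimal sketch (function names are ours):

```python
import math

def batch_probabilities(candidates, sentence_score):
    """Distribution over candidate batches: a batch's score is the sum of
    its sentences' preference scores, normalised with a softmax over the
    candidate set."""
    totals = [sum(sentence_score(t, x) for t, x in enumerate(b)) for b in candidates]
    m = max(totals)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in totals]
    z = sum(exps)
    return [e / z for e in exps]
```

Maximising the log of the correct batch's probability then gives the training objective for the policy network.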
To form the log-likelihood, we use recent tuples and randomly sample several older ones from the replay memory. We then use stochastic gradient descent (SGD) to maximise the training objective, where the gradients of the network parameters are calculated using the backpropagation algorithm.
Transferring the Policy We now apply the policy learnt on the source language pair to AL in the target task (see Algorithm 3). To enable transferring the policy to a new language pair, we make use of pre-trained multilingual word embeddings. In our experiments, we either use the pre-trained word embeddings from Ammar et al. (2016) or build them based on the available bitext and monotext in the source and target languages (c.f. §5.2). To retrain our NMT model, we make parameter updates based on the mini-batches from the AL batch as well as sampled mini-batches from the previous iterations.

Experiments
Datasets Our experiments use the following language pairs in the news domain based on WMT2018: English-Czech (EN-CS), English-German (EN-DE), English-Finnish (EN-FI). For AL evaluation, we randomly sample 500K sentence pairs from the parallel corpora in WMT2018 for each of the three language pairs, and take 100K as the initially available bitext and the rest of 400K as the pool of untranslated sentences, pretending the translation is not available. During the AL iterations, the translation is revealed for the queried source sentences in order to retrain the underlying NMT model.
For pre-processing the text, we normalise the punctuation and tokenise using Moses scripts. The trained models are evaluated using BLEU on tokenised and case-sensitive test data from newstest2017.
NMT Model Our baseline model consists of a 2-layer bi-directional LSTM encoder with an embedding size of 512 and a hidden size of 512. The 1-layer LSTM decoder with 512 hidden units uses an attention network with 128 hidden units. We use a multiplicative-style attention architecture (Luong et al., 2015). The model is optimised using Adam (Kingma and Ba, 2014) with a learning rate of 0.0001, and the dropout rate is set to 0.3. We set the mini-batch size to 200 and the maximum sentence length to 50. We train the base NMT models for 5 epochs on the initially available bitext, as the perplexity on the dev set does not improve with more training epochs. After getting new translated text in each AL iteration, we further sample 5× more bilingual sentences from the previously available bitext, and make one pass over this data to re-train the underlying NMT model. For decoding, we use beam search with a beam size of 3.

Selection Strategies
We compare our policy-based sentence selection for AL-NMT with the following heuristics: • Random We randomly select monolingual sentences up to the AL budget.
• Length-based We select the shortest/longest monolingual sentences up to the AL budget.
• Total Token Entropy (TTE) We sort monolingual sentences based on their TTE, which has been shown to be a strong AL heuristic for sequence-prediction tasks (Settles and Craven, 2008). Given a monolingual sentence x, we compute the TTE as Σ_{i=1}^{|ŷ|} Entropy[P_i(·|ŷ_{<i}, x, φ)], where ŷ is the decoded translation based on the current underlying NMT model φ, and P_i(·|ŷ_{<i}, x, φ) is the distribution over the vocabulary words for position i of the translation, given the source sentence and the previously generated words. We also experimented with the normalised version of this measure, i.e. dividing TTE by |ŷ|, and found the difference negligible; so we only report TTE results.
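The TTE measure can be computed directly from the decoder's per-position output distributions; a minimal sketch (function names are ours):

```python
import math

def entropy(dist):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def total_token_entropy(token_dists):
    """TTE of a decoded translation: the sum of per-position entropies of
    the model's output distributions P_i(. | y_<i, x, phi)."""
    return sum(entropy(d) for d in token_dists)
```

Sorting the pool by this score and taking the top sentences up to the budget gives the TTE baseline.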

Translating from English
Setting We train the AL policy on a language pair treating it as high-resource, and apply it to another language pair treated as low-resource. To transfer the policies across languages, we make use of pre-trained multilingual word embeddings learned from monolingual text and bilingual dictionaries (Ammar et al., 2016). Furthermore, we use these cross-lingual word embeddings to initialise the embedding table of the NMT model in the low-resource language pair. The source and target vocabularies for the NMT model in the low-resource scenario are constructed using the initially available 100K bitext, and are expanded during the AL iterations as more translated text becomes available.
Results Table 1 shows the results. The experiments are performed with two limits on the token annotation budget: 135K and 677K, corresponding to selecting roughly 10K and 50K sentences in total in AL, respectively. The number of AL iterations is 50, hence the token annotation budget for each round is 2.7K and 13.5K. As we can see, our policy-based AL method is very effective, and outperforms the strong AL baselines in all cases except when transferring the policy trained on EN → FI to EN → CS, where it is on par with the best baseline. In Table 1, we have taken the number of tokens in the selected sentences as a proxy for the annotation cost. Another option to measure the annotation cost is the number of selected sentences, which admittedly is not the best proxy. Nonetheless, one may be interested to see how different AL methods compare against each other based on this cost measure. Table 2 shows the results based on the sentence-based annotation cost. We train a policy on EN → CS, and apply it to the EN → DE and EN → FI translation tasks. In addition to the token-based AL policy from Table 1, we train another policy based on the sentence budget. The token-based policy is competitive in EN → DE, where the longest-sentence heuristic achieves the best performance, presumably due to the enormous training signal obtained by translating long sentences. The token-based policy is on par with the longest-sentence heuristic in EN → FI for both 10K and 100K AL budgets, outperforming the other methods.

Translating into English
Setting We investigate the performance of the AL methods on DE → EN based on the policies trained on the other language pairs. In addition to the 100K training data condition, we assess the effectiveness of the AL methods in an extremely low-resource condition consisting of only 10K bilingual sentences as the initial bitext.
In addition to the source word embedding table, which we initialised in the previous section's experiments using the cross-lingual word embeddings, we are further able to initialise all of the other NMT parameters for DE → EN translation. This includes the target word embedding table and the decoder softmax, as the target language is the same (EN) in the language-pairs used for both policy training and policy testing. We refer to this setting as warm-start, as opposed to cold-start, in which we only initialise the source embedding table with the cross-lingual embeddings. For the warm-start experiments, we transfer the NMT model trained on the 500K CS-EN bitext, based on which the policy is trained. We use byte-pair encoding (BPE) (Sennrich et al., 2015b) with 30K operations to segment the EN side. For the source side, we use words in order to use the cross-lingual word embeddings. All parameters of the transferred NMT model are frozen, except those corresponding to the bidirectional RNN encoder and the source word embedding table.
To make this experimental condition as realistic as possible, we learn the cross-lingual word embedding for DE using large amounts of monolingual text and the initially available bitext, assuming a multilingual word embedding already exists for the languages used in the policy training phase. More concretely, we sample 5M DE sentences from the WMT2018 data, and train monolingual word embeddings as part of a skip-gram language model using fastText. We then create a bilingual EN-DE word dictionary based on the initially available bitext (either 100K or 10K) using word alignments generated by fast_align. The bilingual dictionary is used to project the monolingual DE word embedding space into that of EN, hence aligning the spaces through the following orthogonal projection:

Q* = arg min_{Q: Q^T Q = I} Σ_{(y_i, x_i) ∈ Dict} || e[y_i]^T Q − e[x_i] ||²,

where Dict is the bilingual dictionary consisting of pairs of DE-EN words, e[y_i] and e[x_i] are the embeddings of the DE and EN words, and Q is the orthogonal transformation matrix aligning the two embedding spaces. We solve the above optimisation problem using SVD, as in Smith et al. (2017). The cross-lingual word embedding for a DE word y is then e[y]^T · Q. We build two such cross-lingual embeddings based on the two bilingual dictionaries constructed from the 10K and 100K bitext, in order to use them in the corresponding experiments.
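The SVD-based solution of Smith et al. (2017) is an orthogonal Procrustes problem. A minimal numpy sketch, under the assumption that the dictionary-pair embeddings are stacked row-wise into matrices (one row per dictionary entry; `align_embeddings` is our name):

```python
import numpy as np

def align_embeddings(src_vecs, tgt_vecs):
    """Orthogonal map Q minimising ||X Q - Y||_F subject to Q^T Q = I,
    where row i of X (src) and Y (tgt) are a dictionary pair. The solution
    is Q = U V^T from the SVD of X^T Y."""
    m = src_vecs.T @ tgt_vecs
    u, _, vt = np.linalg.svd(m)
    return u @ vt
```

Projecting any source-language word vector as `vec @ Q` then places it in the target embedding space, which is how the DE embeddings are mapped into the EN space here.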
Results Table 3 presents the results for two conditions of 100K and 10K initial bilingual sentences. For each of these data conditions, we experiment with both cold-start and warm-start settings, using the pre-trained multilingual word embeddings from Ammar et al. (2016) or those we have trained with the available bitext plus additional monotext. Firstly, the warm-start strategy of transferring the NMT system from CS → EN to DE → EN is very effective, particularly in the extremely low-resource condition of 10K bilingual sentence pairs. It is worth noting that our multilingual word embeddings are very effective, even though they are trained using small bitext. Secondly, our policy-based AL methods are more effective than the baseline methods and lead to up to +1 BLEU score improvements.
We further take an ensemble of multiple trained policies to build a new AL query strategy. In the ensemble, we rank the sentences according to each of the policies, and then produce a final ranking by combining these rankings. Specifically, we sum the rank of each sentence under each policy to get a rank score, and re-rank the sentences according to their rank scores. Table 3 shows that ensembling is helpful, but does not produce significant improvements over the best single policy.
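The rank-sum combination above can be sketched in a few lines (a minimal illustration; `ensemble_rank` is our name and each policy is abstracted as a scoring function):

```python
def ensemble_rank(sentences, policies):
    """Ensemble of AL policies by summed ranks: each policy ranks the
    sentences by its preference score (higher is better); sentences are
    then re-ranked by the sum of their per-policy ranks, lower totals
    being preferred."""
    totals = {s: 0 for s in sentences}
    for score in policies:
        ranked = sorted(sentences, key=score, reverse=True)
        for rank, s in enumerate(ranked):
            totals[s] += rank
    return sorted(sentences, key=lambda s: totals[s])
```

The top of the combined ranking is then queried up to the AL budget, exactly as with a single policy.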

Analysis
Distribution of word frequency TTE is a competitive heuristic-based strategy, as shown in the above experiments. We compare the word frequency distributions of the selected source text returned by Random and TTE against our AL policy. The policy we use here is π_CS→EN, applied to the DE→EN task in the warm-start scenario with 100K initial bitext and a 677K token budget. Figure 3 shows the corresponding log-log plot of the fraction of vocabulary words (y axis) having a particular frequency (x axis) in the data selected by the different strategies. Our AL policy is less likely to select high-frequency words than the other two methods when given a fixed token budget.
Weighted combination of heuristics To gain intuition about which heuristics our AL policy resorts to, we again use the policy π_CS→EN applied to the DE→EN task, in the warm-start scenario with 100K initial bitext and a 677K token budget. We collect the preference scores the policy assigns to the sentences in the monolingual set. Then, we fit a linear regression model on the sentences and their scores, in which the response variable is the preference score and the predictor variables are features, or heuristics, extracted from the sentences. The extracted features are (length, TTE, f_0, f_1, f_2, f_3+), where f_i is the fraction of words in the sentence that appear i times in the bitext. Table 4 shows the coefficients of these heuristics, their standard errors (SE), and t values. We can see that our AL policy weighs length and TTE similarly, as they have a close range of coefficients; the policy also prefers low-frequency over high-frequency words.
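The regression analysis amounts to an ordinary least squares fit of the policy's scores on the heuristic features. A minimal numpy sketch with hypothetical inputs (`fit_heuristic_weights` is our name; the paper additionally reports standard errors and t values):

```python
import numpy as np

def fit_heuristic_weights(features, scores):
    """OLS regression of policy preference scores on per-sentence heuristic
    features (e.g. length, TTE, f_0..f_3+). The fitted coefficients indicate
    which heuristics the learned policy implicitly weighs."""
    X = np.column_stack([np.ones(len(features)), features])  # add intercept
    coef, *_ = np.linalg.lstsq(X, np.asarray(scores), rcond=None)
    return coef  # [intercept, w_feature_1, w_feature_2, ...]
```

Comparing the magnitudes (and signs) of the returned coefficients is what supports the length-vs-TTE and word-frequency observations above.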

Related Work
For statistical MT (SMT), active learning is well explored, and several heuristics for query sentence selection have been proposed, including the entropy over the potential translations (uncertainty sampling), query by committee, and a similarity-based sentence selection method. However, active learning is largely under-explored for NMT. The goal of this paper is to provide an approach to learn an active learning strategy for NMT, based on a hierarchical Markov decision process (HMDP) formulation of pool-based AL (Bachman et al., 2017). Exploiting monolingual data for NMT Monolingual data play a key role in neural machine translation systems. Previous work has considered training a separate language model on the target side (Jean et al., 2014; Gulcehre et al., 2015; Domhan and Hieber, 2017). Rather than using an explicit language model, Cheng et al. (2016) introduced an auto-encoder-based approach, in which the source-to-target and target-to-source translation models act as encoder and decoder, respectively. Moreover, back-translation approaches (Sennrich et al., 2015a; Zhang et al., 2018; Hoang et al., 2018) show efficient use of monolingual data to improve neural machine translation. Dual learning extends back translation using a deep RL approach. More recently, unsupervised approaches (Lample et al., 2017b; Artetxe et al., 2017) and phrase-based NMT (Lample et al., 2018) learn to translate with access to only large amounts of monolingual corpora; these models also extend the use of back translation, and cross-lingual word embeddings provide the latent semantic space for sentences from monolingual corpora in different languages.
Meta-AL learning Several meta-AL approaches have been proposed to learn the AL selection strategy automatically from data. These methods rely on deep reinforcement learning frameworks (Yue et al., 2012; Wirth et al., 2017) or bandit algorithms (Nguyen et al., 2017). Bachman et al. (2017) introduced a policy gradient based method which jointly learns the data representation, the selection heuristic, and the model prediction function. Fang et al. (2017) designed an active learning algorithm based on a deep Q-network, in which the action corresponds to binary annotation decisions applied to a stream of data. Woodward and Finn (2017) extended one-shot learning to active learning and combined reinforcement learning with a deep recurrent model to make labelling decisions. As far as we know, we are the first to develop a meta-AL method that makes use of monolingual data for neural machine translation; the method we propose in this paper can be applied at the mini-batch level and in cross-lingual settings.

Conclusion
We have introduced an effective approach for learning active learning policies for NMT, where the learner needs to make batch queries. We have provided a hierarchical MDP formulation of the problem, and proposed a policy network structure capturing the context at both MDP levels. Our policy training method uses imitation learning and a search lattice to carefully collect AL trajectories for further improvement of the current policy.
We have provided experimental results on three language pairs, where the policies are transferred across languages using multilingual word embeddings. Our experiments confirm that our method is more effective than strong heuristic-based methods in various conditions, including cold-start and warm-start as well as small and extremely small data conditions.