Improving Text Generation with Student-Forcing Optimal Transport

Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. An extension is further proposed to improve the OT learning, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.


Introduction
Natural language generation is an essential component of many NLP applications, such as machine translation (Bahdanau et al., 2015), image captioning (You et al., 2016), text summarization (See et al., 2017), dialogue systems (Vinyals and Le, 2015), and machine comprehension (Nguyen et al., 2016). Generating human-like natural language is typically cast as predicting a sequence of consecutive words in a recurrent manner. Maximum likelihood estimation (MLE) is commonly employed as the objective to train such text-generation models, maximizing the log-likelihood of producing the ground-truth tokens within a sentence or paragraph (Salakhutdinov, 2015). In Recurrent Neural Network (RNN) models, this is also known as Teacher-Forcing (TF) (Williams and Zipser, 1989), due to the use of ground-truth tokens for next-token prediction.
However, in the maximum likelihood paradigm, previously observed tokens are usually provided during training, giving rise to an issue termed exposure bias (Bengio et al., 2015): the ground-truth tokens seen by the model during training are not available at inference time. During inference, the model is required to use outputs from the last step instead of the unseen ground-truth, which is often referred to as Student-Forcing (SF). As a result, there is a discrepancy between training and inference, accumulating errors along the sequence-generation trajectory (Ranzato et al., 2016a).
This challenge has been addressed by incorporating model-generated text at training time. Bengio et al. (2015) proposed scheduled sampling (SS), where the training samples are systematically tampered with by replacing some of the ground-truth tokens with model-predicted tokens. Other works regard text generation as a sequential decision-making problem, applying reinforcement learning (RL) techniques (Ranzato et al., 2016b; Bahdanau et al., 2017). In particular, quantitative evaluation metrics such as BLEU and ROUGE are used as sequence-level rewards for model-generated texts.
Despite the encouraging results reported, concerns have been raised w.r.t. the above strategies. Scheduled sampling is known to be statistically inconsistent and fails to address the fundamental issues (Huszár, 2015). RL solutions usually suffer from slow and unstable training due to the high variance of REINFORCE-based policy gradients; consequently, specific training techniques are often needed (Rennie et al., 2017; Liu et al., 2018). Additionally, the results from RL models often do not correlate well with human evaluations, as the rewards used are typically biased towards specific aspects of a language model (Wang et al., 2018).
On the other hand, recent developments in likelihood-free modeling techniques, most prominently generative adversarial networks (GANs), tackle exposure bias in a more principled fashion (Lamb et al., 2016; Yu et al., 2017; Zhang et al., 2017; Li et al., 2017). GAN-based text models, however, suffer from a number of severe difficulties, including mode collapse, where generated text looks real but lacks necessary diversity (Zhu et al., 2018; Caccia et al., 2018). Additionally, the training of GAN-based models is often unstable, and model learning easily breaks down in the event of vanishing or exploding gradients (Arjovsky et al., 2017; Zhang et al., 2017). Therefore, existing adversarial methods may not be able to match sentences generated by student-forcing with ground-truth sentences.
To mitigate the challenges of adversarial methods, we utilize a sequence-matching loss based on Optimal Transport (OT), which avoids a neural discriminator and delivers a smoother gradient for the generator. Recently, Chen et al. (2019) leveraged an OT loss based on a Teacher-Forcing scheme; however, it degenerates to word-level matching, making it difficult to capture temporal semantic information. In this work, in order to enable sequence-level matching, we instead propose an OT-based sequence-level training scheme to directly optimize the discrepancy loss between the ground-truth and free-running text samples. Further, we introduce various OT cost functions for loss calculation. The significance of this work is two-fold: firstly, this approach alleviates exposure bias, boosting model performance at the inference stage by using a sequence-level objective between the free-running output and the reference. Secondly, with the use of OT, this approach provides a direct objective that is easy and robust to optimize, without biasing towards a specific, manually-defined metric.
Our work provides the following contributions: i) We introduce a novel method for text generation called Student-Forcing OT (SFOT), leveraging an OT loss to improve long-term sequence sampling. ii) A new context-preserving OT approach is proposed to effectively match text sequences with order information. iii) We examine the necessity of integrating OT with Student-Forcing via an imitation-learning interpretation. iv) The robustness of the proposed models is demonstrated by extensive empirical evaluations on Neural Machine Translation (NMT), Text Summarization, and Natural Language Generation (NLG).

Student Forcing Optimal Transport
To reduce exposure bias, the output sequences of the generator in the teacher-forcing (training) and student-forcing (inference) stages should be indistinguishable. Therefore, we propose to use an OT loss to measure the sequence-matching distance between the two stages, in conjunction with the maximum likelihood estimate.

Maximum Likelihood Estimate
We denote the training dataset as $N$ sequence pairs $\mathcal{D} = \{x^n, y^n\}_{n=1}^{N}$, with output sequence $x = [x_1, \cdots, x_T]$ and input sequence $y = [y_1, \cdots, y_T]$. Depending on the specific task, $y$ may have different definitions. For seq2seq models like neural machine translation, $y$ represents the source sequence, conditioned on which the target sequence $x$ is generated. For language modeling tasks, $y$ is empty, and $x$ becomes the unconditionally generated sequence.
To generate a text sequence, neural language models (Mikolov et al., 2010) generate every token $x_t$ conditioned on the previous tokens in an auto-regressive manner:
$$p_\theta(x \,|\, y) = \prod_{t=1}^{T} p_\theta(x_t \,|\, x_{<t}, y), \quad (1)$$
where $\theta$ are the model parameters, and $x_{<t}$ indicates all tokens before $t$. Learning $\theta$ is often performed with maximum likelihood estimation (MLE):
$$\max_\theta \; \mathcal{L}_{\text{MLE}}(\theta) = \sum_{n=1}^{N} \log p_\theta(x^n \,|\, y^n). \quad (2)$$
To facilitate MLE training, the teacher-forcing scheme is considered: the word tokens $x_{<t}$ from the ground-truth sequence are fed into (1) to generate the next token,
$$\hat{x}_t \sim p_\theta(\cdot \,|\, x_{<t}, y). \quad (3)$$
Learned neural language models are often evaluated using the student-forcing scheme, in which the previously generated word tokens of the model are conditioned upon to generate the next token,
$$\tilde{x}_t \sim p_\theta(\cdot \,|\, \tilde{x}_{<t}, y). \quad (4)$$
The difference between (3) and (4) reveals a gap between training and evaluation in the MLE method. To reduce the gap, a natural idea is to define a tractable function to measure the discrepancy between ground-truth and SF-generated text sequences, to regularize TF-based MLE learning.

[Figure 1: Illustration of the two decoding schemes, where $s$ is the start token. In (a), the ground-truth (GT) sequence $x$ is compared with the TF-generated sequence $\hat{x}$ to produce the cost matrix $C_T$; in (b), the GT sequence $x$ is compared with the SF-generated sequence $\tilde{x}$ to produce the cost matrix $C_S$.]
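The gap between the teacher-forcing prediction (3) and the student-forcing rollout (4) can be made concrete with a toy autoregressive model. The sketch below is purely illustrative (the bigram model, vocabulary, and greedy decoding rule are our own assumptions, not the paper's setup): under teacher-forcing each prediction is conditioned on the true prefix, while under student-forcing one early deviation changes every subsequent conditioning context.

```python
# Toy next-token model: a bigram table p(next | prev), decoded greedily.
# Hypothetical probabilities chosen so an early deviation compounds.
MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a":   {"dog": 0.8, "cat": 0.2},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"ran": 0.9, "sat": 0.1},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def greedy_next(prev):
    return max(MODEL[prev], key=MODEL[prev].get)

def teacher_forcing(ground_truth):
    # Eq. (3): every prediction is conditioned on the ground-truth prefix.
    return [greedy_next(prev) for prev in ["<s>"] + ground_truth[:-1]]

def student_forcing(max_len=5):
    # Eq. (4): every prediction is conditioned on the model's own output.
    seq, prev = [], "<s>"
    for _ in range(max_len):
        nxt = greedy_next(prev)
        seq.append(nxt)
        if nxt == "</s>":
            break
        prev = nxt
    return seq

gt = ["a", "cat", "sat", "</s>"]
print(teacher_forcing(gt))  # per-step predictions given the true prefix
print(student_forcing())    # free-running rollout diverges from gt at step 1
```

Under teacher-forcing, the step-1 mistake ("the" instead of "a") does not affect later steps; under student-forcing, the same mistake changes every subsequent conditioning context, which is the error accumulation SFOT targets.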

Student Forcing Optimal Transport
We propose to use optimal transport (OT) to measure the discrepancy between the student-forcing generated sequence $\tilde{x}$ and the ground-truth sequence $x$. Assuming there is an oracle/target distribution $\mu(x)$ that generates $x$, our goal is to learn $\theta$ such that the model distribution $p_\theta(\tilde{x})$ matches $\mu(x)$. Formally, OT provides a distance metric between the two probability measures $\mu$ and $p$ on a domain $\mathbb{X}$ (the sequences of word tokens):
$$\mathcal{D}_c(\mu, p) = \inf_{\gamma \in \Pi(\mu, p)} \mathbb{E}_{(x, \tilde{x}) \sim \gamma}\big[ c(x, \tilde{x}) \big], \quad (5)$$
where $\Pi(\mu, p)$ denotes the set of all joint distributions $\gamma(x, \tilde{x})$ with marginals $\mu(x)$ and $p(\tilde{x})$. The function $c(x, \tilde{x}): \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ defines the cost of moving $x$ to $\tilde{x}$. Intuitively, OT provides a method of matching the sequence $\tilde{x}$ to $x$ with the minimum cost, given $\mu$, $p$ and $c(\cdot, \cdot)$.
OT distance on discrete domains. For discrete distributions $\mu, p$ on $\mathbb{X}$, we have $\mu = \sum_{i=1}^{T} u_i \delta_{x_i}$ and $p = \sum_{j=1}^{T} p_j \delta_{\tilde{x}_j}$, with $\delta_x$ the Dirac function centered on $x$. The weight vectors $u = \{u_i\}_{i=1}^{T} \in \Delta_T$ and $p = \{p_j\}_{j=1}^{T} \in \Delta_T$ belong to the $T$-dimensional simplex, i.e., $\sum_{i=1}^{T} u_i = \sum_{j=1}^{T} p_j = 1$, as both $\mu$ and $p$ are probability distributions. Under such a setting, computing the OT distance is equivalent to solving the following network-flow problem (Luise et al., 2018):
$$\mathcal{D}_c(\mu, p) = \min_{M \in \Pi(u, p)} \langle M, C \rangle, \quad (6)$$
where $\Pi(u, p) = \{M \in \mathbb{R}_+^{T \times T} \,|\, M \mathbf{1}_T = u,\; M^\top \mathbf{1}_T = p\}$, $\mathbf{1}_T$ denotes a $T$-dimensional all-one vector, $C$ is the cost matrix given by $C_{ij} = c(x_i, \tilde{x}_j)$, and $\langle M, C \rangle = \mathrm{Tr}(M^\top C)$ represents the Frobenius dot-product. We refer to the minimizer $M^\ast$ of (6) as the OT matching.
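For intuition, the network-flow problem in (6) can be solved exactly on a tiny example. With uniform marginals over supports of equal size $T$, Birkhoff's theorem guarantees an optimal plan at a (scaled) permutation matrix, so brute-force search over permutations recovers the OT distance. The cost matrix below is a hand-picked illustration of ours, not data from the paper:

```python
import itertools

# Hand-picked 3x3 cost matrix C[i][j] = c(x_i, x~_j) (illustrative values).
C = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
T = len(C)

# With uniform marginals u = p = (1/T, ..., 1/T), it suffices to search
# over permutation plans M whose entries lie in {0, 1/T}.
best_cost, best_perm = min(
    (sum(C[i][perm[i]] for i in range(T)) / T, perm)
    for perm in itertools.permutations(range(T))
)
print(best_cost, best_perm)  # the identity matching is optimal here
```

This exhaustive search is exponential in $T$ and only serves to ground the definition; the paper approximates $M^\ast$ with IPOT instead.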
Summarizing, our student-forcing optimal transport (SFOT) objective is:
$$\mathcal{L}_{\text{SFOT}} = \mathcal{L}_{\text{MLE}} + \lambda \, \mathcal{R}_{\text{ot}}(x, \tilde{x}), \quad (7)$$
where $\lambda$ is the weighting hyper-parameter that balances the MLE and OT losses. In practice, we only take one sample in student-forcing. Note that our SFOT objective considers $\mathcal{R}_{\text{ot}}(x, \tilde{x})$. This is a key difference from Chen et al. (2019), where teacher-forcing is used in the OT term $\mathcal{R}_{\text{ot}}(x, \hat{x})$. To note the difference, we refer to the method in Chen et al. (2019) as TFOT.
The exact minimization over $M$ is generally computationally intractable (Arjovsky et al., 2017; Genevay et al., 2018). Hence we use the recently introduced Inexact Proximal point method for Optimal Transport (IPOT) (Xie et al., 2018) to approximate $M^\ast$. The details of the IPOT algorithm are shown in Appendix A.1.
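A minimal NumPy sketch of the IPOT iteration (following Xie et al., 2018) is given below. The proximal weight `beta` and the iteration counts are our own illustrative choices; the paper's actual implementation (Appendix A.1) may differ in these details.

```python
import numpy as np

def ipot(C, beta=1.0, n_outer=100, n_inner=1):
    """Approximate the OT plan M* of Eq. (6) for cost matrix C with
    uniform marginals, via inexact proximal-point iterations."""
    n, m = C.shape
    u = np.ones(n) / n            # marginal over ground-truth tokens
    p = np.ones(m) / m            # marginal over generated tokens
    G = np.exp(-C / beta)         # proximal (Gibbs) kernel
    M = np.ones((n, m)) / (n * m) # initial plan
    b = np.ones(m) / m
    for _ in range(n_outer):
        Q = G * M                 # elementwise product: proximal step
        for _ in range(n_inner):
            a = u / (Q @ b)       # Sinkhorn-style scaling updates
            b = p / (Q.T @ a)
        M = a[:, None] * Q * b[None, :]
    return M

C = np.array([[0.0, 1.0], [1.0, 0.0]])
M = ipot(C)
print(M)              # mass concentrates on the cheap diagonal
print(np.sum(M * C))  # approximate OT distance <M, C>, near 0 here
```

Unlike plain entropic (Sinkhorn) regularization, the proximal-point outer loop lets the plan converge toward the unregularized optimum, which is why the transported cost above approaches zero.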

Cost Functions in SFOT
OT-matching quality largely depends on the cost function $c(\cdot, \cdot)$. In particular, there is flexibility in how we represent the elements to be transported, for which we outline two alternatives below.
Vanilla OT A natural choice for the cost is a distance on the word embeddings, denoted $\{h_t^0\}_{t=1}^{T}$, as used by previous works:
$$c(x_i, \tilde{x}_j) = 1 - \frac{{h_i^0}^\top \tilde{h}_j^0}{\|h_i^0\|_2 \, \|\tilde{h}_j^0\|_2}. \quad (8)$$
However, a word-embedding-based cost function only captures token-to-token similarity. On the other hand, the semantics of a word can differ across positions and contexts. This inspires the proposal of two novel cost functions to improve text sequence matching in OT.

Contextualized OT with Order-Preserving Regularizer Because the same word in different linguistic contexts may have different meanings, a cost function that cannot capture such variability may lead to undesirable matching results. While word embeddings $\{h_t^0\}_{t=1}^{T}$ may be myopic, hidden representations $\{h_t^\ell\}_{t=1}^{T}$ at higher layers ($\ell > 0$) of deep language models (e.g., LSTM (Hochreiter and Schmidhuber, 1997) or the Transformer (Vaswani et al., 2017)) often capture contextualized representations of the words in the sequence. Inspired by work on deep contextualized word representations (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018), we replace the word embeddings with $\{h_t^\ell\}_{t=1}^{T}$ to represent the meaning of words inside the sequence, and define the cost function as in (8) over these contextualized features. Figure 2 shows that the meaning of letter {A} can be less ambiguous when considering its local context {B, A} and {C, A}. This context information further helps eliminate undesirable matching configurations; otherwise, letter A can match to an A at any position in the other sequence. Note that contextual information may implicitly capture relatively long-term dependency information compared with vanilla OT, and it can be perceived as a "soft" n-gram matching.
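Under our reading of the cost in (8), both variants reduce to a pairwise cosine-distance matrix over feature vectors: word embeddings for vanilla OT, higher-layer hidden states for contextualized OT. A NumPy sketch with random stand-in features (the feature values are placeholders, not model outputs):

```python
import numpy as np

def cosine_cost_matrix(H, H_tilde, eps=1e-8):
    """C[i, j] = 1 - cos(h_i, h~_j) for feature rows H (T x d) and
    H_tilde (T' x d): ground-truth vs. student-forcing features."""
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + eps)
    Gn = H_tilde / (np.linalg.norm(H_tilde, axis=1, keepdims=True) + eps)
    return 1.0 - Hn @ Gn.T

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))      # stand-in for {h_t} of the reference
C = cosine_cost_matrix(H, H)     # matching a sequence against itself
print(np.round(np.diag(C), 6))   # zero cost on the diagonal
```

Swapping the embedding matrix for layer-$\ell$ hidden states changes only the input `H`, which is what makes the contextualized variant a drop-in replacement.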
We also consider an order-preserving regularizer for the contextualized OT, motivated by the fact that the positional information of a token can be crucial in natural language understanding. For example, two sentences may have opposite meanings when the word order is changed: "He hated it but then started loving it" versus "He loved it but then started hating it". Hence, it is desirable to have the transport matrix concentrate on diagonal entries, transporting neighboring elements in one sequence to elements at nearby temporal positions in the other sequence (Su and Hua, 2018). Inspired by Albregtsen et al. (2008), we penalize the contextual cost function with the inverse difference moment as
$$\hat{C}_{ij} = C_{ij} + \beta \left(1 - \frac{1}{1 + (i - j)^2}\right), \quad (10)$$
where $\beta \ge 0$ is the weighting hyper-parameter for the order-preserving penalty. Figure 2 shows that when only considering token-to-token similarity, the first letter A of the left sequence can be moved to any A in the right sequence; when the ordering penalty is considered, letter A is instead aligned to the letters at similar positions.

SFOT is summarized in Algorithm 1. Note that the order-preserving regularization is only applied to contextualized OT, as requiring position-wise matching between the generated and target sentences may be too restrictive. Empirically, we observed that adding the order-preserving regularizer to vanilla OT gives only marginal improvement. However, since contextualized OT operates on a feature space, the position-wise matching is softened and can be naturally coupled with the contextual cost function.

Algorithm 1: Student-Forcing Optimal Transport (SFOT)
  ...
  5: Compute the outputs $h$ of the model via teacher-forcing and $\hat{h}$ via student-forcing
  6: Compute the cost matrix $C$ based on the choice of cost function in (8), (10)
  7: Compute the OT loss $\mathcal{R}_{\text{ot}}(x, \tilde{x})$ defined in (6) via the IPOT algorithm
  8: Update $\theta$ by optimizing $\mathcal{L}_{\text{SFOT}}$ defined in (7)
  9: end while
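The diagonal-concentration penalty referenced as (10) can be sketched as follows. The exact functional form of the inverse-difference-moment penalty is our assumption, and normalizing positions by sequence length (so the penalty is length-invariant) is our own addition:

```python
import numpy as np

def order_preserving_cost(C, beta=0.5):
    """Add an inverse-difference-moment style penalty to cost matrix C,
    discouraging matches far from the diagonal (assumed form, see text)."""
    T, Tp = C.shape
    i = np.arange(T)[:, None] / max(T - 1, 1)    # normalized positions
    j = np.arange(Tp)[None, :] / max(Tp - 1, 1)
    penalty = 1.0 - 1.0 / (1.0 + (i - j) ** 2)   # zero on the diagonal
    return C + beta * penalty

C = np.zeros((4, 4))                  # all matches equally cheap...
C_hat = order_preserving_cost(C)
print(np.round(C_hat, 3))             # ...but off-diagonal now costs more
```

With a uniform base cost, the penalized matrix makes near-diagonal transport strictly cheaper, so the IPOT plan concentrates mass around position-consistent matches.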

An Imitation Learning Interpretation of Student-Forcing
The sequential text generation process can be reformulated using Markov decision processes (MDPs). Formally, an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P_s, r \rangle$ defines a transition probability $P_s(s' \,|\, s, a)$ from state $s \in \mathcal{S}$ to the next state $s' \in \mathcal{S}$ after the agent takes an action $a \in \mathcal{A}$; $r(s, a)$ is an unknown reward function, and the discount factor is set to one for simplicity. To cast text generation as an MDP, we may consider the action as the selection of the next word $x_t$ from the vocabulary, conditioned on the states of observed words $x_{<t}$, i.e., $s_t = x_{<t}$ and $a_t = x_t$. At each time step $t$, the agent takes an action $a_t$ at state $s_t$ according to policy $\pi(a_t \,|\, s_t)$, and receives a reward $r(s_t, a_t)$.
The conjunction of the generated word and the previous state constitutes the next state $s_{t+1}$. Imitation learning seeks to learn an optimal policy from demonstrations of an expert policy $\pi_E$. In language models, the training text plays the role of expert trajectories. The objective is formally:
$$\min_{\pi} \max_{r} \; \mathbb{E}_{(s,a) \sim \rho_{\pi_E}}[r(s, a)] - \mathbb{E}_{(s,a) \sim \rho_{\pi}}[r(s, a)] - H(\pi), \quad (11)$$
where $\rho_\pi(s, a)$ is the stationary joint distribution of $(s, a)$ induced by the learning policy $\pi$, and $H(\pi)$ is the entropy. Intuitively, the objective encourages higher rewards to be assigned to the expert policy $\pi_E$, while $\pi$ is trained to mimic $\pi_E$.
Importantly, (11) suggests that each individual word of the sentences that induce the distribution $\rho_\pi$ should be fully generated by the learned language model $\pi$. In other words, at each time step, $\pi$ generates a word using its own previously generated words. This is exactly the student-forcing scheme we employ for the proposed SFOT algorithm, and it reveals the key difference with Chen et al. (2019), where teacher-forcing is employed. Since such an agent takes actions based on partial expert trajectories, it induces a biased occupancy measure. This results in a sub-optimal policy for text generation, even when the imitation learning objective (11) is optimized.

Related Work
Text Generation Natural Language Generation (NLG) is a challenging NLP task. Neural language models parameterized by autoregressive architectures are widely used for NLG. To improve global control over generated sentences, variational auto-encoders have been considered for language generation (Bowman et al., 2016; Fu et al., 2019; Fang et al., 2019; Li et al., 2020a). Recently, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) improved generation fluency via pre-training on massive text corpora. All of these models are trained with MLE using Teacher-Forcing, which is known to suffer from exposure bias in principle (Bengio et al., 2015). Several methods have been proposed to address the problem (Shao et al., 2018; Zhang et al., 2019). Adversarial training techniques were also proposed (Yu et al., 2017; Zhu et al., 2018; Che et al., 2017; Lin et al., 2017; Guo et al., 2018; Chen et al., 2018; Li et al., 2020b; Yang et al., 2019; Zhang et al., 2018; Liang et al., 2018). However, adversarial NLG models can suffer from gradient vanishing and unstable training. Indeed, Caccia et al. (2018) argue that a temperature-sweeping approach on MLE can outperform GAN-based models. Our model further improves on this line of work by adopting a principled sequence-matching loss via optimal transport, achieving state-of-the-art results on NLG tasks.
Optimal Transport Optimal transport is widely employed for a variety of NLP tasks, including document classification (Kusner et al., 2015), word embedding space alignment (Alvarez-Melis and Jaakkola, 2018), and generative adversarial networks (Chen et al., 2018). The work most related to ours is TFOT (Chen et al., 2019). We discuss the differences between the proposed SFOT and TFOT as follows:
• Strong empirical evidence on long sentences.
While the overall performance of SFOT is superior to TFOT by a decent margin on standard datasets, we emphasize that the main advantage of SFOT is on long sentences. As shown in the break-down analysis (Figure 4), SFOT is significantly better than TFOT when the sentence length is larger than 60. This highlights the critical contribution of this paper to addressing the exposure bias problem, from which longer sentences usually suffer.
• Methodological novelty. Besides the difference between SF decoding and TF decoding in the two methods, we also propose a "Contextualized OT with Order-Preserving Regularizer" technique, which improves both SFOT and TFOT, as shown in Table 4. Note that no order information is used in TFOT, which degenerates it to word-embedding alignment instead of sequence matching. In contrast, the proposed technique can utilize sequence-level information, thus improving performance.
• Theoretical difference. In Section 2.3, we provide theoretical justification for why SFOT can reduce exposure bias while TFOT still suffers from it: TFOT is based on partial expert trajectories and induces a biased occupancy measure, while the proposed SFOT uses previously self-generated words and can obtain an optimal policy.
Sequence matching Direct sequence matching has been explored widely in various machine learning tasks. The Jaccard distance has been used to retrieve prototypes for sentence generation (Guu et al., 2018), and the Chernoff distance has been applied to image classification (Su et al., 2015). However, these distances assume each instance in the sequence is independent, ignoring temporal information; they consequently measure sequence alignment poorly, as they miss semantic relationships inside a sentence (e.g., cause and effect). Dynamic time warping (DTW) (Sakoe et al., 1990) and Connectionist Temporal Classification (CTC) (Graves and Jaitly, 2014) consider temporal information, and have been adopted widely in speech recognition. However, these losses enforce strictly ordered alignment, and hence cannot be directly applied to text sequences.

Experiments
We perform experiments on neural machine translation (NMT), abstractive text summarization, and unconditional natural language generation (NLG) tasks. Algorithms are implemented in TensorFlow and trained on an NVIDIA TITAN X GPU.

Neural Machine Translation
Two standard datasets are tested for NMT: a small-scale English-Vietnamese corpus from the IWSLT 2015 Evaluation Campaign (Cettolo et al., 2015) and a large-scale English-German corpus from the WMT16 Evaluation Campaign. Further details of the datasets and the experimental setup are given in Appendix A.2.
We compare SFOT with a variety of methods: MLE (Luong et al., 2017), SS, TFOT, and RL-based methods, i.e., RAML (Norouzi et al., 2016), SPG (Ding and Soricut, 2017), and MIXER (Ranzato et al., 2016b). The results are summarized in Tables 1 and 2. The proposed SFOT approach consistently improves upon MLE training and outperforms the other models in all experimental setups. Beyond the quantitative results, we observe that SFOT correctly maintains information from the source side to produce correct translations (Table 3). As can be seen from these examples, our model better preserves information from the source and is less likely to translate words incorrectly. Notice that most errors in the baseline models occur in the latter part of the sequences, due to error accumulation from exposure bias, which SFOT addresses by matching the free-running outputs to the ground-truth. In conjunction with the quantitative results presented above, these qualitative observations confirm that our model can generate more reliable translations for long sentences and address the exposure-bias problem.
We further investigate performance on the English-Vietnamese dataset. We first check the convergence of the BLEU score on the validation set, shown in Figure 3. SS converges similarly to the MLE model and shows only marginal improvement. The models with OT converge much faster in the early stages and finish with higher final performance. However, TFOT has an unstable convergence trajectory, as it can degrade MLE training performance. SFOT, by contrast, is consistently better than MLE training, achieving the best BLEU score among the three models.
To further demonstrate SFOT's ability to address exposure bias, we follow Bahdanau et al. (2015) and group sentences of similar lengths together, computing a BLEU score per group on the test set. As errors accumulate during generation, longer sentences suffer more from exposure bias and have lower quality. Figure 4 shows that our model is more effective in handling long sentences: compared with other methods, SFOT is more robust for longer sentence lengths, indicating that matching the generated and ground-truth sentences at the sequence level may alleviate exposure bias for long-text generation.
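The per-length breakdown can be reproduced with a simple bucketing helper. This is our own sketch of the evaluation protocol, not the paper's script: the bucket width is arbitrary, and the unigram-precision scorer is a stand-in for any sentence-level BLEU implementation.

```python
from collections import defaultdict

def score_by_length(pairs, score_fn, bucket_size=10):
    """Group (reference, hypothesis) pairs by reference length and
    average a sentence-level score within each bucket."""
    buckets = defaultdict(list)
    for ref, hyp in pairs:
        bucket = (len(ref) // bucket_size) * bucket_size
        buckets[bucket].append(score_fn(ref, hyp))
    return {b: sum(s) / len(s) for b, s in sorted(buckets.items())}

# Placeholder scorer: unigram precision (stand-in for sentence BLEU).
def unigram_precision(ref, hyp):
    return sum(w in ref for w in hyp) / max(len(hyp), 1)

pairs = [
    (["a"] * 5, ["a"] * 5),                # short hypothesis, perfect
    (["a"] * 25, ["a"] * 20 + ["b"] * 5),  # long hypothesis, degraded tail
]
print(score_by_length(pairs, unigram_precision))
```

Plotting the per-bucket averages against bucket length reproduces the kind of break-down curve shown in Figure 4.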
Additionally, we compare different OT cost function variants in Table 4. We denote contextualized OT with the order-preserving regularizer as c. Contextualized OT improves translation quality when used with either TFOT or SFOT. The improvements on both models indicate that contextualized representations with order preservation are capable of capturing more of a sentence's semantic information.

Abstractive Text Summarization
We use the widely considered English Gigaword corpus (Graff et al., 2003) for the text-summarization task. Similar to the NMT experiments, we use MLE as our baseline model and further compare SFOT with SS, TFOT, and several RL-based methods, i.e., RAML, SPG, and MIXER. We evaluate model performance using ROUGE scores (ROUGE-1, -2, and -L) (Lin, 2004), the most popular metrics for summarization. Details of the datasets and the experimental setup are given in Appendix A.3.
Summarization results are provided in Table 5. Consistent with our NMT results, SFOT outperforms all other methods, showing that contextualized matching is capable of capturing semantic information essential for high-quality generation. Moreover, the superiority of SFOT over RL-based models demonstrates that OT (sentence-level) matching is more robust than the word/phrase matching in RL rewards.

Neural Language Generation
Following recent unconditional long-text generation work (Caccia et al., 2018), we perform experiments on the EMNLP2017 WMT News dataset. All sentences are longer than 20 tokens, making the dataset appropriate for testing the exposure bias problem.
To evaluate the effectiveness of our model, we consider various baseline methods, including recent GAN-based text generation approaches, such as SeqGAN (Yu et al., 2017), RankGAN (Lin et al., 2017), MaliGAN (Che et al., 2017), and LeakGAN (Guo et al., 2018), as well as an MLE-trained model using temperature sweep (Caccia et al., 2018). We apply SFOT with contextualized OT to improve the temperature-sweep MLE model. SS does not show significantly different results compared to the MLE model. For the evaluation metric, we follow the current protocol for NLG evaluation (Zhu et al., 2018) w.r.t. both quality and diversity. Specifically, the quality of the generation is measured with BLEU score (Papineni et al., 2002) and the diversity is evaluated with Self-BLEU (Zhu et al., 2018). Human evaluation is further considered to measure the quality of generation. More details of the experimental setup are shown in Appendix A.4.

Table 3: Comparison of German-to-English translation examples. For each example, we show the human translation (reference) and the translations from MLE, TFOT, and SFOT. We highlight the key phrase differences between the reference and the translation outputs in blue and red, and annotate translation errors in bold. In the first example, SFOT correctly maintains all the information in "since winning May's election" by translating it to "since his election victory in May", whereas MLE only generates "in May" and TFOT also misses "winning" from the reference. In the second example, SFOT successfully keeps the information "Beijing", whereas MLE generates the wrong words "expiration of" and TFOT changes "Beijing" to "government".

Reference: India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.
MLE: India ' s new prime minister , Narendra Modi , meets his Japanese counterpart , Shinzo Abe , in Tokyo , during his first major foreign visit in May to discuss economic and security relations .
TFOT: India ' s new prime minister Narendra Modi meets his Japanese counterpart , Shinzo Abe , in Tokyo at his first major foreign visit since his election in May in order to discuss economic and security relations .
SFOT: India ' s new prime minister , Narendra Modi , is meeting his Japanese counterpart Shinzo Abe in Tokyo in his first major foreign visit since his election victory in May to discuss economic and security relations .

Reference: Chinese leaders presented the Sunday ruling as a democratic breakthrough because it gives Hong Kongers a direct vote, but the decision also makes clear that Chinese leaders would retain a firm hold on the process through a nominating committee tightly controlled by Beijing.
MLE: The Chinese leadership presented the decision of Sunday as a democratic breakthrough , because it gives Hong Kong citizens a direct right to vote , but the decision also makes it clear that the Chinese leadership maintains the expiration of a nomination committee closely controlled by Beijing .
TFOT: The Chinese leadership presented Sunday ' s decision as a democratic breakthrough because it gives the citizens of Hong Kong a direct right to vote , but the decision also makes it clear that the Chinese leadership keeps the process firmly in the hands of a government - controlled Nomination Committee .
SFOT: The Chinese leadership presented the decision on Sunday as a democratic breakthrough , because Hong Kong citizens have a direct electoral right , but the decision also makes it clear that the Chinese leadership remains firmly in hand with a nominating committee controlled by Beijing .
To reasonably select the best model along the temperature sweep, motivated by Gu et al. (2019) we propose the BLEU-F1 score, which evaluates the trade-off between quality and diversity simultaneously:

BLEU-F1 = (2 × BLEU × (1 − Self-BLEU)) / (BLEU + (1 − Self-BLEU)).   (12)

For human evaluation, ten native speakers are asked to rate each sentence on a scale of 1 to 5 in terms of readability and meaningfulness. Figure 5 shows the BLEU-F1 score versus reverse temperature for MLE and SFOT. We observe that the best temperature is 1/1.5 for the MLE model and 1/1.4 for SFOT; we conduct further analysis under these temperatures. Figure 5 also indicates that SFOT consistently improves over MLE on the BLEU-F1 score.
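As a minimal sketch, Eq. (12) is the harmonic mean of BLEU (quality) and 1 − Self-BLEU (diversity); the function name `bleu_f1` is ours, and both scores are assumed to be given as floats in [0, 1]:

```python
def bleu_f1(bleu: float, self_bleu: float) -> float:
    """Harmonic mean of quality (BLEU) and diversity (1 - Self-BLEU), Eq. (12)."""
    diversity = 1.0 - self_bleu
    denom = bleu + diversity
    if denom == 0.0:
        # Degenerate case: zero quality and zero diversity.
        return 0.0
    return 2.0 * bleu * diversity / denom
```

Like any F1-style score, it is high only when quality and diversity are both high, which is what makes it a reasonable single criterion for the temperature sweep.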
We compare SFOT with the proposed strong baselines in Figure 6 and report human evaluation of generation quality in Table 6. SFOT achieves the highest generation quality in human evaluation. With better-guided sequence-level semantic information, SFOT generates high-quality sentences at higher temperatures than the MLE model. The MLE model must decrease the temperature to concentrate on generating safe (high-probability) words; this avoids error accumulation by avoiding risky words, but the model loses generation diversity because it focuses only on safe words. SFOT obtains better quality at higher temperatures, indicating that it can generate reasonable sequences even on riskier words, and hence can address exposure bias and gain more diversity in generation. We also observe that SFOT outperforms all text GANs in terms of the quality-diversity trade-off and human evaluation. Under a similar Self-BLEU score, SFOT significantly improves the quality over LeakGAN (Guo et al., 2018), the best GAN by the BLEU metric.

Conclusions
We have introduced SFOT to mitigate exposure bias in text generation. The proposed model captures positional and contextual information of word tokens in OT matching. Experiments on neural machine translation, text summarization, and text generation have demonstrated the effectiveness of the SFOT algorithm, yielding improved performance over strong baselines on these tasks.
tion task. We follow the preprocessing in Rush et al. (2015). The dataset is sampled and split into train/dev/test sets of size 200K/8K/2K.

A.4 Natural Language Generation Experiment
In the NLG experiment, 200K sentences are collected as the training set and 10K sentences as the test set. We set the OT weighting parameter λ = 1 and the order-preserving penalty weighting parameter β = 0.1. Since the input sequence y is empty in a language model, to better guide the student-forcing output we adopt scheduled sampling with ratio 0.3 in our experiments. The samples generated by SFOT are presented in Table 7.
So , this is a great way , but I'm not sure how to do that , he said .
When made an emergency landing , the driver who was also injured in the blast was arrested on suspicion of causing death .
The result is that the company's economic growth rate is rising by a substantial margin in November, which is even higher than a year ago .
It's really a big deal for us and we're going to get ready for the second game .
We feel like it's very hard to say that it's really going to be the next generation .
You don't want to be a kid , and there are a lot of things that you can do .
The government's decision to extend its coal policy vote will be announced in the first half of 2017 .
I'm not able to do it , but I think it's pretty important for him to be the best player .
But I'm not sure what the supporters can do in this election , he said , referring to the Sanders campaign .
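The scheduled sampling used in the NLG experiment (ratio 0.3, Appendix A.4) can be sketched per decoding step as below; treating the ratio as the probability of feeding the model's own token rather than the ground truth is an assumption here, and `scheduled_input` is a hypothetical helper name:

```python
import random

def scheduled_input(gt_token, model_token, ratio, rng=random):
    """Choose the next decoder input under scheduled sampling.

    With probability `ratio`, feed the model's own prediction (student
    forcing); otherwise feed the ground-truth token (teacher forcing).
    """
    return model_token if rng.random() < ratio else gt_token
```

At ratio 0 this reduces to pure teacher forcing, and at ratio 1 to pure student forcing; a fixed ratio of 0.3 keeps most inputs anchored to the ground truth while still exposing the model to its own outputs.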

Figure 1: Comparison of (a) teacher-forcing (TF) and (b) student-forcing (SF), where s is the start token. In (a), the ground-truth (GT) sequence x is compared with the TF-generated sequence x̂ to produce the cost matrix C^T; in (b), the GT sequence x is compared with the SF-generated sequence x̂ to produce the cost matrix C^S.
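The TF/SF distinction in Figure 1 can be sketched with a toy step function; `model_step` and the integer tokens below are hypothetical stand-ins for an RNN decoder step, not the paper's model:

```python
def decode(model_step, ground_truth, start_token, teacher_forcing: bool):
    """Unroll a one-step decoder in TF or SF mode.

    model_step(prev_token) -> predicted next token.
    Teacher forcing feeds the ground-truth token as the next input;
    student forcing feeds the model's own previous prediction.
    """
    outputs = []
    prev = start_token
    for t in range(len(ground_truth)):
        pred = model_step(prev)
        outputs.append(pred)
        prev = ground_truth[t] if teacher_forcing else pred
    return outputs
```

With an imperfect step function, SF compounds its own mistakes along the trajectory, while TF is re-anchored to the ground truth at every step; this is exactly the train/test discrepancy behind exposure bias.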

Figure 2: Illustration of the intuition of the proposed OT extensions. The goal is to match the sequence ABACA to the sequence AABAC. Left: traditional OT, where only the token information is considered; thus the letter A at any position in the two sequences can be aligned. Right: the proposed extensions of OT, where context and ordering information help to eliminate undesirable alignments. In terms of context, the letter A will be matched to occurrences that share a similar context {A, B} and {A, C}. In terms of ordering, the letter A will be aligned to letters with similar positions.

Contextualized OT with Order-Preserving Regularizer
Because the same word in different linguistic contexts may have different meanings, a cost function that cannot capture such variability may lead to undesirable matching results. While word embeddings {h_t^(0)}_{t=1}^T may be myopic, hidden representations {h_t^(ℓ)}_{t=1}^T at higher layers (ℓ > 0) of deep language models (e.g., the LSTM (Hochreiter and Schmidhuber, 1997) or the Transformer (Vaswani et al., 2017)) often capture contextualized representations of the words in the sequence. Inspired by work on deep contextualized word representations (Peters et al., 2018; Devlin et al., 2018; Radford et al., 2018), we replace the word embeddings with {h_t^(ℓ)}_{t=1}^T
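As a rough sketch (not the paper's exact formulation), a contextualized cost with an order-preserving position penalty can be paired with entropy-regularized OT solved by Sinkhorn iterations. The penalty weight beta = 0.1 follows Appendix A.4; the uniform marginals, cosine distance, and regularization strength `reg` are assumptions:

```python
import math

def sequence_cost(emb_x, emb_y, beta=0.1):
    """Cosine-distance cost between (contextual) embeddings, plus an
    order-preserving penalty on normalized position differences.
    Assumes all embedding vectors are nonzero."""
    n, m = len(emb_x), len(emb_y)

    def cos_dist(a, b):
        dot = sum(p * q for p, q in zip(a, b))
        na = math.sqrt(sum(p * p for p in a))
        nb = math.sqrt(sum(q * q for q in b))
        return 1.0 - dot / (na * nb)

    return [[cos_dist(emb_x[i], emb_y[j]) + beta * abs(i / n - j / m)
             for j in range(m)] for i in range(n)]

def sinkhorn_ot(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform marginals (Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    a, b = 1.0 / n, 1.0 / m
    for _ in range(n_iters):
        u = [a / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Matching a sequence against itself then yields a near-diagonal plan: identical tokens at distant positions are discouraged from aligning by the position penalty, which is the intuition Figure 2 illustrates.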

Figure 4: Translation quality as sentences become longer on NT2013 EN-VI.

Table 1: VI-EN and EN-VI translation BLEU scores.

Table 2: DE-EN and EN-DE translation BLEU scores.

Table 4: BLEU scores for the VI-EN and EN-VI ablation study.

Table 5: Results of text summarization on the English Gigaword dataset.

Table 6: Human evaluation of NLG on the EMNLP News 2017 dataset. 100 generated sentences from each model are rated 1-5, with means and standard deviations reported. Real sentences were rated 4.21 ± 0.44.

Table 7: Examples generated by SFOT in the NLG experiments.