Semantic Matching for Sequence-to-Sequence Learning



Introduction
Sequence-to-sequence (Seq2Seq) models are widely used in various natural-language-processing tasks, such as machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), text summarization (Rush et al., 2015; Chopra et al., 2016) and image captioning (Vinyals et al., 2015; Xu et al., 2015). Typically, these models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.
The Wasserstein distance between two text sequences, i.e., the word-mover distance (Kusner et al., 2015), can serve as an effective regularizer for semantic matching in Seq2Seq models (Chen et al., 2019). Classical optimal transport requires that each piece of mass in the source distribution be transported to an equal-weight piece of mass in the target distribution. However, this requirement is too restrictive for Seq2Seq models, making direct application inappropriate for the following reasons: (i) texts often have different lengths, and not every element in the source text corresponds to an element in the target text. A good example is style transfer, where some words in the source text do not have corresponding words in the target text. (ii) It is reasonable to semantically match important words while neglecting some others, e.g., conjunctions. In typical unsupervised models, text data are non-parallel in the sense that pairwise data are typically unavailable (Sutskever et al., 2014). Thus, both pairwise-information inference and text generation must be performed in the same model with only non-parallel data. Classical OT is not applicable without target text sequences. However, partial target information is available: for example, the objects detected in an image should be described in its caption, and the content words should be preserved when changing the style. OT fails in these cases, but matching can still be performed by optimal partial transport (OPT). Specifically, we exploit the partial target information by partially matching it with the generated texts. The partial matching is implemented based on lexical information extracted from the texts. We call our method SEmantic PArtial Matching (SEPAM).
To demonstrate the effectiveness of SEPAM, we apply it to sequence-prediction tasks where semantic partial matching is needed: (i) in unsupervised text-style transfer, SEPAM can be employed for content preservation by partially matching the input and generated text; (ii) in image captioning, SEPAM can partially match the objects detected in images with the corresponding captions for more informative generation; (iii) in table-to-text generation, SEPAM can prevent hallucination (Dhingra et al., 2019) by partially matching tabular keywords with generated sentences.
The main contributions of this paper are summarized as follows: (i) A novel semantic matching scheme based on optimal partial transport is proposed. (ii) Our model can be interpreted as incorporating prior knowledge into optimal transport to exploit the structure of natural language, while keeping the algorithm tractable for real-world tasks. (iii) To demonstrate the versatility of the proposed scheme, we empirically show consistent improvements in style transfer for content preservation, in image captioning for informative image descriptions, and in table-to-text generation for faithful generation.

Background

Optimal Transport
Optimal transport defines distances between probability measures on a domain X (the word-embedding space in our setting). The optimal transport distance for two probability measures µ and ν is defined as (Peyré et al., 2017):

$D_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\,[c(x, y)], \quad (1)$

where Π(µ, ν) denotes the set of all joint distributions γ(x, y) with marginals µ(x) and ν(y); c(x, y) : X × X → R is the cost function for moving x to y, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that γ induces in order to transport from µ to ν. When c(x, y) is a metric on X, D_c(µ, ν) induces a proper metric on the space of probability distributions supported on X, commonly known as the Wasserstein distance (Villani, 2008).
We focus on applying the OT distance to textual data, and therefore only consider OT between discrete distributions. Specifically, consider two discrete distributions µ, ν ∈ P(X), which can be represented as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} v_j \delta_{y_j}$, with δ_x the Dirac function centered on x. The weight vectors $u = \{u_i\}_{i=1}^{n}$ and $v = \{v_j\}_{j=1}^{m}$ lie in the probability simplex, i.e., $\sum_i u_i = \sum_j v_j = 1$, as both µ and ν are probability distributions. Under such a setting, computing the OT distance defined in (1) can be reformulated as the following minimization problem:

$D_c(\mu, \nu) = \min_{T \in \Pi(u, v)} \langle T, C \rangle, \quad \Pi(u, v) = \{ T \in \mathbb{R}_+^{n \times m} \mid T \mathbf{1}_m = u, \; T^\top \mathbf{1}_n = v \}, \quad (2)$

where $\mathbf{1}_n$ denotes an n-dimensional all-one vector, C is the cost matrix given by $C_{ij} = c(x_i, y_j)$, and $\langle T, C \rangle = \mathrm{Tr}(T^\top C)$ represents the Frobenius dot-product.
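As a concrete illustration of this discrete formulation, the sketch below (our own, not code from the paper; the entropy-regularized Sinkhorn iteration is a standard approximation, not the paper's solver) computes a cosine cost matrix between token embeddings and an approximate transport plan:

```python
import numpy as np

def cosine_cost(X, Y):
    """Cost matrix C_ij = 1 - cos(x_i, y_j) between token embeddings."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn_ot(u, v, C, eps=0.05, n_iter=500):
    """Entropy-regularized OT (Sinkhorn): returns plan T and cost <T, C>."""
    K = np.exp(-C / eps)
    b = np.ones_like(v)
    for _ in range(n_iter):
        a = u / (K @ b)        # scale rows toward marginal u
        b = v / (K.T @ a)      # scale columns toward marginal v
    T = a[:, None] * K * b[None, :]
    return T, float((T * C).sum())

# toy "sentences": 3 and 2 token embeddings with uniform weights
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))
u, v = np.full(3, 1 / 3), np.full(2, 1 / 2)
T, dist = sinkhorn_ot(u, v, cosine_cost(X, Y))
```

In practice, u and v would be the length-normalized masses placed on the tokens of the two sentences, as defined above.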

Optimal Partial Transport
Optimal partial transport (OPT) was initially studied by Caffarelli and McCann (2010). It is a variant of optimal transport in which only a portion of the mass is transported, in an efficient way. In OPT, the transport problem is defined by generalizing γ to a Borel measure, such that:

$D_c^{m}(f, g) = \min_{\gamma \in \Pi_{\le}(f, g),\; \gamma(\mathbb{R}^n \times \mathbb{R}^n) = m} \int c(x, y) \, \mathrm{d}\gamma(x, y), \quad (3)$

where Π≤(f, g) is defined as the set of nonnegative finite Borel measures on R^n × R^n whose first and second marginals are dominated by f and g respectively, i.e., $\gamma(A \times \mathbb{R}^n) \le f(A)$ and $\gamma(\mathbb{R}^n \times A) \le g(A)$ for every Borel set A. Here f and g can be considered as the maximum marginal measures for γ. As a result, if m is less than min{‖f‖_{L¹}, ‖g‖_{L¹}}, γ assigns zero measure to some elements of the space. In other words, the zero-measure elements need not be considered when matching µ and ν; the elements with non-zero measure constitute the active regions. A challenge in OPT is how to detect these active regions, so directly optimizing (3) is typically very difficult and computationally expensive. In our setting of text analysis, we propose to leverage prior knowledge to define the active regions, as introduced below.

Semantic Matching via OPT
In unsupervised Seq2Seq tasks without pair-wise information, naively matching the generated sequence with the weak-supervision information (e.g., the source text in style transfer) yields deficient performance, even though both sentences share similar content. In supervised settings, target and input sequences are of different lengths but are similar in semantic meaning, such as in table-to-text generation. Motivated by this, we propose a novel technique for semantic partial matching and consider two scenarios: (i) text-to-text matching and (ii) image-to-text matching.

Text-to-Text Matching We consider semantic matching between two sequences in Seq2Seq models, where partial matching is important: i) in the unsupervised setting, such as non-parallel style transfer, partially matching the source and target texts is helpful for content preservation; ii) in the supervised setting, such as table-to-text generation, partially matching the input and target sequences can effectively avoid hallucinated generation, i.e., text that mentions extra information beyond what is present in the source. Figure 1 shows an example of partial matching, where part-of-speech (POS) tags for each word provide prior knowledge. In these cases, directly applying OT causes imbalanced-transportation issues or poor performance.
Image-to-Text Matching Objects and their properties (e.g., colors) are included in both the pair-wise images and captions. Consider the image-to-text matching in Figure 1. It is clear that each object in the image has corresponding words/phrases in the captions. We can therefore match the labels of objects detected in the image to some words in its caption. Note that the labels are not in one-to-one correspondence with the text, so directly applying OT is inappropriate, similar to the case of text-to-text matching.
Different Matching Schemes Hard matching seeks an exact match between the source and target. Typically, hard matching is too simple to be effective because it ignores semantic similarity, and applying classical optimal transport in unsupervised settings causes imbalanced matching, since some unnecessary words are included in the source and the exact target is unavailable. To tackle this issue, one could directly apply optimal partial transport (OPT) to detect which words have correspondences and match each word with its target. However, the detection process is computationally expensive: solving the constrained optimization in (3) does not scale to real-world tasks. Fortunately, we can exploit the syntactic information in text and incorporate it as prior knowledge into OPT to avoid the detection procedure.

Partial Matching via OPT
We formulate the proposed semantic matching as a partial optimal-transport problem, where only parts of the source and target are matched. Specifically, we incorporate prior knowledge into the optimal partial transport (OPT); this prior knowledge helps determine the set of words to match, i.e., M(X), where M(·) is a function returning the set of words/phrases to match. The strategy for determining M(·) depends on the task.
OPT distance To apply the OT distance to text, we first represent a sentence Y with a discrete distribution $p_Y = \frac{1}{T} \sum_t \delta_{e(y_t)}$ in the semantic space, where a length-normalized point mass is placed at the semantic embedding $e_{y_t} = e(y_t)$ of each token $y_t$ of the sequence Y. Here e(·) denotes a word-embedding function mapping a token to its d-dimensional feature representation. For two sentences X and Y, we define their OPT distance as:

$D_c(X, Y) = \min_{T \in \Pi_c(\mu, \nu)} \langle T, C \rangle,$

where Π_c(µ, ν) is the solution space restricted by the prior knowledge. Different from classical OPT, the elements in µ or ν to match have been explicitly defined by M(·), which represents the prior knowledge. In other words, the constraint of OPT becomes more specific and does not require any optimization procedure to identify the active regions. We use the cosine distance as the cost function, $c(e_x, e_y) = 1 - \frac{e_x^\top e_y}{\|e_x\|_2 \|e_y\|_2}$.

Approximation of OPT Computing the exact OPT distance is computationally challenging (Figalli, 2010). We bypass the difficulty of active-region detection using lexical information and reformulate the problem as an OT problem. We then employ the IPOT algorithm (Xie et al., 2018) to obtain an efficient approximation. In practice, we use a keyword mask K, defined from M(X) and M(Y). Specifically, IPOT applies proximal gradient descent to solve for the optimal transport matrix:

$T^{(t+1)} = \underset{T \in \Pi(u, v)}{\arg\min} \; \langle T, C' \rangle + \gamma \, D_{KL}(T, T^{(t)}),$

where $C' = K \odot C$, 1/γ > 0 is the generalized step size, and the generalized KL-divergence $D_{KL}(T, T^{(t)})$ is used as the proximity metric. The full approach is summarized as Algorithm 1 in Appendix A.
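A minimal NumPy sketch of this masked IPOT approximation follows. Two points are illustrative assumptions of ours rather than the paper's specification: the token weights are uniform, and the mask simply inflates the cost of token pairs outside the keyword sets (the paper elides the exact definition of K).

```python
import numpy as np

def ipot(u, v, C, beta=1.0, n_outer=100):
    """IPOT (Xie et al., 2018): proximal-point iterations for the OT plan,
    with step size 1/beta and one Sinkhorn update per outer iteration."""
    n, m = C.shape
    G = np.exp(-C / beta)           # kernel of the proximal subproblem
    T = np.ones((n, m))
    b = np.ones(m)
    for _ in range(n_outer):
        Q = G * T                   # elementwise: carries the previous plan
        a = u / (Q @ b)
        b = v / (Q.T @ a)
        T = a[:, None] * Q * b[None, :]
    return T

def keyword_mask(src_keep, tgt_keep, penalty=4.0):
    """K_ij = 1 if both tokens are selected by M(.), else a cost penalty."""
    return np.where(np.outer(src_keep, tgt_keep), 1.0, penalty)

# toy setup: 4 source tokens (3 keywords), 3 target tokens (2 keywords)
rng = np.random.default_rng(1)
C = rng.uniform(0.1, 1.0, size=(4, 3))
u, v = np.full(4, 1 / 4), np.full(3, 1 / 3)
K = keyword_mask(np.array([1, 1, 0, 1]), np.array([1, 0, 1]))
T = ipot(u, v, K * C)               # plan for the masked cost C' = K ⊙ C
loss = float((T * (K * C)).sum())   # approximate matching loss
```

In the full model, the cost C would come from the cosine distance between token embeddings, so gradients can flow through C when the embeddings are differentiable.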

Semantic Partial Matching for Text Generation
Assume there are two sets of objects $X = \{X^{(i)}\}_{i=1}^{M}$ and $Y = \{Y^{(j)}\}_{j=1}^{N}$. We consider a Seq2Seq model where the input is¹ X and the output is a sequence of length T with tokens y_t, i.e., $Y = (y_1, y_2, \ldots, y_T)$. One typically assigns the following probability to an observation y at position t: $p(y \mid Y_{<t}) = [\mathrm{softmax}(g(s_t))]_y$, where $Y_{<t} = (y_1, y_2, \ldots, y_{t-1})$. This specifies a probabilistic model $p(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid Y_{<t})$. To train the model, one typically uses maximum likelihood estimation (MLE): $\mathcal{L}_{\mathrm{MLE}} = -\sum_i \log p(Y^{(i)} \mid X^{(i)})$. ¹For simplicity, we omit the superscript "(i)" when the context is independent of i. This also applies to Y.
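The factorization and MLE objective above can be checked with a small NumPy sketch (the names `logits` and `sequence_log_prob` are ours; in a real model the logits g(s_t) come from the decoder's final layer):

```python
import numpy as np

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

def sequence_log_prob(logits, token_ids):
    """log p(Y) = sum_t log [softmax(g(s_t))]_{y_t} for one sequence;
    `logits` has one row of decoder outputs per time step."""
    return float(sum(np.log(softmax(g)[y]) for g, y in zip(logits, token_ids)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))    # T = 4 steps, vocabulary size V = 10
Y = [3, 1, 7, 2]
nll = -sequence_log_prob(logits, Y)  # MLE minimizes this negative log-likelihood
```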
We consider an encoder-decoder framework in this paper, where a latent vector z is produced by an encoder Enc(·) with input X, i.e., $z_x = \mathrm{Enc}(X)$. Based on $z_x$, a decoder G(·) generates a new sentence Ŷ that is expected to be the same as Y. The decoder can be implemented by an LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), or Transformer (Vaswani et al., 2017). An unsupervised Seq2Seq model considers X and Y as non-parallel, i.e., the pair-wise information is unknown. One typically pretrains the generator with the reconstruction loss $\mathcal{L}_{\mathrm{AE}} = -\log p(Y \mid z_y)$, where $z_y = \mathrm{Enc}(Y)$. Note that the goal of unsupervised Seq2Seq is to generate a sequence Y given some object X. Hence, we seek to learn the conditional generation distribution p(Y|X), the same as in the classical Seq2Seq model. In practice, the generator can be trained by combining the reconstruction loss with a guidance loss containing information from X. The guidance loss can be defined following SEPAM; other losses usually depend on the task, and we omit their details for clarity.
Differentiable SEPAM Note that the SEPAM loss is not differentiable due to the multinomial sampling process $\hat{y}_t \sim \mathrm{Softmax}(g_t)$, where $g_t$ is the logit vector given by the final layer of the generator G(·). To enable direct backpropagation from the SEPAM loss for generator training, we adopt the soft-argmax approximation (Zhang et al., 2017) to avoid the use of REINFORCE (Sutton et al., 2000): $\tilde{e}_{y_t} = \mathrm{softmax}(g_t / \tau)^\top W_e$, where $W_e$ is the word-embedding matrix and 0 < τ < 1 is the annealing factor. Given two sentences, we denote the generated sequence embeddings as $S_g = \{\tilde{e}_{y_i}\}_{i=1}^{T}$ and the partial reference embeddings as $S_r = \{e_{x_j}\}_{j=1}^{T'}$ at the word or phrase level. The cost matrix C is then computed as $C_{ij} = c(\tilde{e}_{y_i}, e_{x_j})$, and the semantic partial matching loss $\mathcal{L}_{\mathrm{SEM}}$ between the reference and the model generation is computed via the IPOT algorithm.

SEPAM Regularization The SEPAM training objective discussed above only focuses on generating words with specific meanings and does not consider word ordering. To train a proper text-generation model, we propose to combine the SEPAM loss with the likelihood loss $\mathcal{L}_{\mathrm{MLE}}$ in supervised settings or $\mathcal{L}_{\mathrm{AE}}$ in unsupervised settings. Hence, the training objective in unsupervised settings is $\mathcal{L} = \mathcal{L}_{\mathrm{AE}} + \lambda \mathcal{L}_{\mathrm{SEM}}$, where λ is the weight of SEPAM to be tuned. A similar objective applies in supervised settings: $\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \lambda \mathcal{L}_{\mathrm{SEM}}$. In the following, we discuss how to extract and use prior knowledge for partial matching in three downstream tasks:

(i) Non-parallel Style Transfer Semantic partial matching between the source sentence and the transferred one is helpful for content preservation, as shown in Figure 1. Content words are usually nouns or verbs, while style words are adjectives or adverbs. Hence, M(X) and M(Ŷ) are content-word sets, extracted based on POS tags using NLTK. One can employ this prior knowledge to perform different operations on words: (i) for content words, we encourage partial matching between the sentences via $\mathcal{L}_{\mathrm{SEM}}$; and (ii) for style words, we discourage matching (Hu et al., 2017). More details are provided in Appendix A.1.
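The soft-argmax relaxation used above can be sketched as follows (the matrix name `W_e` and the deterministic toy logits are our illustrative assumptions):

```python
import numpy as np

def soft_argmax_embedding(g_t, W_e, tau=0.1):
    """Soft-argmax (Zhang et al., 2017): a convex combination of word
    embeddings weighted by softmax(g_t / tau); as tau -> 0 it approaches
    the embedding of the argmax token, yet stays differentiable in g_t."""
    z = g_t / tau
    p = np.exp(z - z.max())
    p = p / p.sum()
    return p @ W_e                       # (V,) @ (V, d) -> (d,)

rng = np.random.default_rng(0)
V, d = 8, 5
W_e = rng.normal(size=(V, d))            # word-embedding matrix
# toy generator logits with clear argmax tokens 0, 2, 4
logits = np.zeros((3, V))
logits[0, 0], logits[1, 2], logits[2, 4] = 5.0, 6.0, 7.0
S_g = np.stack([soft_argmax_embedding(g, W_e) for g in logits])
hard = W_e[[0, 2, 4]]                    # embeddings of the argmax tokens
```

The resulting S_g feeds the masked matching cost, so gradients from the matching loss reach the generator logits.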
(ii) Unsupervised Image Captioning Visual concepts extracted from an image can be employed to generate relevant captions in the unsupervised setting. Feng et al. (2019) use exact hard matching and REINFORCE to update the captioning model. Here, we apply semantic partial matching to encourage the generation of visual-concept words. Specifically, M(X) returns the visual concepts and M(Ŷ) corresponds to the generated words related to objects (i.e., nouns). This visual-concept regularization can also be applied in the supervised setting, complementing the MLE loss.
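The selector M(·) used in these tasks can be sketched as below. For self-containment we replace NLTK's POS tagger (which the paper uses) with a tiny hand-written tag lexicon, so all tags here are our own assumption:

```python
# Toy stand-in for the NLTK-based selector M(.): a hand-written POS lexicon
# replaces nltk.pos_tag so the sketch is self-contained.
TOY_POS = {
    "rooms": "NNS", "food": "NN", "woman": "NN", "parts": "NNS",
    "helped": "VBD", "cooked": "VBN", "stock": "VB",
    "spacious": "JJ", "friendly": "JJ", "rude": "JJ", "amazing": "JJ",
    "very": "RB", "well": "RB", "the": "DT", "and": "CC",
}
CONTENT_TAGS = {"NN", "NNS", "VB", "VBD", "VBN"}        # nouns and verbs
STYLE_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}   # adjectives/adverbs

def M(sentence, tags=CONTENT_TAGS):
    """Select the words whose POS tag falls in `tags`."""
    return [w for w in sentence.lower().split()
            if TOY_POS.get(w, "") in tags]

content = M("the rooms were spacious and the food was well cooked")
style = M("the rooms were spacious and the food was well cooked", STYLE_TAGS)
```

For style transfer, `content` gives the words whose matching is encouraged, while `style` gives the words whose matching is discouraged.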
(iii) Table-to-Text Generation Semantic partial matching can prevent hallucinated generation, i.e., text mentioning extra information beyond what is present in the table (Dhingra et al., 2019). M(Ŷ) extracts nouns from the generated sequence and M(X) contains the keys in the table; we then use them to compute $\mathcal{L}_{\mathrm{SEM}}$. Semantic partial matching penalizes the generator if extra information exists in the generated text Ŷ.

Related Work
Optimal transport in NLP Kusner et al. (2015) first applied optimal transport in NLP, proposing the word mover's distance (WMD). The transportation cost is usually defined as the Euclidean distance, and the OT distance is approximated by solving a less-accurate lower bound (Kusner et al., 2015). Based on this, Chen et al. (2018) proposed a feature-mover distance for style transfer. Chen et al. (2019) applied OT to classical sequence-to-sequence learning and formulated it as Wasserstein gradient flows (Zhang et al., 2018). SEPAM moves forward and applies OT in both supervised and unsupervised settings (Artetxe et al., 2018; Lample et al., 2017).
Unsupervised Seq2Seq Learning Different from the standard Seq2Seq model (Sutskever et al., 2014), parallel sentences for different styles are not provided and must be inferred from the data. Unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2017) learns to translate text from one language to another given only two unpaired sets of texts in these languages. Dai et al. (2019) explore the Transformer as the generator, instead of classical auto-regressive models. Style transfer (Shen et al., 2017) aims at transferring the style of texts with non-parallel data. Compared with these tasks, unsupervised image captioning (Feng et al., 2019) is more challenging, since images and sentences are in distinct modalities.
Copying Mechanism SEPAM is related to the copy network (Gu et al., 2016), which achieves retrieval-based copying. Li et al. (2018) further proposed a delete-retrieve-generate framework for style transfer. Chen and Bansal (2018) can also be viewed as a form of soft-copying mechanism. Instead of the retrieval-based exact copying used by Gu et al. (2016) and Li et al. (2018), SEPAM considers semantic similarity, and thus ideally delivers a smoother transformation in generation.

Demonstration
Comparison between OT and SEPAM We show two examples comparing classical OT and SEPAM on two sequence-prediction tasks in Figure 3. The first row shows the heat maps of OT and SEPAM when matching two sentences with different styles. SEPAM employs syntactic information to match selected words, and all the content words are exactly matched. With classical OT, however, some sentiment words are still matched, preventing successful style transfer. The second row in Figure 3 compares matching a generated sentence with the detected concept set in image captioning. Using SEPAM, the concepts are perfectly matched with their corresponding words in the caption, while OT includes some noisy matchings (light blue). In summary, SEPAM achieves better matching than classical OT, and the matching weights (T) of SEPAM are sparser.
Implicit Use of Prior Knowledge We also consider using the attention weights w_t from an LSTM-based text classifier to determine which words to match. As discussed in Wiegreffe and Pinter (2019), a word with a higher attention weight is more important for classification, i.e., more relevant to the style. Figure 4 shows the attention maps of three instances. Hence, we can partially match words with lower attention weights, as they are mostly non-style words. However, empirical results show that this implicit approach performs much worse than the simple rule-based strategy with POS tags.

Unsupervised Text-style Transfer
Setup We use the same data and split method described in Shen et al. (2017). Experiments are implemented in TensorFlow based on Texar (Hu et al., 2018). For a fair comparison, we use a model configuration similar to that of Hu et al. (2017) and Yang et al. (2018): a one-layer GRU (Cho et al., 2014) encoder and an LSTM attention decoder (generator). We set the weight of the semantic matching loss to λ = 0.2. Models are trained for a total of 15 epochs, with 10 epochs for pretraining and 5 epochs for fine-tuning.
Metrics We pretrain a CNN classifier, which achieves an accuracy of 97.4% on the validation set. Based on it, we report the accuracy (ACC) to measure the quality of style control. We further measure content preservation using i) BLEU, which measures the similarity between the original and transferred sentences, and ii) ref-BLEU, which compares the transferred sentences with human annotations. Fluency is evaluated with the perplexity (PPL) of generated sentences under a pretrained language model.
Baselines We implemented CtrlGen (Hu et al., 2017) and LM (Yang et al., 2018) as our baselines and added SEPAM on top of these two models for validation. Further, the conditional variational autoencoder (CVAE) (Shen et al., 2017) and the retrieval-based method of Li et al. (2018) are added as two more baselines.
Analysis Results are shown in Table 1. Combining our proposed method with the corresponding baselines exhibits similar transfer accuracy and fluency, while maintaining the content better. Specifically, SEPAM shows higher BLEU scores on human-annotated sentences, further indicating better content preservation. Table 2 shows example outputs:

Input: they do not stock some of the most common parts .
CtrlGen: they do fantastic laughed some of the most common parts .
SEPAM+CtrlGen: they do authentic expertly some of the most fascinating parts .
LM: they do definitely right some of the most cool parts .
SEPAM+LM: they do always stock some of the most amazing parts .

Input: the woman who helped me today was very friendly and knowledgeable .
CtrlGen: the woman who so-so me today was very rude and knowledgeable .
SEPAM+CtrlGen: the woman who helped me today was very rude and knowledgeable .
LM: the woman who ridiculous me today was very rude and knowledgeable .
SEPAM+LM: the woman who helped me today was very rude and stupid .

Human Evaluation We further conduct human evaluations for the proposed SEPAM using Amazon Mechanical Turk. We randomly sample 100 sentences from the test set and ask 5 different reviewers to provide rating scores for the models in terms of fluency, style, and content preservation.
We require all workers to be native English speakers with an approval rate higher than 95% and at least 100 completed assignments. For each sentence, five shuffled samples generated by different models are sequentially shown to a reviewer. Results in Table 3 demonstrate the better performance achieved by SEPAM, especially in terms of content preservation.

Image Captioning
Setup We consider image captioning on the COCO dataset (Lin et al., 2014), which contains 123,287 images in total, each annotated with at least 5 captions. Following Karpathy's split (Karpathy and Fei-Fei, 2015), 113,287 images are used for training and 5,000 are used for validation and testing. We note that the training images are only used to build the image set, with all their captions left unused for training. All descriptions in the Shutterstock image-description corpus are tokenized with a vocabulary size of 18,667 (Feng et al., 2019). The LSTM hidden dimension and the shared latent-space dimension are fixed to 512. The weighting hyper-parameters are chosen to put different rewards on roughly the same scale; specifically, λ is set to 10. We train our model using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001. During initialization, we minimize the cross-entropy loss using the Adam optimizer with a learning rate of 0.001. When generating captions in the test phase, we use beam search with a beam size of 3.
Metrics We report BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005) scores. The results of different methods are shown in Table 5. OT can improve upon the baseline by generating specific words aligned with the detected visual concepts. However, directly applying it in unsupervised settings suffers from the imbalance issue (Craig, 2014), i.e., the generated texts contain some useless elements without correspondence in the targets. Our proposed SEPAM avoids this problem via partial matching, leading to better performance.
Extension to Supervised Settings Our proposed SEPAM loss $\mathcal{L}_{\mathrm{SEM}}$ can also be applied in a supervised setting as a regularizer alongside the MLE loss. We apply SEPAM to a captioning model in which image features are fed into an LSTM sequence generator with an Att2in attention mechanism (Anderson et al., 2018). We pretrain the captioning model for a maximum of 20 epochs, then use reinforcement learning to train it for another 20 epochs. Testing is done with the best model selected on the validation set. We partially match the tags or visual features of detected objects. Similarly, we see consistent improvement of SEPAM over its baselines.

Table-to-Text Generation

Setup We evaluate SEPAM on table-to-text generation (Lebret et al., 2016; Liu et al., 2018; Wiseman et al., 2018) with the WikiPerson dataset (Wang et al., 2018), preprocessing the training set with a vocabulary size of 50,000. We use a Transformer encoder and decoder, with 8 heads, 3 Transformer blocks, 2048 hidden units in the feed-forward layer, and λ = 0.1. The model is first trained with $\mathcal{L}_{\mathrm{MLE}}$ for 20,000 steps and then fine-tuned with $\mathcal{L}_{\mathrm{SEM}}$.
Metrics For automatic evaluation, we apply the widely used metrics BLEU(-4) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004) to evaluate generation quality. Following Dhingra et al. (2019), we also evaluate hallucinated generation with the PARENT score, which considers both the reference texts and the table content.
Analysis Results in Table 4 show consistent improvement of SEPAM over baselines in terms of different evaluation metrics.

Conclusions
We incorporate prior knowledge into optimal transport to encourage partial-sentence matching, formulating it as an optimal partial transport problem. The proposed SEPAM shows broad applicability and consistent improvements over popular baselines in three downstream tasks: unsupervised style transfer for content preservation, image captioning for informative descriptions, and table-to-text generation for faithful generation. Further, the proposed technique can be regarded as a soft-copying mechanism for Seq2Seq models.

Figure 1: Semantic matching between the potential target (top) and the generated texts (bottom). The left shows how to partially match two texts with different styles. The right shows how to partially match texts with concepts detected from images.

Figure 2: Overview of the SEPAM architecture. Left: classical Seq2Seq, i.e., X and Y are pair-wise; L_SEM implements a soft-copying mechanism via semantic partial matching. Right: unsupervised Seq2Seq, i.e., X and Y are non-parallel; L_SEM provides the guidance for G(·) to generate Ỹ relevant to X via semantic partial matching. Solid lines mean gradients are backpropagated in training; dashed lines mean gradients are not backpropagated.

Figure 3: Optimal matching matrix visualization: a comparison between OT (left column) and SEPAM (right column). The optimal matching matrix of SEPAM is sparse. The horizontal axis shows the generated texts, and the vertical axis shows the partial targets.

Figure 4: Attention maps for three Yelp instances. A larger attention weight corresponds to a darker color.

Table 1: Performance of our model and the baselines on the test dataset with human annotations.

Table 2: Examples comparing different methods on the Yelp dataset.

Table 3: Human evaluation results on the Yelp dataset.

Table 5: Performance comparison of unsupervised captioning on the MSCOCO dataset.

Table 4: Performance comparison of table-to-text generation on the WikiPerson dataset.

Table 6: Performance comparison of supervised image captioning on the MSCOCO dataset.

Table 7: An example of table-to-text generation.
MLE: Xia Jin (born 14 February 1985 in Guizhou) is a Chinese Football player who currently plays for Guizhou Dangdai Lifan F.C. in the China League One . he joined Guizhou Dangdai Lifan F.C. in the summer of 2010 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he started his professional career with Chongqing Dangdai Lifan F.C. .
OT: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . jin started his professional footballer career with Guizhou Hengfeng F.C. in the Chinese Super League . jin would move to China League One side Chongqing Dangdai Lifan F.C. in February 2011 . he would move to China League Two side Chengdu Better City F.C. in January 2012 . he would move to China League Two side Chongqing Dangdai Lifan F.C. in January 2013 .
SEPAM: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . Xia Jin started his professional footballer career with Chongqing Dangdai Lifan F.C. in the Chinese Super League . Xia transferred to China League One side Chengdu Better City F.C. .

Table 7 shows an example of table-to-text generation. MLE hallucinates some information that does not appear in the table. OT alleviates this issue but still shows hallucination due to the imbalanced transportation issue. SEPAM generates almost no extra information and covers all the entries in the table.