Towards Unsupervised Language Understanding and Generation by Joint Dual Learning

In modular dialogue systems, natural language understanding (NLU) and natural language generation (NLG) are two critical components: NLU extracts the semantics from given texts, and NLG constructs the corresponding natural language sentences from input semantic representations. However, the dual property between understanding and generation has rarely been explored. Prior work made the first attempt to utilize the duality between NLU and NLG to improve performance via a dual supervised learning framework, but it still learned both components in a supervised manner. Instead, this paper introduces a general learning framework that effectively exploits such duality, providing the flexibility to incorporate both supervised and unsupervised learning algorithms to train language understanding and generation models jointly. The benchmark experiments demonstrate that the proposed approach is capable of boosting the performance of both NLU and NLG. The source code is available at: https://github.com/MiuLab/DuaLUG.


Introduction
Spoken dialogue systems that assist users in solving complex tasks such as booking a movie ticket have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Nowadays, there are several virtual intelligent assistants, such as Apple's Siri, Google Assistant, Microsoft's Cortana, and Amazon's Alexa.
The recent advance of deep learning has inspired many applications of neural dialogue systems (Wen et al., 2017; Bordes et al., 2017). A typical dialogue system pipeline can be divided into several components. A speech recognizer transcribes a user's speech input into text, and a natural language understanding (NLU) module classifies the domain along with domain-specific intents and fills in a set of slots to form a semantic frame (Tur and De Mori, 2011; Hakkani-Tür et al., 2016). A dialogue state tracking (DST) module predicts the current dialogue state according to the multi-turn conversation; then the dialogue policy determines the system action for the next turn given the current dialogue state (Peng et al., 2018; Su et al., 2018a). Finally, the semantic frame indicating the policy is fed into a natural language generation (NLG) module to construct a response utterance to the user (Wen et al., 2015b; Su et al., 2018b).
Generally, NLU extracts core semantic concepts from the given utterances, while NLG constructs corresponding sentences based on the given semantic representations. However, the dual property between understanding and generation has rarely been investigated. Su et al. (2019) first introduced the duality into the typical supervised learning schemes to train these two models. Different from the prior work, this paper proposes a general learning framework leveraging the duality between understanding and generation, providing the flexibility to incorporate not only supervised but also unsupervised learning algorithms to jointly train NLU and NLG modules. The contributions are three-fold:
• This paper proposes a general learning framework using the duality between NLU and NLG, where supervised and unsupervised learning can be flexibly incorporated for joint training.
• This work is the first attempt to exploit the dual relationship between NLU and NLG towards unsupervised learning.
• The benchmark experiments demonstrate the effectiveness of the proposed framework.
Related Work
This paper focuses on modeling the duality between understanding and generation towards unsupervised learning of the two components. Related work is summarized below.
Natural Language Understanding: In dialogue systems, the first component is a natural language understanding (NLU) module, which parses user utterances into semantic frames that capture the core meaning (Tur and De Mori, 2011). A typical NLU module first determines the domain given input utterances, predicts the intent, and then fills the associated slots (Hakkani-Tür et al., 2016; Chen et al., 2016). However, the above work focused on single-turn interactions, where each utterance is treated independently. To overcome error propagation and further improve understanding performance, contextual information has been leveraged and shown to be useful (Chen et al., 2015; Sun et al., 2016; Shi et al., 2015; Weston et al., 2015). Also, different speaker roles provide informative signals for capturing speaking behaviors and achieving better understanding performance (Chen et al., 2017; Su et al., 2018c).
Natural Language Generation: NLG is another key component in dialogue systems, where the goal is to generate natural language sentences conditioned on the given semantics from the dialogue manager. As the endpoint of interaction with users, the quality of generated sentences is crucial for a better user experience. In spite of the robustness and adequacy of rule-based methods, poor diversity makes talking to a template-based machine unsatisfactory. Furthermore, scalability is an issue, because designing sophisticated rules for a specific domain is time-consuming. Previous work proposed an RNNLM-based NLG model that can be trained on any corpus of dialogue act-utterance pairs without hand-crafted features or semantic alignment (Wen et al., 2015a). Follow-up work based on sequence-to-sequence (seq2seq) models obtained better performance by employing an encoder-decoder structure with linguistic knowledge such as syntax trees (Sutskever et al., 2014; Su et al., 2018b).
Dual Learning: Various tasks may have diverse goals, which are usually independent of each other. However, some tasks may hold a dual form; that is, we can swap the input and target of a task to formulate another task. Such structural duality emerges as one of the important relationships for further investigation. Two AI tasks are of structural duality if the goal of one task is to learn a function mapping from space X to Y, while the other's goal is to learn a reverse mapping from Y to X. Machine translation is an example (Wu et al., 2016): translation from English to Chinese has a dual task, which is translation from Chinese to English. Likewise, the goal of automatic speech recognition (ASR) is opposite to that of text-to-speech (TTS) (Tjandra et al., 2017), and so on. Previous work first exploited the duality of task pairs and proposed supervised (Xia et al., 2017) and unsupervised (reinforcement learning) (He et al., 2016) learning frameworks. However, although the duality has been incorporated into the learning objective, the two models in previous work are still trained separately. In contrast, this work proposes a general learning framework that trains the models jointly, so that unsupervised learning methods in this research field can be better explored.

Proposed Framework
In this section, we describe the problem formulation and the proposed learning framework, which is illustrated in Figure 1.

Problem Formulation
The problems we aim to solve are NLU and NLG; for both tasks, there are two spaces: the semantics space X and the natural language space Y. NLG is to generate sentences associated with the given semantics, where the goal is to learn a mapping function f : X → Y that transforms semantic representations into natural language. On the other hand, NLU is to capture the core meaning of sentences, where the goal is to find a function g : Y → X that predicts semantic representations from the given natural language.
Given n data pairs $\{(x_i, y_i)\}_{i=1}^{n}$ sampled i.i.d. from the joint space X × Y, a typical strategy for the optimization problem is based on maximum likelihood estimation (MLE) of the parameterized conditional distributions with the trainable parameters $\theta_{x \to y}$ and $\theta_{y \to x}$ as below:

$$f(x; \theta_{x \to y}) = \arg\max_{\theta_{x \to y}} \prod_{i=1}^{n} p(y_i \mid x_i; \theta_{x \to y}),$$
$$g(y; \theta_{y \to x}) = \arg\max_{\theta_{y \to x}} \prod_{i=1}^{n} p(x_i \mid y_i; \theta_{y \to x}).$$

The E2E NLG challenge dataset (Novikova et al., 2017; http://www.macs.hw.ac.uk/InteractionLab/E2E/) is adopted in our experiments, which is a crowd-sourced dataset of 50k instances in the restaurant domain. Each instance is a pair of a semantic frame containing specific slots and corresponding values, and an associated natural language utterance with the given semantics. For example, a semantic frame with the slot-value pairs "name[Bibimbap House], food[English], priceRange[moderate], area[riverside], near[Clare Hall]" corresponds to the target sentence "Bibimbap House is a moderately priced restaurant who's main cuisine is English food. You will find this local gem near Clare Hall in the Riverside area." Although the original dataset is for NLG, of which the goal is to generate sentences based on the given slot-value pairs, we further formulate the NLU task as predicting slot-value pairs based on the utterances. This can be viewed as a multi-label classification problem, where each possible slot-value pair is treated as an individual label. The formulation is similar to the prior work (Su et al., 2019).
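As a rough illustration of this multi-label formulation, the following Python sketch parses a frame string into slot-value labels and encodes it as a multi-hot target vector. The helper names `parse_frame` and `multi_hot` and the tiny label inventory are assumptions made for illustration only; they are not taken from the released code.

```python
import re

def parse_frame(frame):
    """Split 'slot[value], slot[value]' into normalized 'slot[value]' labels."""
    pairs = re.findall(r"([^,\[\]]+)\[([^\]]*)\]", frame)
    return [f"{slot.strip()}[{value.strip()}]" for slot, value in pairs]

def multi_hot(labels, label_index):
    """Encode a collection of labels as a 0/1 vector over the label inventory."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector

frame = ("name[Bibimbap House], food[English], priceRange[moderate], "
         "area[riverside], near[Clare Hall]")
labels = parse_frame(frame)

# Hypothetical inventory: the labels above plus two unrelated ones.
inventory = sorted(set(labels) | {"food[Italian]", "priceRange[cheap]"})
label_index = {label: i for i, label in enumerate(inventory)}
target = multi_hot(labels, label_index)
```

Under this encoding, an NLU model outputs one probability per label, and each position of `target` is an independent binary decision.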

Joint Dual Learning
Although previous work has introduced learning schemes that exploit the duality of AI tasks, most of it was based on reinforcement learning or standard supervised learning, and the models of the primal and dual tasks (f and g respectively) are trained separately. Intuitively, if the models of the primal and dual tasks are optimally learned, a complete cycle of transforming data from the original space to another space and then back to the original space should recover exactly the original data, which could be viewed as the ultimate goal of a dual problem. In our scenario, if we generate sentences from given semantics x via the function f and transform them back to the original semantics perfectly via the function g, it implies that our generated sentences are grounded to the original given semantics and satisfy the mathematical condition:

$$g(f(x; \theta_{x \to y}); \theta_{y \to x}) = x.$$

Therefore, our objective is to achieve the perfect complete cycle of data transformation by training the two dual models (f and g) in a joint manner.

Algorithm Description
As illustrated in Figure 1, the framework is composed of two parts: Primal Cycle and Dual Cycle. Primal Cycle starts from semantic frames x and (1) first transforms the semantic representation into sentences by the function f, (2) then computes the loss by the given loss function $l_1$, (3) predicts the semantic meaning from the generated sentences, (4) computes the loss by the given loss function $l_2$, and (5) finally trains the models based on the computed loss. Dual Cycle starts from utterances and is symmetrically formulated. The learning algorithm is described in Algorithm 1, which is agnostic to the type of learning objective. Either a supervised or an unsupervised learning objective can be applied at the end of the training cycles, and the whole framework can be trained in an end-to-end manner.

Algorithm 1 Joint dual learning algorithm
1: Input: a mini-batch of n data pairs $\{(x_i, y_i)\}_{i=1}^{n}$, the function of the primal task f, the function of the dual task g, the loss function for the primal task $l_1(\cdot)$, the loss function for the dual task $l_2(\cdot)$, and the learning rates $\gamma_1$, $\gamma_2$;
2: repeat
3: Start from data
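The control flow of the two cycles can be sketched in Python as follows. This is a minimal sketch of steps (1) to (4) only: the template-based `toy_nlg`, the string-splitting `toy_nlu`, and the two mismatch-count losses are hypothetical stand-ins with no trainable parameters, so step (5), the parameter update, is left to the caller.

```python
def primal_cycle(x, y, f, g, l1, l2):
    """Semantics -> sentence -> semantics, returning both intermediate losses."""
    y_hat = f(x)                 # (1) generate a sentence from the frame
    loss_mid = l1(y_hat, y)      # (2) loss on the generated sentence
    x_hat = g(y_hat)             # (3) predict semantics from the generation
    loss_end = l2(x_hat, x)      # (4) loss on the reconstructed semantics
    return x_hat, loss_mid, loss_end

def dual_cycle(x, y, f, g, l1, l2):
    """Sentence -> semantics -> sentence, the symmetric counterpart."""
    x_hat = g(y)
    loss_mid = l2(x_hat, x)
    y_hat = f(x_hat)
    loss_end = l1(y_hat, y)
    return y_hat, loss_mid, loss_end

# Hypothetical toy models and losses, for illustration only.
def toy_nlg(frame):
    return f"{frame['name']} serves {frame['food']} food"

def toy_nlu(sentence):
    name, rest = sentence.split(" serves ")
    return {"name": name, "food": rest[: -len(" food")]}

def token_mismatch(pred, ref):
    p, r = pred.split(), ref.split()
    return sum(a != b for a, b in zip(p, r)) + abs(len(p) - len(r))

def slot_mismatch(pred, ref):
    return len(set(pred.items()) ^ set(ref.items()))

x = {"name": "Bibimbap House", "food": "English"}
y = "Bibimbap House serves English food"
x_hat, primal_mid, primal_end = primal_cycle(x, y, toy_nlg, toy_nlu,
                                             token_mismatch, slot_mismatch)
y_hat, dual_mid, dual_end = dual_cycle(x, y, toy_nlg, toy_nlu,
                                       token_mismatch, slot_mismatch)
```

Dropping `loss_mid` and keeping only `loss_end` corresponds to training on the reconstruction alone, which is the setting that needs no paired target for the intermediate output.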

Learning Objective
As the language understanding task in our experiments is to predict the corresponding slot-value pairs of utterances, which is a multi-label classification problem, we use the binary cross entropy loss as the supervised objective function for NLU. Likewise, the cross entropy loss function is used as the supervised objective for NLG. Taking NLG as an example, the objective of the model is to optimize the conditional probability of predicting word tokens given semantics, p(y | x), so that the difference between the predicted distribution and the target distribution, q(y | x), can be minimized:

$$-\sum_{i=1}^{n} q(y_i \mid x_i) \log p(y_i \mid x_i), \tag{1}$$

where n is the number of samples.
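The two supervised losses can be illustrated with a small, dependency-free Python sketch. The function names, the `eps` smoothing constant, and the choice to average over labels but sum over time steps are assumptions for illustration, not details confirmed by the paper.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross entropy over independent labels (NLU, multi-label)."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return -total / len(probs)

def sequence_cross_entropy(step_probs, target_ids, eps=1e-12):
    """Summed cross entropy over decoding steps (NLG, one target word per step).

    step_probs[t] is the predicted distribution over the vocabulary at step t;
    with a one-hot target, only the probability of the correct word contributes.
    """
    return -sum(math.log(dist[t] + eps) for dist, t in zip(step_probs, target_ids))

# Toy usage: two labels for NLU, a two-step sentence over a three-word vocabulary.
nlu_loss = binary_cross_entropy([0.5, 0.5], [1, 0])
nlg_loss = sequence_cross_entropy([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]], [0, 2])
```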
On the other hand, we can also introduce a reinforcement learning objective into our framework, which aims to maximize the expected value of the accumulated reward. In our experiments, we use the policy gradient (REINFORCE) method (Sutton et al., 2000) for optimization, and the gradient can be written as:

$$\nabla \mathbb{E}[r] = \mathbb{E}[r(y) \nabla \log p(y \mid x)], \tag{2}$$

where the variety of reward r will be elaborated in the next section. The loss function $l_1$ for both tasks could be (1), (2), or a combination of them.
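To make the policy gradient estimate concrete, here is a minimal Monte Carlo sketch in Python for a single categorical decision. It assumes a softmax policy over a handful of discrete outputs and a user-supplied reward function; `reinforce_grad` and its parameters are hypothetical names, and a real NLG policy would apply the same estimator at every decoding step.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, reward_fn, num_samples=1000, seed=0):
    """Estimate the gradient of E[r] w.r.t. the logits as mean(r * grad log p)."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = reward_fn(action)
        for k in range(len(logits)):
            # For a softmax policy: d log p(action) / d logit_k = 1[k == action] - p_k
            indicator = 1.0 if k == action else 0.0
            grad[k] += reward * (indicator - probs[k]) / num_samples
    return grad

# Toy usage: only output 2 is rewarded, so its logit should be pushed up.
grad = reinforce_grad([0.0, 0.0, 0.0], lambda a: 1.0 if a == 2 else 0.0)
```

Ascending along `grad` raises the probability of the rewarded output and lowers the others.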

Reward Function
Different types of rewards reflect various objectives and would result in different behaviors in the learned policy. Hence, we design various reward functions to explore the model behavior, including explicit and implicit feedback.

Explicit Reward
To evaluate the quality of generated sentences, two explicit reward functions are adopted.
Reconstruction Likelihood: In our scenario, if we generate sentences based on given semantics x by the function f and can transform them back to the original semantics perfectly by the function g, it implies that our generated sentences are grounded on the original given semantics. Therefore, we use the reconstruction likelihood at the end of the training cycles as a reward function:

$$\log p(x_i \mid f(x_i; \theta_{x \to y}); \theta_{y \to x}) \ \text{(Primal)}, \qquad \log p(y_i \mid g(y_i; \theta_{y \to x}); \theta_{x \to y}) \ \text{(Dual)}.$$

Automatic Evaluation Score: The goal of most NLP tasks is to predict word tokens correctly, so the loss functions used to train these models focus on the word level, such as cross entropy maximizing the continuous probability distribution of the next correct word given the preceding context.
However, the performance of these models is typically evaluated using discrete metrics.For instance, BLEU and ROUGE measure n-gram overlaps between the generated outputs and the reference texts.
In order to encourage our NLG model to generate better results in terms of the evaluation metrics, we utilize these automatic metrics as rewards to provide sentence-level information. Moreover, we also leverage the F-score in our NLU model to indicate the understanding performance.
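As an example of a metric used as a reward, the sketch below computes an F1 score between a predicted and a reference set of slot-value labels. The function name `f1_reward` and the convention of returning 0 when nothing overlaps are assumptions; BLEU and ROUGE would play the analogous role for NLG and are usually taken from an existing evaluation toolkit rather than reimplemented.

```python
def f1_reward(predicted, reference):
    """F1 between two collections of slot-value labels, usable as a scalar reward."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy usage: one of two predicted labels is correct, one reference label is missed.
reward = f1_reward({"food[English]", "area[riverside]"},
                   {"food[English]", "priceRange[moderate]"})
```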

Implicit Reward
In addition to explicit signals like reconstruction likelihood and the automatic evaluation metrics, a "softer" feedback signal may be informative. For both tasks, we design model-based methods estimating the data distribution in order to provide such soft feedback.
Language Model: For NLG, we utilize pretrained language models, which estimate the whole data distribution, to compute the joint probability of generated sentences, measuring their naturalness and fluency. In this work, we use a simple language model based on RNN (Mikolov et al., 2010; Sundermeyer et al., 2012). The language model is learned by a cross entropy objective in an unsupervised manner:

$$p(y) = \prod_{i=1}^{L} p(y_i \mid y_1, \ldots, y_{i-1}),$$

where $y_{(\cdot)}$ are the words in a sentence y, and L is the length of the utterance.
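The idea of scoring a generated sentence by its joint probability under a language model can be sketched as follows. Note that this sketch deliberately swaps the RNN language model for a count-based bigram model with add-one smoothing, purely to keep the example self-contained; the class name `BigramLM` and the two-sentence corpus are hypothetical.

```python
import math
from collections import Counter

class BigramLM:
    """Count-based bigram model with add-one smoothing (a stand-in for an RNN LM)."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"</s>"}
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens[1:])
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, sentence):
        """Chain-rule log-probability: sum over words of log p(word | previous word)."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + len(self.vocab)
            total += math.log(numerator / denominator)
        return total

lm = BigramLM(["the food is good", "the food is cheap"])
fluent_score = lm.log_prob("the food is good")
scrambled_score = lm.log_prob("good is food the")
```

A higher (less negative) log-probability serves as a larger reward for the generator, so the in-order sentence is preferred over the scrambled one.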
Masked Autoencoder for Distribution Estimation (MADE): For NLU, the output contains a set of discrete labels, which do not fit sequential modeling scenarios such as language models. Each semantic frame x in our work contains the core concept of a certain sentence; furthermore, the slot-value pairs are not independent of each other, because they correspond to the same individual utterance. For example, McDonald's would probably be inexpensive; therefore, the correlation should be taken into account when estimating the joint distribution. Following Su et al. (2019), we measure the soft feedback signal for NLU using a masked autoencoder (Germain et al., 2015) to estimate the joint distribution. By interrupting certain connections between hidden layers, we can enforce the variable unit $x_d$ to depend only on a specific set of variables, not necessarily on $x_{<d}$; eventually we can still obtain the joint distribution by the product rule:

$$p(x) = \prod_{d=1}^{D} p(x_d \mid S_d),$$

where d is the index of the variable unit, D is the total number of variables, and $S_d$ is a specific set of variable units. Because there is no explicit rule specifying the exact dependencies between slot-value pairs in our data, we consider various dependencies by ensembling multiple decompositions, sampling different sets $S_d$ and averaging the results.
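The masking idea can be illustrated with a single-hidden-layer sketch in Python, following the mask construction of Germain et al. (2015): each input gets a position in a sampled ordering, each hidden unit gets a random degree, and connections that would let an output see itself or later variables are zeroed. The function names, layer sizes, and seeds are assumptions for illustration.

```python
import random

def made_masks(num_inputs, num_hidden, seed=0):
    """Sample one variable ordering and build the two binary masks of a MADE layer."""
    rng = random.Random(seed)
    order = list(range(1, num_inputs + 1))
    rng.shuffle(order)                      # order[d] = position of variable d
    hidden_degree = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may read input j only if degree(k) >= order[j].
    mask_in = [[1 if hidden_degree[k] >= order[j] else 0
                for j in range(num_inputs)] for k in range(num_hidden)]
    # Output d may read hidden unit k only if order[d] > degree(k).
    mask_out = [[1 if order[d] > hidden_degree[k] else 0
                 for k in range(num_hidden)] for d in range(num_inputs)]
    return order, mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count input-to-output paths; entry [d][j] > 0 means output d can depend on input j."""
    num_hidden, num_inputs = len(mask_in), len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)] for d in range(num_inputs)]

order, mask_in, mask_out = made_masks(num_inputs=5, num_hidden=16, seed=1)
paths = connectivity(mask_in, mask_out)
```

Calling `made_masks` with different seeds yields different orderings, which is one way to realize the ensemble over sampled sets $S_d$ described above.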

Flexibility of Learning Scheme
The proposed framework offers flexibility in designing and extending the learning scheme, as described below.
Straight-Through Estimator: In many NLP tasks, the learning targets are discrete, so the goal is to predict discrete labels such as words. In practice, we perform argmax operations on the output distribution of learned models to select the most probable candidates. However, such an operation does not have any gradient value, which prevents the networks from being trained via backpropagation. Therefore, it is difficult to directly connect a primal task (NLG in our scenario) and a dual task (NLU in our scenario) and jointly train these two models.
The Straight-Through (ST) estimator (Bengio et al., 2013) is a widely applied method due to its simplicity and effectiveness. The idea of the Straight-Through estimator is to directly use the gradients of discrete samples as the gradients of the distribution parameters. Because discrete samples can be generated as the output of hard threshold functions or of some operations on the continuous distribution, Bengio et al. (2013) explained the estimator by setting the gradients of hard threshold functions to 1. In this work, we introduce the ST estimator for connecting the two models, so that the gradient can be estimated and the two models can be jointly trained in an end-to-end manner.
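The forward and backward behavior of the estimator can be written out by hand in a few lines of Python. This is a sketch with hypothetical function names and a made-up downstream gradient, not the paper's implementation; in an automatic differentiation framework the same effect is commonly obtained by adding the hard sample to the probabilities and subtracting a gradient-detached copy of them.

```python
def st_forward(probs):
    """Forward pass: replace a probability vector by its hard one-hot argmax."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def st_backward(grad_wrt_one_hot):
    """Backward pass: treat argmax as the identity, so the gradient passes unchanged."""
    return list(grad_wrt_one_hot)

probs = [0.1, 0.7, 0.2]
hard = st_forward(probs)               # discrete input for the next model
downstream_grad = [0.5, -1.0, 2.0]     # hypothetical dLoss/d(hard) from the next model
grad_probs = st_backward(downstream_grad)  # used as dLoss/d(probs)
```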
Distribution as Input: In addition to employing the Straight-Through estimator, an alternative solution is to use the continuous distribution as the input of the models. For NLU, the inputs are the word tokens from NLG, so we use the predicted distribution over the vocabulary to perform a weighted sum of word embeddings. For NLG, the model requires semantic frame vectors predicted by NLU as the input condition; in this case, the probability distribution of slot-value pairs predicted by NLU can directly serve as the input vector. By utilizing the output distribution in this way, the two models can be trained jointly in an end-to-end fashion.
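The weighted sum of word embeddings amounts to taking the expected embedding under the predicted distribution, as in this small Python sketch. The three-word vocabulary and two-dimensional embeddings are made-up values for illustration.

```python
def expected_embedding(probs, embeddings):
    """Weighted sum of embedding rows, with the predicted word distribution as weights."""
    dim = len(embeddings[0])
    return [sum(p * row[j] for p, row in zip(probs, embeddings)) for j in range(dim)]

# Hypothetical embedding table: one row per vocabulary word.
embeddings = [[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]]

soft_input = expected_embedding([0.5, 0.5, 0.0], embeddings)
hard_input = expected_embedding([0.0, 0.0, 1.0], embeddings)
```

A one-hot distribution recovers the ordinary embedding lookup, so the discrete case is a special case of this soft input.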
Hybrid Objective: As described before, the proposed approach is agnostic to learning algorithms; in other words, we can apply different learning algorithms at the middle and at the end of the cycles. For example, we could apply supervised learning on NLU in the first half of Primal Cycle and reinforcement learning on NLG to form a hybrid training cycle. Because the two models are trained jointly, the objective applied to one model would potentially affect the behavior of the other. Furthermore, we could also apply multiple objective functions, including supervised or unsupervised ones, to formulate multi-objective learning schemes.
Towards Unsupervised Learning: Because the whole framework can be trained jointly with gradients propagated through both models, we can apply only one objective per learning cycle, at the end of the cycle. Specifically, in Algorithm 1, we can apply only $l_2$ in line 8 and only $l_1$ in line 15. Such flexibility potentially enables us to train the models on unpaired data in an unsupervised manner. For example, we can sample unpaired data x, transform it by the function f, feed the result into the function g, and then compare the predicted results with the original input to compute the loss. Likewise, we can perform the training cycle symmetrically from y. It is also possible to utilize limited paired data together with the autoencoding cycle described above to apply semi-supervised learning.

Experiments
Our models are trained on the official training set and evaluated on the official testing set of the E2E NLG challenge dataset (Novikova et al., 2017). The data preprocessing includes trimming punctuation marks, lemmatization, and turning all words into lowercase. Each possible slot-value pair is treated as an individual label, and the total number of labels is 79. To evaluate the quality of the generated sequences regarding both precision and recall, the evaluation metrics for NLG include BLEU and ROUGE (1, 2, L) scores with multiple references, while the F1 measure is reported for evaluating NLU.

Model
The proposed framework and algorithm are agnostic to model structures. In our experiments, we use a gated recurrent unit (GRU) (Cho et al., 2014) with fully-connected layers at the ends of the GRU for both NLU and NLG, as illustrated in the right part of Figure 1. Thus the models may have semantic frame representations as initial and final hidden states and sentences as the sequential input. In all experiments, we use mini-batch Adam as the optimizer with a batch size of 64. Training runs for 10 epochs without early stopping, the hidden size of the network layers is 200, and the word embedding size is 50.

Results and Analysis
The experimental results are shown in Table 1, where each reported number is averaged over three runs on the official testing set. Row (a) is the baseline where the NLU and NLG models are trained independently and separately by supervised learning. The best performance in Su et al. (2019) is reported in row (b), where NLU and NLG are trained separately by supervised learning with regularization terms exploiting the duality.
To overcome the issue of non-differentiability, we introduce the Straight-Through estimator when connecting the two tasks. Based on our framework, another baseline for comparison is to train the two models jointly by supervised loss and straight-through estimators, of which the performance is reported in row (c). Specifically, the cross entropy loss (1) is utilized in both $l_1$ and $l_2$ in Algorithm 1. Because the models in the proposed framework are trained jointly, the gradients are able to flow through the whole network, and thus the two models directly influence each other's learning. Rows (d)-(f) show the ablation experiments for exploring the interaction between the two models (f and g). For instance, row (e) does not use ST at the output of the NLU module; instead, we feed the continuous distribution over slot-value labels, rather than discrete semantic frames, into NLG as the input. Similarly, instead of discrete word labels, rows (d) and (f) feed the weighted sum over word embeddings based on the output distributions. Since the goal of NLU is to learn a many-to-one function, considering all possibilities would potentially benefit learning (rows (d)-(f)).
On the contrary, the goal of NLG is to learn a one-to-many function; applying the ST estimator at the output of NLU only rather than both sides degrades the performance of generation (row (e)). However, this model achieves an unexpected improvement in understanding by over 10%, and the reason may be the following. The semantic representation is very compact, so slight noise in the semantics space could result in a large difference in the target space and a totally different semantic meaning. Hence the continuous distribution over slot-value pairs may potentially cover unseen mixtures of semantics and further provide rich gradient signals. This could also be explained from the perspective of data augmentation. Moreover, connecting the two models with continuous distributions at both joints achieves further improvement in both NLU and NLG (row (f)). Although row (f) performs best in our experiments and dataset, most AI tasks are classification problems, so the proposed framework with ST estimators provides a general way to connect two tasks with duality. The proposed methods also significantly outperform the previously proposed dual supervised learning framework (Su et al., 2019) on the F1 score of NLU and the BLEU score of NLG, demonstrating the benefit of learning NLU and NLG jointly.

Investigation of Hybrid Objectives
The proposed framework provides the flexibility of applying multiple objectives and different types of learning methods. In our experiments, apart from training the two models jointly by supervised loss, reinforcement learning objectives are also incorporated into the training schemes (rows (g)-(l)). The ultimate goal of reinforcement learning is to maximize the expected reward (equation (2)). In the proposed dual framework, taking the expectation over different distributions reflects different physical meanings. For instance, if we receive a reward at the end of Primal Cycle and the expectation is taken over the output distribution of NLG (middle) or NLU (end), the derivatives of the objective functions would differ:

$$\mathbb{E}[r \nabla \log p(y \mid x; \theta_{x \to y})] \quad (\mathrm{RL}_{mid}),$$
$$\mathbb{E}[r \nabla \log p(x \mid f(x; \theta_{x \to y}); \theta_{y \to x})] \quad (\mathrm{RL}_{end}).$$

The upper one ($\mathrm{RL}_{mid}$) assesses the expected reward earned by the sentences constructed by the policy of NLG, which is a direct signal for the primal task NLG. The lower one ($\mathrm{RL}_{end}$) estimates the expected reward earned by the semantics predicted by the policy of NLU based on the state predicted by NLG; such a reward is another type of feedback.
In the proposed framework, the models of the two tasks are trained jointly, so an objective function simultaneously influences the learning of both models. Different reward designs could guide reinforcement learning agents to different behaviors. To explore the impact of the reinforcement learning signal, various rewards are applied on top of the joint framework (row (f)):
1. token-level likelihood (rows (g) and (h)),
2. sentence/frame-level automatic evaluation metrics (rows (i) and (j)),
3. corpus-level joint distribution estimation (rows (k) and (l)).
In other words, the models in rows (g)-(l) have both supervised and reinforcement learning signals. The results show that token-level feedback may not provide extra guidance (rows (g) and (h)), that directly optimizing towards the evaluation metrics used at the testing phase benefits learning in both tasks and performs best (rows (i) and (j)), and that the models utilizing learning-based joint distribution estimation also obtain improvement (row (k)). In sum, the explicit feedback is more useful for boosting the NLG performance, because the reconstruction and automatic scores directly reflect the generation quality. However, the implicit feedback is more informative for improving NLU, where MADE captures the salient information for building better NLU models. The results align well with the findings in Su et al. (2019).

Qualitative Analysis
Tables 2 and 3 show the selected examples of the proposed model and the baseline model in Primal and Dual Cycle. As depicted in Algorithm 1, Primal Cycle is designed to start from semantic frames x, then transform the representation by the NLG model f, and finally feed the generated sentences into the NLU model g and compare the results with the original input to compute the loss. In the example of Primal Cycle (Table 2), we can find that $g(f(x; \theta_{x \to y}); \theta_{y \to x})$ equals x, which means the proposed method can successfully restore the original semantics. On the other hand, Dual Cycle starts from natural language utterances; from the generated results (Table 3), we can find that our proposed method does not lose semantic concepts in the middle of the training cycle ($g(y; \theta_{y \to x}) \leftrightarrow x$).
Based on the qualitative analysis, we can find that by incorporating the duality into the objective and training jointly, the proposed framework can improve the performance of NLU and NLG simultaneously.

Future Work
Though theoretically sound and empirically validated, the formulation of the proposed framework depends on the characteristics of the data. Not every NLU dataset is suitable for being used as an NLG task, and vice versa. Moreover, though the proposed framework provides the possibility of training the two models in a fully unsupervised manner, we found it unstable and hard to optimize in our experiments. Thus, better dual learning algorithms, leveraging pretrained models, and other learning techniques such as adversarial learning are worth exploring to improve our framework; we leave these to future work.

Conclusion
This paper proposes a general learning framework leveraging the duality between language understanding and generation, providing the flexibility of incorporating supervised and unsupervised learning algorithms to jointly train the two models. Such a framework provides a potential method towards unsupervised learning of both language understanding and generation models by considering their data distribution. The experiments on the benchmark dataset demonstrate that the proposed approach is capable of boosting the performance of both NLU and NLG models.

Figure 1: Left: The proposed joint dual learning framework, which comprises Primal Cycle and Dual Cycle. The framework is agnostic to learning objectives and the algorithm is detailed in Algorithm 1. Right: In our experiments, the models for NLG and NLU are a GRU unit accompanied with a fully-connected layer.

Table 1: The NLU performance reported on micro-F1 and the NLG performance reported on BLEU, ROUGE-1, ROUGE-2, and ROUGE-L of models (%).

Table 2: An example of Primal Cycle, where the baseline model is row (a) in Table 1.

Table 3: An example of Dual Cycle, where the baseline model is row (a) in Table 1.