Dual Supervised Learning for Natural Language Understanding and Generation

Natural language understanding (NLU) and natural language generation (NLG) are both critical research topics in the NLP and dialogue fields. NLU aims to extract the core semantic meaning from given utterances, while NLG is its opposite: its goal is to construct corresponding sentences based on given semantics. However, such a dual relationship has not been investigated in the literature. This paper proposes a novel learning framework for natural language understanding and generation built on dual supervised learning, providing a way to exploit the duality. Preliminary experiments show that the proposed approach boosts the performance of both tasks, demonstrating the effectiveness of the dual relationship.


Introduction
Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in the artificial intelligence and natural language processing areas. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. The recent advance of deep learning has inspired many applications of neural dialogue systems (Wen et al., 2017; Bordes et al., 2017; Dhingra et al., 2017; Li et al., 2017). A typical dialogue system pipeline can be divided into several parts: 1) a speech recognizer that transcribes a user's speech input into text, 2) a natural language understanding (NLU) module that classifies the domain and associated intents and fills slots to form a semantic frame (Chi et al., 2017; Chen et al., 2017; Zhang et al., 2018; Su et al., 2018c, 2019), 3) a dialogue state tracker (DST) that predicts the current dialogue state in a multi-turn conversation, 4) a dialogue policy that determines the next system action given the current state (Peng et al., 2018; Su et al., 2018a), and 5) a natural language generator (NLG) that outputs a response given the action semantic frame (Wen et al., 2015; Su et al., 2018b; Su and Chen, 2018).

[Figure 1: NLU and NLG emerge as a dual form. The utterance "McDonald's is a cheap restaurant nearby the station." corresponds to the semantic frame RESTAURANT="McDonald's", PRICE="cheap", LOCATION="nearby the station"; NLU maps the utterance to the frame, while NLG maps the frame back to the utterance.]

Many artificial intelligence tasks come with a dual form; that is, we can directly swap the input and the target of a task to formulate another task. Machine translation is a classic example (Wu et al., 2016): translating from English to Chinese has a dual task of translating from Chinese to English; automatic speech recognition (ASR) and text-to-speech (TTS) also exhibit structural duality (Tjandra et al., 2017). Previous work first exploited the duality of such task pairs and proposed supervised (Xia et al., 2017) and unsupervised (reinforcement learning) (He et al., 2016) training schemes. Recent studies magnified the importance of the duality by boosting the performance of both tasks through its exploitation.
NLU aims to extract core semantic concepts from given utterances, while the goal of NLG is to construct corresponding sentences based on given semantics. In other words, understanding and generating sentences form a dual problem pair, as shown in Figure 1. In this paper, we introduce a novel training framework for NLU and NLG based on dual supervised learning (Xia et al., 2017), which is the first attempt at exploiting the duality of NLU and NLG. The experiments show that the proposed approach improves the performance of both tasks.

Proposed Framework
This section first describes the problem formulation and then introduces the core training algorithm along with the proposed methods for estimating the data distribution.
Assume two spaces, the semantics space $\mathcal{X}$ and the natural language space $\mathcal{Y}$, and $n$ data pairs $\{(x_i, y_i)\}_{i=1}^{n}$. The goal of NLG is to generate corresponding utterances based on given semantics; in other words, the task is to learn a mapping function $f(x; \theta_{x \to y})$ that transforms semantic representations into natural language. Conversely, NLU captures the core meaning of utterances, finding a function $g(y; \theta_{y \to x})$ that predicts semantic representations given natural language. A typical strategy for these optimization problems is maximum likelihood estimation (MLE) of the parameterized conditional distributions with the learnable parameters $\theta_{x \to y}$ and $\theta_{y \to x}$.
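As a toy sketch of this formulation, the two directions can be viewed as two conditional distributions estimated from the same paired data. The snippet below uses simple count-based MLE in place of the paper's neural models; the slot values and utterances are made up for illustration:

```python
from collections import Counter, defaultdict

# Tiny paired corpus of (semantic frame, utterance); values are illustrative.
pairs = [("PRICE=cheap", "it is cheap"),
         ("PRICE=cheap", "a cheap place"),
         ("PRICE=expensive", "it is pricey")]

def mle_conditional(pairs):
    """Estimate P(target | source) by counting (the MLE for a tabular model)."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {src: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for src, cnt in counts.items()}

p_y_given_x = mle_conditional(pairs)                        # NLG direction: x -> y
p_x_given_y = mle_conditional([(y, x) for x, y in pairs])   # NLU direction: y -> x
```

The neural models in the paper replace the count tables with parameterized distributions, but the two training directions are exactly these two conditionals.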

Dual Supervised Learning
Considering the duality between the two tasks in a dual problem, it is intuitive to bridge the bidirectional relationship from a probabilistic perspective. If the models of the two tasks are optimal, we have the probabilistic duality

$$P(x)\,P(y \mid x; \theta_{x \to y}) = P(y)\,P(x \mid y; \theta_{y \to x}) = P(x, y),$$

where $P(x)$ and $P(y)$ are the marginal distributions of the data. This condition reflects the parallel, bidirectional relationship between the two tasks in the dual problem. Although standard supervised learning with respect to a given loss function is a straightforward way to perform MLE, it does not consider the relationship between the two tasks. Xia et al. (2017) exploited the duality of dual problems to introduce a new learning scheme, which explicitly imposes the empirical probabilistic duality on the objective function. The training strategy, called dual supervised learning, is based on standard supervised learning and incorporates the probabilistic duality constraint; the training objective is therefore extended to a multi-objective optimization problem:

$$\begin{aligned} &\min_{\theta_{x\to y}} \; \mathbb{E}\left[ l_1(f(x; \theta_{x\to y}), y) \right],\\ &\min_{\theta_{y\to x}} \; \mathbb{E}\left[ l_2(g(y; \theta_{y\to x}), x) \right],\\ &\text{s.t. } P(x)\,P(y \mid x; \theta_{x\to y}) = P(y)\,P(x \mid y; \theta_{y\to x}), \end{aligned}$$

where $l_1$ and $l_2$ are the given loss functions. This constrained optimization problem can be solved by introducing Lagrange multipliers to incorporate the constraint:

$$\begin{aligned} &\min_{\theta_{x\to y}} \; \mathbb{E}\left[ l_1(f(x; \theta_{x\to y}), y) \right] + \lambda_{x\to y}\, l_{\text{duality}},\\ &\min_{\theta_{y\to x}} \; \mathbb{E}\left[ l_2(g(y; \theta_{y\to x}), x) \right] + \lambda_{y\to x}\, l_{\text{duality}}, \end{aligned}$$

where $\lambda_{x\to y}$ and $\lambda_{y\to x}$ are the Lagrange parameters and the constraint is formulated as

$$l_{\text{duality}} = \left( \log \hat{P}(x) + \log P(y \mid x; \theta_{x\to y}) - \log \hat{P}(y) - \log P(x \mid y; \theta_{y\to x}) \right)^2.$$

Now the entire objective can be viewed as standard supervised learning with an additional regularization term that accounts for the duality between the tasks; the learning scheme thus minimizes a weighted combination of the original loss term and the regularization term. Note that the true marginal distributions of the data, $P(x)$ and $P(y)$, are often intractable, so we replace them with the approximated empirical marginal distributions $\hat{P}(x)$ and $\hat{P}(y)$.
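A minimal sketch of the duality regularizer, assuming all quantities are supplied as log-probabilities; the function names are illustrative, not from the paper's code:

```python
def duality_loss(log_px, log_py_given_x, log_py, log_px_given_y):
    """Squared gap between the two factorizations of log P(x, y);
    zero exactly when P(x)P(y|x) = P(y)P(x|y)."""
    gap = log_px + log_py_given_x - log_py - log_px_given_y
    return gap ** 2

def dual_supervised_losses(nll_nlg, nll_nlu,
                           log_px, log_py_given_x, log_py, log_px_given_y,
                           lam_xy=0.1, lam_yx=0.1):
    """Each task's supervised loss plus the shared duality regularizer,
    weighted by the Lagrange parameters."""
    reg = duality_loss(log_px, log_py_given_x, log_py, log_px_given_y)
    return nll_nlg + lam_xy * reg, nll_nlu + lam_yx * reg
```

In training, the regularizer is computed per batch from the two models' conditional log-likelihoods and the estimated empirical marginals, and added to each task's loss before backpropagation.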

Distribution Estimation as Autoregression
With the above formulation, the remaining problem is how to estimate the empirical marginal distribution $\hat{P}(\cdot)$. To accurately estimate the data distribution, the data's properties should be considered, because different data types have different structural natures. For example, natural language has sequential structure and temporal dependencies, while other types of data may not. Therefore, we design a specific method of estimating the distribution for each data type based on expert knowledge.
From a probabilistic perspective, we can decompose any data distribution $p(x)$ into a product of nested conditional probabilities:

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}), \qquad (1)$$

where $x$ can be any data type and $d$ indexes a variable unit.
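This chain-rule decomposition can be made concrete with a small sketch: the joint log-probability is simply the sum of the conditional log-probabilities, regardless of the conditioning structure (illustrative code, not tied to any particular model):

```python
import math

def joint_log_prob(conditional_probs):
    """Chain rule: log p(x) = sum_d log p(x_d | x_{<d}), given the
    sequence of conditional probabilities p(x_d | x_{<d})."""
    return sum(math.log(p) for p in conditional_probs)
```

For instance, three binary variables that are each 0.5 likely given their predecessors yield a joint probability of 0.125.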

Language Modeling
Natural language has an intrinsic sequential nature; therefore it is intuitive to leverage the autoregressive property to learn a language model.In this work, we learn the language model based on recurrent neural networks (Mikolov et al., 2010;Sundermeyer et al., 2012) by the cross entropy objective in an unsupervised manner.
$$\log p(y) = \sum_{i=1}^{L} \log p(y_i \mid y_{<i}),$$

where $y_i$ are the words in the sentence $y$ and $L$ is the sentence length.
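The paper's language model is an RNN; as a minimal stand-in, the same autoregressive factorization can be illustrated with a count-based bigram model (a sketch under that simplifying assumption):

```python
import math
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count-based bigram LM with sentence boundary tokens."""
    counts = defaultdict(Counter)
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return {prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for prev, cnt in counts.items()}

def sentence_log_prob(lm, sent):
    """log p(y) = sum_i log p(y_i | y_{i-1}) under the bigram assumption."""
    tokens = ["<s>"] + sent + ["</s>"]
    return sum(math.log(lm[prev][cur]) for prev, cur in zip(tokens, tokens[1:]))
```

An RNN LM generalizes this by conditioning each word on the full history $y_{<i}$ through its hidden state rather than on the previous word alone.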

Masked Autoencoder
Even though the product rule in (1) enables us to decompose any probability distribution into a product of conditional probabilities, how we decompose the distribution reflects a specific physical meaning. For example, language modeling outputs the probability distribution over the vocabulary for the $i$-th word $y_i$ by taking only the preceding word sequence $y_{<i}$. Natural language has an intrinsic sequential structure and temporal dependency, so modeling the joint distribution of words in a sequence with such an autoregressive property is reasonable. However, slot-value pairs in a semantic frame have no single directional relationship among them; they describe the same sentence in parallel, so treating a semantic frame as a sequence of slot-value pairs is not suitable. Furthermore, slot-value pairs are not independent, because the pairs in a semantic frame correspond to the same individual utterance; for example, French food would probably cost more. Therefore, the correlation should be taken into account when estimating the joint distribution. Considering these issues, modeling the joint distribution of flat semantic frames should leverage the various dependencies between slot-value semantics. In this work, we propose to utilize a masked autoencoder for distribution estimation (MADE) (Germain et al., 2015). By zeroing certain connections, we can force the variable unit $x_d$ to depend only on a specific set of variables, not necessarily on $x_{<d}$, while still obtaining the joint distribution by the product rule:

$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_{S_d}),$$

where $S_d$ is a specific set of variable units.
In practice, we elementwise-multiply each weight matrix with a binary mask matrix $M$ to interrupt some connections, as illustrated in Figure 2. To impose the autoregressive property, we first assign each hidden unit $k$ an integer $m(k)$ ranging from 1 to $D-1$ inclusively; for the input and output layers, we assign each unit a distinct number from 1 to $D$. The binary mask matrices can then be built as follows:

$$M^{W^l}_{k,k'} = \begin{cases} 1 & \text{if } m^l(k) \ge m^{l-1}(k'),\\ 0 & \text{otherwise}, \end{cases} \qquad M^{V}_{d,k} = \begin{cases} 1 & \text{if } d > m^{L}(k),\\ 0 & \text{otherwise}, \end{cases}$$

where $l$ indexes a hidden layer and $L$ indexes the last hidden layer below the output. With the constructed mask matrices, the masked autoencoder is able to estimate the joint distribution as autoregression. Because no explicit rule specifies the exact dependencies between slot-value pairs in our data, we consider various dependencies via an ensemble of multiple decompositions, that is, by sampling different sets $S_d$.
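The mask construction can be sketched in a few lines of NumPy for a single hidden layer (the degree assignments and layer sizes below are illustrative, not the paper's settings). Composing the masked layers exposes which inputs each output can see, so the autoregressive property can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 16   # data dimension and hidden-layer width (illustrative sizes)

# Degrees: input/output units get 1..D; hidden units get values in 1..D-1.
m_in = np.arange(1, D + 1)
m_hid = rng.integers(1, D, size=H)

# Hidden-layer mask: hidden unit k may connect to input k'
# iff m_hid[k] >= m_in[k'].
M_hidden = (m_hid[:, None] >= m_in[None, :]).astype(float)   # shape (H, D)

# Output mask: output unit d may connect to hidden unit k
# iff its degree exceeds m_hid[k].
M_out = (m_in[:, None] > m_hid[None, :]).astype(float)       # shape (D, H)

# Composing the masked layers shows which inputs each output depends on:
# connectivity[d, d'] > 0 only if the degree of input d' is below that of output d.
connectivity = M_out @ M_hidden                               # shape (D, D)
```

Resampling the hidden degrees `m_hid` (and permuting the input/output ordering) yields a different decomposition, which is how the ensemble over dependency structures is formed.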

Experiments
To evaluate the effectiveness of the proposed framework, we conduct experiments; the settings and the analysis of the results are described below.

Settings
The experiments are conducted on the benchmark E2E NLG challenge dataset (Novikova et al., 2017), a crowd-sourced dataset of 50k instances in the restaurant domain. Our models are trained on the official training set and verified on the official testing set. Each instance is a pair of a semantic frame, containing specific slots and corresponding values, and an associated natural language utterance expressing the given semantics.
The data preprocessing includes trimming punctuation marks, lemmatization, and turning all words into lowercase.
Although the original dataset is designed for NLG, where the goal is to generate sentences based on given slot-value pairs, we additionally formulate an NLU task: predicting slot-value pairs based on the utterances, which is a multi-label classification problem. Each possible slot-value pair is treated as an individual label, and the total number of labels is 79. To evaluate the quality of the generated sequences in terms of both precision and recall, the NLG evaluation metrics include BLEU and ROUGE (1, 2, L) scores with multiple references, while the F1 score is measured for the NLU results.
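For reference, micro-averaged F1 over multi-label predictions pools true positives, false positives, and false negatives across all instances before computing precision and recall. A minimal sketch (the label strings are illustrative):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label outputs given as sets of labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Micro averaging weights each label decision equally, which suits this setting where the 79 slot-value labels have very uneven frequencies.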

Model Details
The model architecture for both NLG and NLU is a gated recurrent unit (GRU) (Cho et al., 2014) with two identical fully-connected layers at the two ends of the GRU. The model is thus symmetric: the semantic frame representation serves as the initial and final hidden states, and the sentence serves as the sequential input.
In all experiments, we use mini-batch Adam as the optimizer with a batch size of 64; 10 training epochs are performed without early stopping; the hidden size of the network layers is 200; and the word embeddings have size 50 and are trained in an end-to-end fashion.

Results and Analysis
The experimental results are shown in Table 1, where each reported number is averaged over three runs. Row (a) is the baseline that trains NLU and NLG separately and independently, and rows (b)-(d) are the results of the proposed approach with different Lagrange parameters.
The proposed approach incorporates the probabilistic duality into the objective as a regularization term. To examine its effectiveness, we control the intensity of the regularization by adjusting the Lagrange parameters. The results (rows (b)-(d)) show that the proposed method outperforms the baseline on all automatic evaluation metrics. Furthermore, the performance improves more with stronger regularization (row (b)), demonstrating the importance of leveraging the duality.
In this paper, we design methods for estimating the marginal distribution of the data in the NLG and NLU tasks: language modeling is utilized for sequential data (natural language utterances), while the masked autoencoder is used for flat representations (semantic frames). The proposed method for estimating the distribution of semantic frames accounts for complex and implicit dependencies between semantics via an ensemble of multiple decompositions of the joint distribution. In our experiments, the empirical marginal distribution is the average over the results from 10 different masks and orders; in other words, 10 types of dependencies are modeled. Row (e) can be viewed as an ablation test, where the marginal distribution of semantic frames is estimated by treating slot-value pairs as independent of each other and is statistically computed from the training set. Its performance is worse than that of the models that capture the dependencies, demonstrating the importance of considering the nature of the input data and of modeling the data distribution via the masked autoencoder.
We further analyze the understanding and generation results against the baseline model. In some cases, we find that with the proposed learning scheme, our NLU model extracts the semantics of utterances better and our NLG model generates sentences with richer information. In sum, the proposed approach is capable of improving the performance of both NLU and NLG on the benchmark data, where the exploitation of the duality and the way of estimating the distribution are demonstrated to be important.

Conclusion
This paper proposes a novel training framework for natural language understanding and generation based on dual supervised learning, which is the first to exploit the duality between NLU and NLG and introduces it into the learning objective as a regularization term. Moreover, expert knowledge is incorporated to design suitable approaches for estimating the data distribution. The proposed methods demonstrate their effectiveness by boosting the performance of both tasks simultaneously in the benchmark experiments.

Figure 2: The illustration of the masked autoencoder for distribution estimation (MADE).

Table 1: The NLU performance reported in micro-F1 and the NLG performance reported in BLEU, ROUGE-1, ROUGE-2, and ROUGE-L (%).