A Generative Model for Joint Natural Language Understanding and Generation

Natural language understanding (NLU) and natural language generation (NLG) are two fundamental and related tasks with opposite objectives in building task-oriented dialogue systems: NLU tackles the transformation from natural language to formal representations, whereas NLG does the reverse. A key to success in either task is parallel training data, which is expensive to obtain at a large scale. In this work, we propose a generative model which couples NLU and NLG through a shared latent variable. This approach allows us to explore both spaces of natural language and formal representations, and facilitates information sharing through the latent space to eventually benefit NLU and NLG. Our model achieves state-of-the-art performance on two dialogue datasets with both flat and tree-structured formal representations. We also show that the model can be trained in a semi-supervised fashion by utilising unlabelled data to boost its performance.


Introduction
Natural language understanding (NLU) and natural language generation (NLG) are two fundamental tasks in building task-oriented dialogue systems. In a modern dialogue system, an NLU module first converts a user utterance, provided by an automatic speech recognition model, into a formal representation. The representation is then consumed by a downstream dialogue state tracker to update a belief state which represents an aggregated user goal. Based on the current belief state, a policy network decides the formal representation of the system response. This is finally used by an NLG module to generate the system response (Young et al., 2010).
* Work done while the author was an intern at Apple.

Figure 1: Generation and inference process in our model, and how NLU and NLG are achieved. x and y denote utterances and formal representations respectively; z represents the shared latent variable for x and y.

It can be observed that NLU and NLG have opposite goals: NLU aims to map natural language
to formal representations, while NLG generates utterances from their semantics. In the research literature, NLU and NLG are well studied as separate problems. State-of-the-art NLU systems tackle the task as classification (Zhang and Wang, 2016) or as structured prediction or generation (Damonte et al., 2019), depending on the formal representations, which can be flat slot-value pairs (Henderson et al., 2014), first-order logical forms (Zettlemoyer and Collins, 2012), or structured queries (Yu et al., 2018; Pasupat et al., 2019). On the other hand, approaches to NLG vary from pipelined approaches subsuming content planning and surface realisation (Stent et al., 2004) to more recent end-to-end sequence generation (Wen et al., 2015; Dušek et al., 2020).
However, the duality between NLU and NLG has been less explored. In fact, both tasks can be treated as a translation problem: NLU converts natural language to formal language while NLG does the reverse. Both tasks require a substantial amount of utterance and representation pairs to succeed, and such data is costly to collect due to the complexity of the annotation involved. Although unannotated data for either natural language or formal representations can be easily obtained, it is less clear how it can be leveraged, as the two languages stand in different spaces.
In this paper, we propose a generative model for Joint natural language Understanding and Generation (JUG), which couples NLU and NLG with a latent variable representing the shared intent between natural language and formal representations. We aim to learn the association between two discrete spaces through a continuous latent variable which facilitates information sharing between two tasks. Moreover, JUG can be trained in a semi-supervised fashion, which enables us to explore each space of natural language and formal representations when unlabelled data is accessible. We examine our model on two dialogue datasets with different formal representations: the E2E dataset (Novikova et al., 2017) where the semantics are represented as a collection of slot-value pairs; and a more recent weather dataset (Balakrishnan et al., 2019) where the formal representations are tree-structured. Experimental results show that our model improves over standalone NLU/NLG models and existing methods on both tasks; and the performance can be further boosted by utilising unlabelled data.

Model
Our key assumption is that there exists an abstract latent variable z underlying a pair of utterance x and formal representation y. In our generative model, this abstract intent guides the standard conditional generation of either NLG or NLU (Figure 1a). Meanwhile, z can be inferred from either the utterance x or the formal representation y (Figure 1b). This means performing NLU requires us to infer z from x, after which the formal representation y is generated conditioned on both z and x (Figure 1c), and vice-versa for NLG (Figure 1d). In the following, we explain the model details, starting with NLG.

NLG
As mentioned above, the task of NLG requires us to infer z from y, and then generate x using both z and y. We choose the posterior distribution q(z|y) to be Gaussian. The task of inferring z can then be recast as computing the mean µ and standard deviation σ of the Gaussian distribution using an NLG encoder. To do this, we use a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) to encode the formal representation y, which is linearised and represented as a sequence of symbols. After encoding, we obtain a list of hidden vectors H, each the concatenation of forward and backward LSTM states. These hidden vectors are then average-pooled into h̄ and passed through two feed-forward neural networks to compute the mean µ_{y,z} and standard deviation σ_{y,z} vectors of the posterior q(z|y):

    µ_{y,z} = W_µ h̄ + b_µ,    σ_{y,z} = exp(W_σ h̄ + b_σ)    (1)

where W and b represent neural network weights and biases. The latent vector z can then be sampled from the approximated posterior using the re-parameterisation trick of Kingma and Welling (2013):

    z = µ_{y,z} + σ_{y,z} ⊙ ε,    ε ∼ N(0, I)    (2)

The final step is to generate natural language x based on the latent variable z and the formal representation y. We use an LSTM decoder which relies on both z and y via an attention mechanism (Bahdanau et al., 2014). At each time step, the decoder computes:

    g^x_i = LSTM(x_{i−1} ⊕ z ⊕ c_i, g^x_{i−1}),    p(x_i) = softmax(W_o g^x_i + b_o)    (3)

where ⊕ denotes concatenation, c_i is the attention context over H, x_{i−1} is the word vector of the input token, g^x_i is the corresponding decoder hidden state, and p(x_i) is the output token distribution at time step i.
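The pooling, posterior, and sampling steps above can be sketched in plain Python. This is a minimal illustration under our own assumptions (the helper names and the log-σ parameterisation are not from the paper), not the authors' implementation:

```python
import math
import random

random.seed(0)

def linear(W, b, h):
    # affine map: W is a list of weight rows, b a bias vector, h an input vector
    return [sum(w * x for w, x in zip(row, h)) + bj for row, bj in zip(W, b)]

def posterior_params(H, W_mu, b_mu, W_sig, b_sig):
    # average-pool the bi-LSTM hidden vectors, then apply two feed-forward heads
    h_bar = [sum(col) / len(H) for col in zip(*H)]
    mu = linear(W_mu, b_mu, h_bar)            # mean of q(z|y)
    log_sigma = linear(W_sig, b_sig, h_bar)   # log standard deviation of q(z|y)
    return mu, log_sigma

def reparameterise(mu, log_sigma):
    # z = mu + sigma * eps with eps ~ N(0, I) (Kingma and Welling, 2013)
    return [m + math.exp(ls) * random.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]
```

Sampling through this deterministic transform of ε keeps the objective differentiable with respect to the encoder parameters.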

NLU
NLU performs the reverse procedure of NLG. First, an NLU encoder infers the latent variable z from the utterance x. The encoder uses a bi-directional LSTM to convert the utterance into a list of hidden states. These hidden states are pooled and passed through feed-forward neural networks to compute the mean µ_{x,z} and standard deviation σ_{x,z} of the posterior q(z|x). This procedure follows Equation 1 in NLG.
However, note that a subtle difference between natural language and formal language is that the former is ambiguous while the latter is precisely defined. This makes NLU a many-to-one mapping problem, whereas NLG is one-to-many. To better reflect the fact that the NLU output requires less variance, when decoding we choose the latent vector z in NLU to be the mean vector µ_{x,z}, instead of sampling it from q(z|x) as in Equation 2. After the latent vector is obtained, the formal representation y is predicted from both z and x using an NLU decoder. Since the space of y depends on the formal language construct, we consider two common scenarios in dialogue systems. In the first scenario, y is represented as a set of slot-value pairs, e.g., {food type=British, area=north} in the restaurant search domain (Mrkšić et al., 2017). The decoder here consists of several classifiers, one for each slot, to predict the corresponding values. Each classifier is modelled by a 1-layer feed-forward neural network that takes z as input:

    p(y_s) = softmax(W_s z + b_s)    (4)

where p(y_s) is the predicted value distribution of slot s. In the second scenario, y is a tree-structured formal representation (Banarescu et al., 2013). We then generate y as a linearised token sequence using an LSTM decoder relying on both z and x via the standard attention mechanism (Bahdanau et al., 2014). The decoding procedure follows exactly Equation 3.
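In the slot-value scenario, the per-slot classifiers amount to one softmax over each slot's value inventory. A toy sketch, where all weights, slot names, and the helper name are hypothetical:

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_slots(z, slot_params):
    # slot_params: slot name -> (weight rows W_s, bias b_s, candidate values);
    # each slot gets its own 1-layer feed-forward classifier over z
    prediction = {}
    for slot, (W, b, values) in slot_params.items():
        logits = [sum(w * zi for w, zi in zip(row, z)) + bj
                  for row, bj in zip(W, b)]
        probs = softmax(logits)
        prediction[slot] = values[probs.index(max(probs))]
    return prediction
```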

Model Summary
One flexibility of the JUG model comes from the fact that there are two ways to infer the shared latent variable z, through either x or y; the inferred z can then aid the generation of both x and y. In the next section, we show how this shared latent variable enables JUG to explore unlabelled x and y, while aligning the learned meanings inside the latent space.

Optimisation
We now describe how JUG can be optimised with a pair of x and y (§3.1), and also with unpaired x or y (§3.2). We discuss the choice of prior in the JUG objectives in §3.3. A combined objective can then be derived for semi-supervised learning: a practical scenario where we have a small set of labelled data but abundant unlabelled data (§3.4).

Optimising p(x, y)
Given a pair of utterance x and formal representation y, our objective is to maximise the log-likelihood of the joint probability p(x, y):

    log p(x, y) = log ∫_z p(x, y, z) dz

The optimisation task is not directly tractable since it requires us to marginalise out the latent variable z. However, it can be solved by following the standard practice of neural variational inference (Kingma and Welling, 2013). An objective based on the variational lower bound can be derived as

    L_x = E_{q(z|x)}[log p(y|x, z)] + E_{q(z|x)}[log p(x|y, z)] − KL(q(z|x) ‖ p(z))

where the first term on the right-hand side is the NLU model; the second term is the reconstruction of x; and the last term denotes the Kullback–Leibler divergence between the approximate posterior q(z|x) and the prior p(z). We defer the discussion of the prior to Section 3.3 and detailed derivations to the Appendix.
The symmetry between utterance and semantics offers an alternative way of inferring the posterior, through the approximation q(z|y). Analogously, we can derive a variational optimisation objective:

    L_y = E_{q(z|y)}[log p(x|y, z)] + E_{q(z|y)}[log p(y|x, z)] − KL(q(z|y) ‖ p(z))

where the first term is the NLG model; the second term is the reconstruction of y; and the last term denotes the KL divergence. It can be observed that our model has two posterior inference paths, from either x or y, and also two generation paths. All paths can be optimised.
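For diagonal Gaussian posteriors, the KL term in these bounds has a closed form. As an illustration only — Section 3.3 replaces the standard-normal prior with a learned one — the KL against N(0, I) can be computed as:

```python
import math

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    return sum(0.5 * (math.exp(2 * ls) + m * m - 1.0 - 2 * ls)
               for m, ls in zip(mu, log_sigma))
```

The divergence is zero exactly when the posterior matches the prior, and grows as the posterior mean or variance drifts away.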

Optimising p(x) or p(y)
Additionally, when we have access to unlabelled utterances x (or formal representations y), the optimisation objective of JUG is the marginal likelihood p(x) (or p(y)):

    log p(x) = log Σ_y ∫_z p(x, y, z) dz

Note that both z and y are unobserved in this case. We can develop an objective based on the variational lower bound for the marginal:

    U_x = E_{q(z|x) p(y|x,z) q(z′|y)}[log p(x|y, z′)] − KL(q(z|x) ‖ p(z))

where the first term is the auto-encoder reconstruction of x with a cascaded NLU-NLG path, and the second term is the KL divergence which regularises the approximated posterior distribution. Detailed derivations can be found in the Appendix. When computing the reconstruction term of x, we first run the NLU model to obtain a prediction of y, from which we run NLG to reconstruct x. The full information flow is (x → z → y → z′ → x). Connections can be drawn with recent work which uses back-translation to augment training data for machine translation (Sennrich et al., 2016; He et al., 2016). Unlike back-translation, the presence of the latent variable in our model requires us to sample z along the NLU-NLG path. The introduced stochasticity allows the model to explore a larger area of the data manifold.
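The cascaded reconstruction of x can be sketched as a single Monte-Carlo sample through the NLU-NLG path. The callables passed in stand for the model components and are purely illustrative:

```python
import math
import random

random.seed(1)

def sample_z(mu, log_sigma):
    # one re-parameterised sample from a diagonal Gaussian
    return [m + math.exp(ls) * random.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

def unsup_objective_x(x, nlu_encode, nlu_decode, nlg_encode, nlg_logprob, kl_term):
    # single-sample lower bound on log p(x) via the x -> z -> y -> z' -> x path
    mu, log_sigma = nlu_encode(x)        # q(z|x)
    z = sample_z(mu, log_sigma)
    y = nlu_decode(z, x)                 # pseudo formal representation
    mu2, log_sigma2 = nlg_encode(y)      # q(z'|y)
    z2 = sample_z(mu2, log_sigma2)
    return nlg_logprob(x, z2, y) - kl_term(mu, log_sigma)
```

With real models the reconstruction term would be a sum of per-token log-probabilities, and the KL term would use the prior of Section 3.3.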
The above describes the objective when we have unlabelled x. We can derive a similar objective for leveraging unlabelled y:

    U_y = E_{q(z|y) p(x|y,z) q(z′|x)}[log p(y|x, z′)] − KL(q(z|y) ‖ p(z))

where the first term is the auto-encoder reconstruction of y with a cascaded NLG-NLU path. The full information flow here is (y → z → x → z′ → y).

Choice of Prior
The objectives described in Sections 3.1 and 3.2 require us to match an approximated posterior (either q(z|x) or q(z|y)) to a prior p(z) that reflects our belief. A common choice of p(z) in the research literature is the Normal distribution (Kingma and Welling, 2013). However, it should be noted that even if we match both q(z|x) and q(z|y) to the same prior, it does not guarantee that the two inferred posteriors are close to each other, which is a desired property of the shared latent space.
To better enforce this property, we propose a novel choice of prior: when the posterior is inferred from x (i.e., q(z|x)), we choose the parameterised distribution q(z|y) as our prior belief p(z). Similarly, when the posterior is inferred from y (i.e., q(z|y)), we define p(z) to be q(z|x). This approach directly pulls q(z|x) and q(z|y) closer to ensure a shared latent space.
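With this choice, the KL term becomes a divergence between two diagonal Gaussians, which is still available in closed form. A minimal sketch (function name is ours):

```python
import math

def kl_diag_gaussians(mu_q, ls_q, mu_p, ls_p):
    # KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal covariances:
    # sum over dims of log(sigma_p / sigma_q)
    #   + (sigma_q^2 + (mu_q - mu_p)^2) / (2 * sigma_p^2) - 0.5
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, ls_q, mu_p, ls_p):
        s2q, s2p = math.exp(2 * lq), math.exp(2 * lp)
        total += (lp - lq) + (s2q + (mq - mp) ** 2) / (2 * s2p) - 0.5
    return total
```

Minimising this quantity with q(z|y) as the prior is what pulls the two posteriors towards each other.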
Finally, note that it is straightforward to compute both q(z|x) and q(z|y) when we have parallel x and y. However, when we only have access to unlabelled data, as described in Section 3.2, we must rely on the pseudo x-y pairs generated by our NLU or NLG models, so that we can match an inferred posterior to a pre-defined prior reflecting our belief about the shared latent space.

Training Summary
In general, JUG subsumes the following three training scenarios which we will experiment with.
When we have fully labelled x and y, JUG jointly optimises NLU and NLG in a supervised fashion with the following objective:

    L_basic = Σ_{(x,y)∈(X,Y)} (L_x + L_y)

where (X, Y) denotes the set of labelled examples, and L_x and L_y are the supervised lower bounds of Section 3.1. Additionally, in the fully supervised setting, JUG can be trained to optimise the NLU, NLG and auto-encoding paths together, using the marginal lower bounds U_x and U_y of Section 3.2. This corresponds to the following objective:

    L_marginal = Σ_{(x,y)∈(X,Y)} (L_x + L_y + U_x + U_y)

Furthermore, when we have additional unlabelled x or y, we optimise a semi-supervised JUG objective:

    L_semi = Σ_{(x,y)∈(X,Y)} (L_x + L_y) + Σ_{x∈X} U_x + Σ_{y∈Y} U_y

where X denotes the set of utterances and Y denotes the set of formal representations.
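Assembling the combined objective is then a matter of summing per-example bounds. In the sketch below the loss callables are stand-ins for the supervised and marginal bounds, not the paper's actual implementation:

```python
def semi_supervised_loss(labelled, unlabelled_x, unlabelled_y,
                         loss_xy, loss_x, loss_y):
    # supervised bounds on (x, y) pairs plus marginal bounds on unpaired data
    total = sum(loss_xy(x, y) for x, y in labelled)
    total += sum(loss_x(x) for x in unlabelled_x)
    total += sum(loss_y(y) for y in unlabelled_y)
    return total
```

Setting the unlabelled sets to be empty recovers the fully supervised objective.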

Experiments
We experiment on two dialogue datasets with different formal representations to test the generality of our model. The first dataset is E2E (Novikova et al., 2017), which contains utterances annotated with flat slot-value pairs as their semantic representations. The second is the recent weather dataset (Balakrishnan et al., 2019), where both utterances and semantics are represented in tree structures. Examples of the two datasets are provided in Tables 1 and 2.
Table 1: An example from the E2E dataset.
Natural language: "sousa offers british food in the low price range. it is family friendly with a 3 out of 5 star rating. you can find it near the sunshine vegetarian cafe."
Semantic representation: restaurant_name=sousa, food=english, price_range=cheap, customer_rating=average, family_friendly=yes, near=sunshine vegetarian cafe

Training Scenarios
We primarily evaluate our models on the original splits of the datasets, which enables us to fairly compare fully-supervised JUG with existing work on both NLU and NLG. Statistics of the two datasets can be found in Table 3.
In addition, we set up an experiment to evaluate semi-supervised JUG with a varying amount of labelled training data (5%, 10%, 25%, 50%, 100%, with the rest being unlabelled). Note that the original E2E test set is deliberately designed with unseen slot-values to make it difficult (Dušek et al., 2018, 2020); we remove this distribution bias by randomly re-splitting the E2E dataset. In contrast, utterances in the weather dataset contain extra tree-structure annotations which make the NLU task a toy problem. We therefore remove these annotations to make NLU more realistic, as shown in the second row of Table 2.
As described in Section 3.4, we can optimise our proposed JUG model in various ways. We investigate the following approaches: JUG_basic jointly optimises NLU and NLG on labelled data; JUG_marginal additionally optimises the auto-encoding objectives on labelled data; JUG_semi further leverages unlabelled data with the semi-supervised objective.

Baseline Systems
We compare our proposed model with existing methods, as shown in Table 4, and with two designed baselines: Decoupled: the NLU and NLG models are trained separately by supervised learning. Both individual models have the same encoder-decoder structure as JUG; the main difference is that there is no shared latent variable between the two individual NLU and NLG models.
Augmentation: we pre-train Decoupled models to generate pseudo labels from the unlabelled corpus (Lee, 2013), in a setup similar to back-translation (Sennrich et al., 2016). The pseudo data and labelled data are then used together to fine-tune the pre-trained models.
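The Augmentation baseline reduces to a pseudo-labelling loop. A minimal sketch, where `nlu_predict` is an arbitrary stand-in for a pre-trained NLU model:

```python
def augment(labelled, unlabelled_x, nlu_predict):
    # label unlabelled utterances with a pre-trained model (Lee, 2013),
    # then pool the pseudo pairs with the gold pairs for fine-tuning
    pseudo = [(x, nlu_predict(x)) for x in unlabelled_x]
    return labelled + pseudo
```

Unlike JUG, the pseudo pairs here are produced once by a fixed model rather than being resampled during training.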
Among all systems in our experiments, the number of units in the LSTM encoder/decoder is set to {150, 300} and the dimension of the latent space is 150. The Adam optimiser (Kingma and Ba, 2014) is used with learning rate 1e-3. Batch size is set to {32, 64}. All the models are fully trained.

Main Results
We start by comparing the performance of JUG_basic with existing work, following the original splits of the datasets. The results are shown in Table 4. On the E2E dataset, we follow previous work in using the F1 of slot-values as the measurement for NLU, and BLEU-4 for NLG. For the weather dataset, there are only published results for NLG. It can be observed that the JUG_basic model outperforms the previous state-of-the-art NLU and NLG systems on the E2E dataset, and also for NLG on the weather dataset. The results demonstrate the effectiveness of introducing the shared latent variable z for jointly training NLU and NLG. We will further study the impact of the shared z in Section 4.4.2.
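The slot-value F1 used for NLU on E2E can be computed as precision/recall over predicted and gold slot-value sets. A sketch under our own assumptions (the paper does not specify the exact counting, so the micro-averaging here is illustrative):

```python
def slot_value_f1(predicted, gold):
    # predicted, gold: lists of sets of (slot, value) pairs, one set per example
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # correctly predicted pairs
        fp += len(pred - ref)   # spurious pairs
        fn += len(ref - pred)   # missed pairs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```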
We also evaluate the three training scenarios of JUG in the semi-supervised setting, with different proportions of labelled and unlabelled data. The results for E2E are presented in Tables 5 and 6. All our model variants perform better than the baselines on both NLU and NLG. When using only labelled data, our model JUG_marginal surpasses Decoupled across all four measurements. The gains mainly come from the fact that the model uses the auto-encoding objectives to help learn a shared semantic space. Compared to Augmentation, JUG_marginal also has a 'built-in mechanism' to bootstrap pseudo data on the fly during training (see Section 3.4). When adding extra unlabelled data, our model JUG_semi obtains further performance boosts and outperforms all baselines by a significant margin.
Figure 2: Visualisation of latent variable z. Given a pair of x and y, z can be sampled from the posterior q(z|x) or q(z|y), denoted by blue and orange dots respectively.

With the varying proportion of unlabelled data in the training set, we see that unlabelled data is helpful in almost all cases. Moreover, the performance gain is more significant when there is less labelled data. This indicates that the proposed model is especially helpful in low-resource setups, where there is a limited amount of labelled training examples but a larger pool of unlabelled ones.
The results for the weather dataset are presented in Tables 7 and 8. In this dataset, NLU is closer to a semantic parsing task (Berant et al., 2013), and we use exact-match accuracy as its measurement; NLG is measured by BLEU. The results reveal a very similar trend to that on E2E. Generated examples can be found in the Appendix.

Analysis
In this section we further analyse the impact of the shared latent variable and also the impact of utilising unlabelled data.

Visualisation of Latent Space
As mentioned in Section 2.1, the latent variable z can be sampled from either posterior approximation, q(z|x) or q(z|y). We inspect the latent space in Figure 2 to find out how well the model learns intent sharing. We plot z for the E2E dataset in a two-dimensional space using the t-SNE projection (Maaten and Hinton, 2008).
We observe two interesting properties. First, for each data point (x, y), the z values sampled from q(z|x) and q(z|y) are close to each other. This reveals that the meanings of x and y are tied in the latent space. Second, there exist distinct clusters in the space of z. By further inspecting the actual examples within each cluster, we found that a cluster represents a similar meaning composition. For instance, the cluster cen-

Impact of the Latent Variable
One novelty of our model is the introduction of the shared latent variable z for natural language x and formal representations y. A common problem in neural variational models is that, when coupled with a powerful autoregressive decoder, the model tends to ignore z and rely solely on the decoder to generate the data (Bowman et al., 2016; Goyal et al., 2017). To examine to what extent our model actually relies on the shared variable in both NLU and NLG, we seek an empirical answer by comparing the JUG_basic model with a variant which uses a random value of z sampled from a normal distribution N(0, 1) during testing. From Table 9, we observe a large performance drop when z is assigned random values. This suggests that JUG indeed relies greatly on the shared variable to produce good-quality x or y.

We further analyse the various sources of errors to understand the cases which z helps to improve. On the E2E dataset, a wrong prediction in NLU comes from predicting the not_mention label for slots present in the ground-truth semantics; predicting arbitrary values for slots not present in the ground-truth semantics; or predicting wrong values compared to the ground truth. These three types of error are referred to as Missing (Mi), Redundant (Re) and Wrong (Wr) in Table 10. For NLG, semantic errors consist of either missing or generating wrong slot values for the given semantics (Wen et al., 2015). Our model makes fewer mistakes across all these error sources compared to the baseline Decoupled. We believe this is because the clustering property learned in the latent space provides better feature representations at a global scale, eventually benefiting NLU and NLG.

Table 11: Comparison of sources of unlabelled data for semi-supervised learning using only utterances (x), only semantic representations (y), or both (x and y). The JUG_basic model is trained on 5% of the training data.
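The Missing/Redundant/Wrong breakdown can be counted directly from predicted and gold slot assignments. A minimal sketch, where the dictionary encoding (an absent slot plays the role of not_mention) is our assumption:

```python
def error_breakdown(predicted, gold):
    # predicted, gold: dicts mapping slot -> value
    missing = sum(1 for s in gold if s not in predicted)          # Mi
    redundant = sum(1 for s in predicted if s not in gold)        # Re
    wrong = sum(1 for s in gold
                if s in predicted and predicted[s] != gold[s])    # Wr
    return missing, redundant, wrong
```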

Impact of Unlabelled Data Source
In Section 4.3, we found that the performance of our model can be further enhanced by leveraging unlabelled data. As we used unlabelled utterances and unlabelled semantic representations together, it is unclear whether both contributed to the performance gain. To answer this question, we start with the JUG_basic model and experiment with adding unlabelled data from: 1) only unlabelled utterances x; 2) only semantic representations y; 3) both x and y. As shown in Table 11, when adding any single-sourced unlabelled data (x or y), the model improves to a certain extent. However, the performance is maximised when both data sources are utilised. This strengthens the argument that our model can leverage bi-sourced unlabelled data more effectively via latent-space sharing, improving NLU and NLG at the same time.

Related Work
Natural Language Understanding (NLU) refers to the general task of mapping natural language to formal representations. One line of research in the dialogue community aims at detecting slot-value pairs expressed in user utterances as a classification problem (Henderson et al., 2012; Sun et al., 2014; Mrkšić et al., 2017; Vodolán et al., 2017). Another line of work focuses on converting single-turn user utterances to more structured meaning representations as a semantic parsing task (Zettlemoyer and Collins, 2005; Jia and Liang, 2016; Dong and Lapata, 2018; Damonte et al., 2019).
In comparison, Natural Language Generation (NLG) is scoped as the task of generating natural utterances from their formal representations. This is traditionally handled with a pipelined approach (Reiter and Dale, 1997) comprising content planning and surface realisation (Walker et al., 2001; Stent et al., 2004). More recently, NLG has been formulated as an end-to-end learning problem where text strings are generated with recurrent neural networks conditioned on the formal representation (Wen et al., 2015; Dušek and Jurcicek, 2016; Dušek et al., 2020; Balakrishnan et al., 2019; Tseng et al., 2019).
There has been very recent work which performs NLU and NLG jointly. Both Ye et al. (2019) and Cao et al. (2019) explore the duality of semantic parsing and NLG. The former optimises two sequence-to-sequence models using dual information maximisation, while the latter introduces a dual learning framework for semantic parsing. A related learning framework applies dual supervised learning (Xia et al., 2017), where both NLU and NLG models are optimised towards a joint objective. That method brings benefits with annotated data in supervised learning, but does not allow semi-supervised learning with unlabelled data. In contrast, we propose a generative model which couples NLU and NLG with a shared latent variable. We focus on exploring a coupled representation space between natural language and the corresponding semantic annotations. As shown in our experiments, the information sharing helps our model leverage unlabelled data for semi-supervised learning, which eventually benefits both NLU and NLG.

Conclusion
We proposed a generative model which couples natural language and formal representations via a shared latent variable. Since the two spaces are coupled, we gain the ability to exploit each unpaired data source and transfer the acquired knowledge into the shared meaning space. This eventually benefits both NLU and NLG, especially in low-resource scenarios. The proposed model is also applicable to other translation tasks between two modalities.
As a final remark, natural language is richer and more informal, and NLU needs to handle ambiguous or erroneous user inputs, whereas the formal representations consumed by an NLG system are precisely defined. In future work, we aim to refine our generative model to better emphasise this difference between the two tasks.