Dual Inference for Improving Language Understanding and Generation

Natural language understanding (NLU) and natural language generation (NLG) hold a strong dual relationship: NLU aims at predicting semantic labels from natural language utterances, while NLG does the opposite. Prior work mainly focused on exploiting this duality during model training in order to obtain better-performing models. However, given the fast-growing scale of models in the current NLP area, retraining whole NLU and NLG models is sometimes impractical. To address this issue, this paper proposes to leverage the duality in the inference stage without any retraining. Experiments on three benchmark datasets demonstrate the effectiveness of the proposed method for both NLU and NLG, showing its great potential for practical usage.


Introduction
Various tasks, though different in their goals and formulations, are usually not independent and exhibit diverse relationships with each other within each domain. Many tasks come in a dual form, where swapping the input and the target of one task formulates the other. Such structural duality has emerged as an important relationship for further investigation and has been utilized in many tasks, including machine translation (Wu et al., 2016), speech recognition and synthesis (Tjandra et al., 2017), and so on. Previous work first exploited the duality of task pairs and proposed supervised (Xia et al., 2017) and unsupervised, reinforcement-learning-based (He et al., 2016) learning frameworks for machine translation. Recent studies magnified the importance of the duality by revealing that exploiting it can boost the learning of both tasks.
Natural language understanding (NLU) (Tur and De Mori, 2011; Hakkani-Tür et al., 2016) and natural language generation (NLG) (Wen et al., 2015; Su et al., 2018) are two major components in modular conversational systems, where NLU extracts core semantic concepts from the given utterances, and NLG constructs the associated sentences based on the given semantic representations. Su et al. (2019) made the first attempt to leverage the duality in dialogue modeling, employing a dual supervised learning framework for training NLU and NLG. Furthermore, Su et al. (2020) proposed a joint learning framework that trains the two modules seamlessly, moving towards the potential of unsupervised NLU and NLG. Recently, Zhu et al. (2020) proposed a semi-supervised framework that learns NLU with an auxiliary generation model for pseudo-labeling in order to make use of unlabeled data.
Despite the effectiveness shown by the prior work, these studies all focused on leveraging the duality in the training process to obtain powerful NLU and NLG models; there has been little investigation of how to leverage the dual relationship in the inference stage. Considering the fast-growing scale of models in the current NLP area, such as BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020), retraining whole models may be difficult. Motivated by this constraint, this paper introduces a dual inference framework, which takes advantage of existing models from two dual tasks without re-training (Xia et al., 2017) to perform inference for each individual task based on the duality between NLU and NLG. The contributions are three-fold: • This paper is the first work to propose a dual inference framework for NLU and NLG that utilizes their duality without model re-training.
• The presented framework is flexible for diverse trained models, showing the potential of practical applications and broader usage.
• The experiments on diverse benchmark datasets consistently validate the effectiveness of the proposed method.

Proposed Dual Inference Framework
With the semantics space X and the natural language space Y, given n data pairs sampled from the joint space X × Y, the goal of NLG is to generate corresponding utterances based on given semantics. In other words, the task is to learn a mapping function f(x; θ x→y) that transforms semantic representations into natural language. In contrast, the goal of NLU is to capture the core meaning from utterances, finding a function g(y; θ y→x) to predict semantic representations given natural language utterances. Note that in this paper, the NLU task has two parts: (1) intent prediction and (2) slot filling. Hence, y is defined as a sequence of words (y = {y i}), while the semantics x can be divided into an intent x I and a sequence of slot tags x S = {x S i} (x = (x I, x S)). Considering that this paper focuses on the inference stage, diverse strategies can be applied to train these modules. Here we adopt a typical strategy based on maximum likelihood estimation (MLE) of the conditional distributions parameterized by the trainable parameters θ x→y and θ y→x.
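As a concrete illustration, token-level MLE reduces to minimizing the negative log-likelihood of the gold tokens under the model. The sketch below is a minimal pure-Python illustration; the function name and the probability values are hypothetical, not from the paper:

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one sequence, given the probability the
    model assigns to the gold token at each timestep. MLE training
    minimizes the sum of this quantity over all training pairs."""
    return -sum(math.log(p) for p in step_probs)

# Toy example: probabilities assigned to a gold 3-token utterance.
loss = sequence_nll([0.9, 0.5, 0.8])  # -(ln 0.9 + ln 0.5 + ln 0.8)
```

A perfectly confident model (probability 1.0 at every step) attains a loss of zero, which is the MLE optimum.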

Dual Inference
After obtaining the parameters θ x→y and θ y→x in the training stage, a normal inference process works as follows:

x̂ = argmax_x P(x | y; θ y→x),  ŷ = argmax_y P(y | x; θ x→y),

where P(·) represents a probability distribution, and x̂ and ŷ stand for the model predictions. We can leverage the duality between f(x) and g(y) in the inference process (Xia et al., 2017). Taking NLG as an example, the core concept of dual inference is to decompose the normal inference function into two parts: (1) inference based on the forward model θ x→y and (2) inference based on the backward model θ y→x. The inference process can now be rewritten as:

ŷ = argmax_y α log P(y | x; θ x→y) + (1 − α) log P(y | x; θ y→x),  (1)

where α is an adjustable weight for balancing the two inference components. Based on Bayes' theorem, the second term in (1) can be expanded as follows:

log P(y | x; θ y→x) = log P(x | y; θ y→x) + log P(y; θ y) − log P(x; θ x),  (2)

where θ x and θ y are the parameters for the marginal distributions of x and y. Finally, the inference process considers not only the forward pass but also the backward model of the dual task. Formally, the dual inference processes of NLU and NLG can be written as:

x̂ = argmax_x α log P(x | y; θ y→x) + (1 − α) (log P(y | x; θ x→y) + β (log P(x; θ x) − log P(y; θ y))),  (3)
ŷ = argmax_y α log P(y | x; θ x→y) + (1 − α) (log P(x | y; θ y→x) + β (log P(y; θ y) − log P(x; θ x))),  (4)

where we introduce an additional weight β to adjust the influence of the marginals. The idea behind this inference method is intuitive: a model's prediction is reliable when the original input can be reconstructed from it. Note that this framework is flexible for any trained models (θ x→y and θ y→x): leveraging the duality requires no model re-training, only inference.
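The dual inference score for NLG can be sketched in a few lines of Python. `dual_score` and `dual_infer` are hypothetical helpers, not the authors' code, and the log-probabilities are assumed to come from pre-trained forward, backward, and marginal models:

```python
def dual_score(log_fwd, log_bwd, log_py, log_px, alpha=0.5, beta=0.5):
    """Dual inference score for one NLG hypothesis y given input x:
    a convex combination of the forward log-probability and the
    Bayes-expanded backward term weighted by the marginals."""
    return alpha * log_fwd + (1 - alpha) * (
        log_bwd + beta * (log_py - log_px))

def dual_infer(hypotheses, alpha=0.5, beta=0.5):
    """Pick the hypothesis with the highest dual score. Each hypothesis is
    (text, log P(y|x; fwd), log P(x|y; bwd), log P(y), log P(x))."""
    return max(
        hypotheses,
        key=lambda h: dual_score(h[1], h[2], h[3], h[4], alpha, beta))[0]
```

Setting α=1 recovers normal forward inference, so the forward-only decision can be reproduced as a special case of the same routine.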

Marginal Distribution Estimation
As derived in the previous section, marginal distributions of semantics P (x) and language P (y) are required in our dual inference method. We follow the prior work for estimating marginals (Su et al., 2019).

Language Model
We train an RNN-based language model (Mikolov et al., 2010;Sundermeyer et al., 2012) to estimate the distribution of natural language sentences P (y) by the cross entropy objective.
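To illustrate how a language-model marginal scores sentences, the sketch below uses a count-based bigram model with add-one smoothing as a lightweight stand-in for the RNN LM; the helpers are hypothetical, not the paper's code:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count-based bigram LM with add-one smoothing, a lightweight
    stand-in for an RNN LM when estimating log P(y)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, lm):
    """Smoothed log P(y) under the bigram model."""
    unigrams, bigrams, v = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(tokens[:-1], tokens[1:]))
```

As expected of a marginal P(y), a fluent word order scores higher than a scrambled one, which is exactly the signal dual inference uses to re-weight hypotheses.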
Masked Prediction of Semantic Labels

A semantic frame x contains an intent label and some slot-value pairs; for example, {Intent: "atis_flight", fromloc.city_name: "kansas city", toloc.city_name: "los angeles", depart_date.month_name: "april ninth"}. A semantic frame is a parallel set of discrete labels, which is not suitable for autoregressive modeling in the way language is. Prior work (Su et al., 2019) simplified the NLU task by treating semantics as a finite number of labels, and utilized masked autoencoders (MADE) (Germain et al., 2015) to estimate the joint distribution. However, the slot values can be arbitrary word sequences in the regular NLU setting, so MADE is no longer applicable to benchmark NLU datasets.
Considering the scalability issue and the parallel nature of semantic frames, we use a non-autoregressive masked model (Devlin et al., 2018) instead of MADE to predict the semantic labels. The masked model is a two-layer Transformer (Vaswani et al., 2017), illustrated in Figure 1. We first encode the slot-value pairs using a bidirectional LSTM, where the intent and each slot-value pair have corresponding encoded feature vectors. In each iteration, we mask out some encoded features from the input and use the masked slots or intent as the prediction targets. When estimating the density of a given semantic frame, we randomly mask one input semantic feature at a time, repeat this three times, and use the cumulative product of the probabilities of predicting the masked labels as the estimate of the marginal distribution.
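The density-estimation step can be sketched as follows, assuming a hypothetical `cond_prob` callable that stands in for the trained masked Transformer:

```python
import math
import random

def masked_density(features, cond_prob, n_samples=3, seed=0):
    """Estimate the density of a semantic frame by masking one feature at
    a time and multiplying the model's probability of recovering it.
    `cond_prob(context, target)` is a hypothetical stand-in for the
    trained masked model; `features` are the encoded intent/slot items."""
    rng = random.Random(seed)
    log_p = 0.0
    for _ in range(n_samples):
        i = rng.randrange(len(features))
        context = features[:i] + features[i + 1:]
        log_p += math.log(cond_prob(tuple(context), features[i]))
    return math.exp(log_p)
```

Summing in log space and exponentiating at the end keeps the cumulative product numerically stable when individual probabilities are small.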

Experiments
To evaluate the proposed method on a fair basis, we adopt two simple GRU-based models for NLU and NLG, whose details can be found in Appendix D. For NLU, accuracy and F1 are reported for intent prediction and slot filling respectively, while for NLG, the evaluation metrics include BLEU and ROUGE-1, ROUGE-2, and ROUGE-L scores with multiple references. The hyperparameters and other training settings are reported in Appendix A.

Datasets
The benchmark datasets used in our experiments are listed as follows: • ATIS (Hemphill et al., 1990): an NLU dataset containing audio recordings of people making flight reservations. It has sentence-level intents and word-level slot tags.
• SNIPS (Coucke et al., 2018): an NLU dataset focusing on evaluating voice assistants for multiple domains, which has sentence-level intents and word-level slot tags.
• E2E NLG (Novikova et al., 2017): an NLG dataset in the restaurant domain, where each meaning representation has up to 5 references in natural language and no intent labels.
We use the open-sourced Tokenizers package for preprocessing with byte-pair encoding (BPE) (Sennrich et al., 2016). The details of the datasets are shown in Table 1, where the vocabulary size is based on BPE subwords. We augment NLU data for NLG usage and NLG data for NLU usage; the augmentation strategies are detailed in Appendix C.
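To illustrate how BPE builds its subword vocabulary, the sketch below implements one training step, counting and merging the most frequent adjacent symbol pair, in pure Python; it is a toy illustration under a hypothetical corpus, not the Tokenizers package's implementation:

```python
from collections import Counter

def most_frequent_pair(words):
    """One step of BPE training: find the most frequent adjacent symbol
    pair across a corpus of frequency-weighted symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols[:-1], symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Apply a learned merge to every word in the corpus."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Character-level word counts for a toy corpus.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o"): 1}
pair = most_frequent_pair(words)   # ("l", "o"), seen 8 times
words = merge_pair(words, pair)    # "l"+"o" fused into the symbol "lo"
```

Repeating this count-and-merge loop for a fixed number of steps yields the subword vocabulary used to tokenize the datasets.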

Results and Analysis
Three baselines are trained for each dataset: (1) Iterative Baseline: simply training NLU and NLG iteratively, (2) Dual Supervised Learning (Su et al., 2019), and (3) Joint Baseline: the output of one model is fed to the other, as in Su et al. (2020). In the joint baseline, the outputs of NLU are an intent and IOB slot tags, whose modality differs from the NLG input, so a simple matching method is applied (see Appendix C).
For each trained baseline, the proposed dual inference technique is applied. The inference details are reported in Appendix B. We try two different approaches for choosing the inference parameters (α and β): (1) fixing them to (α=0.5, β=0.5), and (2) searching for the best-performing (α*, β*) on the validation set.

Table 2: For NLU, accuracy and F1 are reported for intent prediction and slot filling respectively. The NLG performance is reported by BLEU, ROUGE-1, ROUGE-2, and ROUGE-L (%). All reported numbers are averaged over three runs.
The results are shown in Table 2. For ATIS, all NLU models achieve the best performance when the parameters are selected for intent prediction and slot filling individually. For NLG, the models with (α=0.5, β=0.5) outperform both the baselines and the ones with (α*, β*), probably because of the discrepancy between the validation set and the test set. On SNIPS, for the models trained mainly by standard supervised learning (the iterative baseline and dual supervised learning), the proposed method with (α=0.5, β=0.5) outperforms the others in both NLU and NLG. However, the model trained with the connection between NLU and NLG behaves differently: it performs best on slot F1 and ROUGE with (α*, β*) and best on intent accuracy and ROUGE with (α=0.5, β=0.5).
In summary, the proposed dual inference technique consistently improves the performance of NLU and NLG models trained by different learning algorithms, showing its generalization to multiple datasets/domains and its flexibility with respect to diverse training baselines. Furthermore, for models learned by standard supervised learning, simply fixing the inference parameters to (α=0.5, β=0.5) is likely to improve performance.

Conclusion
This paper introduces a dual inference framework for NLU and NLG, enabling us to leverage the duality between the tasks without re-training large-scale models. The benchmark experiments demonstrate the effectiveness of the proposed dual inference approach for both NLU and NLG trained by different learning algorithms on different datasets, even without sophisticated parameter search, showing its great potential for future usage.

A Training Details
In all experiments, we use mini-batch Adam as the optimizer with a batch size of 48 on an Nvidia Tesla V100. We train for 10 epochs without early stopping; the hidden size of the network layers is 200 and the word embeddings are of size 50. The teacher forcing ratio is set to 0.9.

B Inference Details
During inference, we use beam search with a beam size of 20. When applying dual inference, we use beam search to decode 20 possible hypotheses with the primal model (e.g. NLG). Then, we take the dual model (e.g. NLU) and the marginal models to compute the probabilities of these hypotheses in the opposite direction. Finally, we re-rank the hypotheses using the probabilities from both directions and select the top-ranked hypothesis.
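The n-best decoding step can be illustrated with a minimal beam search sketch, where `step_probs_fn` is a hypothetical stand-in for the decoder's next-token distribution:

```python
import math

def beam_search(step_probs_fn, beam_size, max_len, vocab, eos="</s>"):
    """Minimal beam search sketch: keeps the `beam_size` prefixes with the
    highest accumulated log-probability at each step. `step_probs_fn(prefix)`
    returns a {token: probability} dict for the next position; finished
    hypotheses (ending in `eos`) are carried over unchanged."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))
                continue
            probs = step_probs_fn(prefix)
            for tok in vocab:
                p = probs.get(tok, 1e-12)  # floor to avoid log(0)
                candidates.append((prefix + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams
```

The returned n-best list is exactly what the dual model and marginal models then re-score in the opposite direction.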
To enable the NLU model to decode different hypotheses, we need an auto-regressive architecture for slot filling, as described in Appendix D.
C Data Augmentation

NLU → NLG As described in Section 3.2, the modality of the NLU outputs (an intent and a sequence of IOB slot tags) differs from the modality of the NLG inputs (a semantic frame containing an intent, if applicable, and slot-value pairs). Therefore, we propose a matching method: for each word, the NLU model predicts an IOB tag ∈ {O, B-slot, I-slot}; we drop the B- and I- prefixes, aggregate all the words tagged with the same slot, and combine the resulting slot-value pairs with the predicted intent.

For example, given the word sequence "leave on april ninth" tagged as "O O B-depart_date.month_name I-depart_date.month_name", the matched semantic frame is {depart_date.month_name: "april ninth"} together with the predicted intent.
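The matching method can be sketched as follows (a hypothetical helper, not the authors' code):

```python
def iob_to_frame(words, tags, intent=None):
    """Convert NLU output (one IOB tag per word) into an NLG-style
    semantic frame: drop the B-/I- prefixes and aggregate words that
    share the same slot name, then attach the predicted intent."""
    frame = {}
    for word, tag in zip(words, tags):
        if tag == "O":
            continue
        slot = tag.split("-", 1)[1]  # strip the "B-" / "I-" prefix
        frame[slot] = (frame[slot] + " " + word) if slot in frame else word
    if intent is not None:
        frame["Intent"] = intent
    return frame

words = ["leave", "on", "april", "ninth"]
tags = ["O", "O", "B-depart_date.month_name", "I-depart_date.month_name"]
frame = iob_to_frame(words, tags, intent="atis_flight")
# → {"depart_date.month_name": "april ninth", "Intent": "atis_flight"}
```

Note that this simple aggregation assumes each slot appears as one contiguous span, which holds for the matching described above.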

D Model Structure
For NLU, the model is a simple GRU (Cho et al., 2014) that takes the current word and the last predicted slot tag as input at each timestep i, with a linear layer at the end for intent prediction based on the final hidden state:

h_i = GRU([y_i; x̂ S i−1], h_{i−1}),  x̂ S i = softmax(W_S h_i),  x̂ I = softmax(W_I h_T).

The model for NLG is almost the same, but with an additional encoder for encoding semantic frames: slot-value pairs are encoded into semantic vectors for basic attention, and the mean-pooled semantic vector is used as the initial state. We borrow the encoder structure of Zhu et al. (2020) for our experiments. At each timestep i, the last predicted word and the attention-aggregated semantic vector c_i are used as the input:

h_i = GRU([ŷ_{i−1}; c_i], h_{i−1}),  ŷ_i = softmax(W h_i).