Generalizable and Explainable Dialogue Generation via Explicit Action Learning

Response generation for task-oriented dialogues implicitly optimizes two objectives at the same time: task completion and language quality. Conditioned response generation serves as an effective approach to separately and better optimize these two objectives. Such an approach relies on system action annotations, which are expensive to obtain. To alleviate the need for action annotations, latent action learning is introduced to map each utterance to a latent representation. However, this approach is prone to over-dependence on the training data, and the generalization capability is thus restricted. To address this issue, we propose to learn natural language actions that represent utterances as a span of words. This explicit action representation promotes generalization via the compositional structure of language. It also enables an explainable generation process. Our proposed unsupervised approach learns a memory component to summarize system utterances into a short span of words. To further promote a compact action representation, we propose an auxiliary task that restores state annotations as the summarized dialogue context using the memory component. Our proposed approach outperforms latent action baselines on MultiWOZ, a benchmark multi-domain dataset.


Introduction
Task-oriented dialogue systems complete tasks for users, such as making a hotel reservation or finding train routes, in a multi-turn conversation (Gao et al., 2018; Sun et al., 2016, 2017). The generated system utterances should not only sound natural but, more importantly, be informative, i.e., advance the dialogue towards task completion. To fulfill this requirement, conditioned response generation based on system actions is widely adopted (Wen et al., 2017; Chen et al., 2019). The response generation process is decoupled into two consecutive steps, where an action is first selected and then an utterance is generated conditioned on this action. One can optimize each step towards its goal, i.e., being informative and sounding natural, without impinging on the other (Yarats and Lewis, 2018). However, such approaches rely on action annotations (as in Table 1), which require domain knowledge and extensive effort to obtain. To deal with the absence of action annotations, latent action learning has been introduced (Zhao et al., 2018; Yarats and Lewis, 2018). System utterances are represented as low-dimensional latent variables by an auto-encoding task (Zhao et al., 2019), and utterances with the same representations are considered to convey similar meanings. Such action representations might be prone to over-dependence on the training data, which restricts the model's generalization capability, especially when multiple domains are considered. This is because, without explicit supervision, the desired property of capturing the intentions of system utterances in the latent space cannot be enforced (Locatello et al., 2019), which in turn is due to the implicit nature of latent variables.
For example, the variational auto-encoder (VAE), which is often used for latent action learning, tends to produce a balanced distribution over the latent variables (Zhao et al., 2018), while the true distribution of system actions is highly imbalanced (Budzianowski et al., 2018). The resulting misaligned action representations would confuse the models in both steps and degrade sample efficiency in training.
* Rui Zhang is the corresponding author.
To address the above issues, we propose to learn natural language actions that represent system utterances as a span of words, which explicitly reveal the underlying intentions. Natural language provides a unique compositional structure while retaining representation flexibility. These properties promote model generalization and thus make natural language a flexible representation for capturing characteristics with minimal assumptions (Jiang et al., 2019). Motivated by these advantages, we learn natural language actions by identifying salient words of system utterances. Here, a word is salient if it is indicative for a prediction task (e.g., sentiment analysis) that takes the original utterance as input. The main rationale is that the principal information that the task concerns can be preserved by the salient words alone. For example, the sentiment of the sentence "The movie starts out as competent but turn bland" can be revealed by the word "bland" when it is identified as salient by considering the complete context. In our scenario, we measure word saliency in terms of state transitions. This is because state transitions reflect how the intentions of a system utterance influence the dialogue progress, and action representations that capture such influences can well reveal the intentions (Chandak et al., 2019; Tennenholtz and Mannor, 2019; Huang et al., 2020b). By considering salient words for state tracking tasks as actions, we obtain action representations that enjoy the merits of natural language and indeed capture the characteristics of interest, i.e., the intentions of system utterances.
Applying existing saliency identification approaches (Ribeiro et al., 2018) to obtain salient words, however, does not produce unified action representations. Specifically, system utterances with the same intention might not share similar wordings, and existing attribution approaches can only identify salient words within utterances. We tackle this challenge by proposing a memory-augmented saliency approach that identifies salient words from a broader vocabulary. The vocabulary consists of all the words that could compose natural language actions, and each word is stored as a slot in the memory component. By incorporating the memory component into a dialogue state tracking model, we use each system utterance as a query to perform memory retrieval, and the retrieval results are considered as salient words. The retrieval results might contain redundant words, since we do not have direct supervision for the retrieval operations. For example, the resulting salient words might be "but turn bland" in the example shown earlier, which include unnecessary words and may lead to degraded action results. To obtain compact action representations, we propose an auxiliary task based on a pseudo parallel corpus, i.e., dialogue context and state annotation pairs. We observe that dialogue states serve as good examples of what a compact representation should look like. Therefore, we use the encoded dialogue context as the query and ask the memory component to reconstruct its text-based dialogue states. In this way, the obtained concise actions generalize better and can be easily interpreted.
Our contributions are summarized as follows:
• We propose to learn explicit action representations (in contrast to latent action representations) for task-oriented dialogues, which promotes more generalizable and explainable dialogue generation.
• We propose a novel memory-based approach with a pseudo parallel training scheme to obtain unified and compact action representations.
• We conduct experiments on a benchmark multi-domain dataset. Results show that our approach outperforms the state-of-the-art in both in-domain and cross-domain settings.

Preliminaries
Let $\{d_i \mid 1 \le i \le N\}$ be a set of dialogues, where each dialogue contains $n_d$ turns: $d_i = \{(c_t, a_t, x_t) \mid 1 \le t \le n_d\}$, where $c_t$ is the context at turn $t$, and $a_t$ is the dialogue action of system utterance $x_t$. The context $c_t = \{u_1, x_1, \ldots, u_t\}$ consists of the dialogue history of user utterances $u$ and system utterances $x$. Conditioned response generation tackles the context-to-response generation problem $p(x \mid c)$ via two consecutive steps: a content planning step $p_l(a \mid c)$ decides a dialogue action to advance the dialogue; and a surface realization step $p_r(x \mid a, c)$ further transforms the decided action into a natural-sounding utterance. Using the two-step process, response generation can be optimized towards better task completion while maintaining high language quality (Huang et al., 2020a; Zhao et al., 2019). The optimization process also consists of two parts. First, context-action pairs are used to train the content planning model $p_l(a \mid c)$ using the cross-entropy loss:
$$\mathcal{L}_l = -\sum_{i=1}^{N} \sum_{t=1}^{n_d} \log p_l(a_t \mid c_t). \tag{1}$$
Then, the surface realization model $p_r(x \mid a, c)$ is optimized from the $(c_t, a_t, x_t)$ triples to maximize the likelihood of ground-truth responses:
$$\mathcal{L}_r = -\sum_{i=1}^{N} \sum_{t=1}^{n_d} \log p_r(x_t \mid a_t, c_t). \tag{2}$$
Furthermore, to achieve better task completion, reinforcement learning (RL) is adopted to boost the pre-trained supervised models (Yarats and Lewis, 2018; Zhao et al., 2019). The reward in terms of task completion (e.g., success rate) is usually computed based on the final generated response (Budzianowski et al., 2018). To avoid divergence from fluent utterances, this fine-tuning stage focuses on the content planning model $p_l(a \mid c)$ and keeps the parameters of $p_r(x \mid a, c)$ fixed. The reward $R_t$ at each turn is back-propagated via policy gradients as:
$$\nabla_{\phi} J(\phi) = \mathbb{E}\Big[\sum_{t=1}^{n_d} R_t \, \nabla_{\phi} \log p_l(a_t \mid c_t)\Big], \tag{3}$$
where $J(\phi)$ is the expected reward and $\phi$ denotes the parameters of model $p_l$.
In order to enable conditioned response generation when action annotations are absent, latent action learning is introduced. Given dialogues $\{(c_t, x_t) \mid 1 \le t \le n_d\}$, latent action learning aims to map each utterance to a latent representation $z_d(x)$, e.g., one-hot (Wen et al., 2017) or continuous (Zhao et al., 2017). Based on the obtained $(c_t, z_d(x_t), x_t)$ triples, conditioned response generation is run as described above. Existing latent action learning approaches mostly build on the idea of variational inference, where a latent space is found to reconstruct system utterances and thus encodes the main characteristics of utterances (Zhao et al., 2018; Huang et al., 2020a). The action representations learned from the latent space are, however, difficult to generalize due to their implicit nature, which causes sample inefficiency.
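The policy-gradient fine-tuning of the content planner can be illustrated with a minimal sketch. It assumes, purely for illustration, that the planner scores a small fixed set of candidate actions with unnormalized logits; the logits, learning rate, and reward below are hypothetical values and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, action, reward):
    """Gradient of reward * log p_l(action | c) with respect to the planner logits."""
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# Hypothetical planner scores over three candidate actions for one context.
logits = [0.2, 1.0, -0.5]
grad = reinforce_grad(logits, action=1, reward=1.0)

# One gradient-ascent step on the planner only; the surface realization
# model is kept fixed during this stage, as described in the text.
updated = [v + 0.1 * g for v, g in zip(logits, grad)]
```

A positive reward raises the probability of the sampled action and lowers the others, which is the behaviour the fine-tuning stage relies on.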

Overview
We study the problem of natural language action learning for task-oriented dialogues. Specifically, we aim to represent each system utterance $x_t$ as a sequence of word tokens $l(x_t) = [w_1, w_2, \ldots, w_n]$ without dialogue action annotations. Conditioned response generation is then performed using $l(x_t)$ as the action. Since natural language actions (i.e., sequences of tokens) encode the intention of system utterances in a compact and expressive way, both dialogue planning and language generation could achieve an improved generalization capability.
We design a memory component to identify the salient words of system utterances in terms of modeling state transitions (Sec. 3.2). To further boost the memory's capability in learning compact natural language actions, we propose a novel auxiliary task to identify salient words of the dialogue context in a supervised setting (Sec. 3.3). Furthermore, we take further advantage of the action learning phase by reusing the memory component for conditioned response generation (Sec. 3.4).

Memory Augmented Action Learning
We aim to obtain salient words that are indicative of the effects of system utterances on state transitions. To model such effects, we train a dialogue state tracking model that takes the system utterances as input. We then regard as salient words the sequence of words that can substitute for the system utterance and yield similar state tracking results. To obtain sequences of words (i.e., natural language actions) with such characteristics, we use a learnable memory component that stores all potential words to form action representations, and optimize the memory in a self-supervised way.
Before presenting the proposed action learning approach, we first briefly introduce dialogue state tracking tasks.
Let $b_t \in \{0, 1\}^{N_b}$ denote the dialogue state at turn $t$, where $N_b$ is the number of all slot-value pairs. Dialogue state tracking is usually formulated as a multi-label learning problem, where the state at turn $t$ is predicted by modeling the conditional distribution $p(b_t \mid c_t) = p(b_t \mid u_t, x_{t-1}, b_{t-1})$, where $b_{t-1}$ is the dialogue state in the previous turn. To model this conditional distribution, a state tracking model $p_B(u_t, x_{t-1}, b_{t-1})$ mainly employs an utterance encoder and a context encoder that work with a slot-value predictor, which estimates whether a slot-value pair should be included in the dialogue state (Lee et al., 2019). Specifically, the predictor takes as input a slot-value pair $(s_i, e_i)$ together with the encoded utterances $h_{utt} \in \mathbb{R}^{D}$ and context $h_{ctx} \in \mathbb{R}^{D}$, given by the utterance encoder $f_{utt}(u_t, x_{t-1})$ and the context encoder $f_{ctx}(b_{t-1})$, respectively, where $D$ is the hidden dimension. The prediction is then performed by aggregating the results of the slot-value predictor $f_{val}(h_{utt}, h_{ctx}, (s_i, e_i))$ over all $N_b$ slot-value pairs. We optimize the state tracking model using the cross-entropy loss:
$$\mathcal{L}_{dst} = -\sum_{t=1}^{n_d} \log p_B(b_t \mid u_t, x_{t-1}, b_{t-1}), \tag{4}$$
where the parameters of $p_B$, which include $f_{utt}$, $f_{ctx}$, and $f_{val}$, are jointly trained.
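To make the multi-label formulation concrete, the following sketch scores a multi-hot dialogue state against per-pair probabilities. It assumes the cross-entropy is applied independently to each slot-value pair (a binary cross-entropy), which is one common reading of a multi-label loss; the ontology and probabilities are hypothetical.

```python
import math

def dst_loss(pair_probs, state):
    """Binary cross-entropy summed over all N_b slot-value pairs (assumed form)."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for p, b in zip(pair_probs, state):
        loss -= b * math.log(p + eps) + (1 - b) * math.log(1 - p + eps)
    return loss

# Hypothetical ontology with N_b = 4 slot-value pairs.
pairs = [("food", "european"), ("food", "indian"),
         ("price-range", "moderate"), ("price-range", "expensive")]
b_t = [1, 0, 1, 0]  # multi-hot state: food=european, price-range=moderate

good = dst_loss([0.9, 0.1, 0.8, 0.2], b_t)  # predictor agrees with the state
bad = dst_loss([0.2, 0.7, 0.3, 0.6], b_t)   # predictor disagrees with the state
```

In this sketch each entry of `pair_probs` stands in for the output of the slot-value predictor on one pair, so a predictor that agrees with the annotated state receives the lower loss.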
Based on the learned state tracking model, a straightforward way to obtain salient words is to apply importance attribution approaches. Specifically, these approaches measure the importance of each word by observing the prediction difference caused by replacing it (Ribeiro et al., 2018; Jin et al., 2020). As discussed before, this would result in different action representations for utterances with the same action. To address this issue, we consider learning action representations from a broader vocabulary, which relaxes the constraint of selecting salient words only within utterances.

Key-Value Memory Component
To this aim, we propose to use a memory component as the additional vocabulary. Note that the selection of words to build the vocabulary is task dependent, and we select the words appearing in state annotations and the content words (we consider nouns, verbs, and adjectives as content words) extracted from the task descriptions provided in the dataset (Budzianowski et al., 2018). This simple strategy is intuitive and turns out to be empirically competitive.
Given the built vocabulary, we adopt a key-value memory bank, where each memory slot stores a word in the vocabulary. Each memory slot is associated with a key vector and a value vector, given by the learnable matrices $K \in \mathbb{R}^{D \times N_v}$ and $V \in \mathbb{R}^{D \times N_v}$, respectively, where $N_v$ is the number of words stored in the memory. The memory is used to obtain action representations by context-aware memory retrieval. Specifically, we regard the encoded utterance $h_{utt}$ from the trained dialogue state tracking model as the query vector $q \in \mathbb{R}^{D}$. The retrieval is then conducted by computing the attention weights as
$$z = \mathrm{softmax}(K^{\top} q), \tag{5}$$
where $z \in \mathbb{R}^{N_v}$ is a probability vector over the slots. Memory slots with higher probability indicate that the corresponding words are expected to be more salient for representing the system utterance.
We can assume that a natural language action $l(x_{t-1})$ containing $k$ words is obtained by sampling $k$ times without replacement from the categorical distribution given by $z$, where the value of $k$ is set as a hyper-parameter.
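The retrieval and sampling steps can be sketched as follows. This is a minimal illustration, assuming the attention weights are a softmax over key-query dot products; the vocabulary, key vectors, and query are hypothetical stand-ins for the learned memory and the encoded utterance.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query, keys):
    """Attention weights z over memory slots from key-query dot products."""
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])

def sample_action(z, vocab, k, rng):
    """Draw k distinct words from the categorical distribution z."""
    remaining = list(range(len(vocab)))
    chosen = []
    for _ in range(k):
        weights = [z[i] for i in remaining]
        idx = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(vocab[idx])
        remaining.remove(idx)  # sampling without replacement
    return chosen

vocab = ["request", "departure", "inform", "price"]        # hypothetical memory words
keys = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-0.5, 0.5]]  # one key per slot (columns of K)
q = [2.0, 0.0]                                             # stands in for h_utt
z = retrieve(q, keys)
action = sample_action(z, vocab, k=2, rng=random.Random(0))
```

Storing the keys as one vector per slot mirrors the columns of the key matrix, so each entry of `z` lines up with one word in the vocabulary.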

Multi-Hop Mechanism
Building on the above sampling strategy, we further recognize that it is not plausible to assume that all natural language actions have the same length by setting $k$ as a hyper-parameter. This is because the information conveyed by system utterances varies. It is common to see certain utterances expressing more intentions than others, especially those that directly determine task completion after information is accumulated. Thus, inspired by the end-to-end memory network (Sukhbaatar et al., 2015), we design a multi-hop mechanism to adaptively decide the length of natural language actions. Specifically, after obtaining the probability vector $z$, we update the query based on the original query $q$ and the sum of memory values weighted by $z$, where $V \in \mathbb{R}^{D \times N_v}$ is the memory value matrix. Note that we denote the initial query vector $q$ (i.e., $h_{utt}$) as $q_1$ and the updated query as $q_2$ for simplicity. Using query $q_2$, we obtain a retrieval result $z_2$ in the same way as in Eqn. 5. By conducting such a $k$-hop memory operation (i.e., $k$ retrievals using the correspondingly updated queries), we obtain $k$ different categorical distributions. We now assume that each word is sampled from one distribution accordingly, so the length of a natural language action is the number of hops carried out. Thus, by adaptively deciding the number of hops, we can learn variable-length natural language actions. To this aim, we design an action gate component that predicts whether to carry out a next retrieval. We perform this prediction based on the current updated query, since it aggregates the information of the former query and memory slots after every retrieval operation. More specifically, we formulate the action gate as a binary random variable $t$, whose distribution is modeled by the sigmoid function $\sigma(\cdot)$ applied to the current updated query projected onto a learnable vector $G \in \mathbb{R}^{D}$.
In this way, we can obtain natural language actions of appropriate length, which are sampled from the distributions obtained before the action gate signals that retrieval should stop.
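A possible shape of the multi-hop loop with an action gate is sketched below. Several details are assumptions made for illustration only: the query update adds the attention-weighted value sum to the previous query (as in end-to-end memory networks), the gate probability is a sigmoid of the dot product between the gate vector and the updated query, the gate is thresholded at 0.5 rather than sampled, and `max_hops` is a hypothetical safety cap.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_hop(q, keys, values, gate, max_hops=5):
    """Return one categorical distribution per hop until the gate says stop."""
    dists = []
    for _ in range(max_hops):
        z = softmax([dot(q, key) for key in keys])
        dists.append(z)
        # Assumed update: add the z-weighted sum of value vectors to the query.
        read = [sum(z[j] * values[j][d] for j in range(len(values)))
                for d in range(len(q))]
        q = [a + b for a, b in zip(q, read)]
        # Assumed gate: sigmoid of the gate vector projected onto the new query.
        p_continue = 1.0 / (1.0 + math.exp(-dot(gate, q)))
        if p_continue < 0.5:
            break
    return dists

# Hypothetical two-slot memory with two-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
short = multi_hop([1.0, 0.0], keys, values, gate=[-10.0, 0.0])
long_ = multi_hop([1.0, 0.0], keys, values, gate=[10.0, 0.0], max_hops=3)
```

The number of returned distributions is the action length, so a gate that closes early yields a short action and a gate that stays open yields a longer one.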

Training
The memory component and action gate are trained end-to-end in a self-supervised way, where the feedback is whether an utterance and its action representation lead to similar state transitions. We can measure such similarity using a dialogue state tracking (DST) model. However, directly applying the DST model trained by Eqn. 4 might be sensitive to attribute changes between original utterances and compact natural language actions, which results in insufficient feedback. To address this issue, we adopt a denoising training strategy inspired by unsupervised machine translation (Lample et al., 2018, 2019), and obtain a DST model that is more robust to the attribute transformation. Specifically, we apply a noise function $g(x)$ to the utterances and train the DST model with the loss of Eqn. 4 computed on the corrupted utterances, where the noise function corrupts the input utterance by performing word drops and word order shuffling as specified in Lample et al. (2018).
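A noise function of this kind can be sketched as below, combining random word drops with a local shuffle in which each kept word moves at most `k` positions. The drop probability and shuffle window are hypothetical values; the paper does not state the settings it uses.

```python
import random

def add_noise(words, p_drop=0.1, k=3, rng=None):
    """Corrupt a token list: drop words at random, then shuffle locally."""
    rng = rng or random.Random()
    kept = [w for w in words if rng.random() >= p_drop]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty utterance
    # Sorting by (index + uniform noise in [0, k + 1)) keeps every word
    # within k positions of where it started.
    noisy_pos = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: noisy_pos[i])
    return [kept[i] for i in order]

utterance = "where will you be leaving from ?".split()
unchanged = add_noise(utterance, p_drop=0.0, k=0, rng=random.Random(0))
shuffled = add_noise(utterance, p_drop=0.0, k=2, rng=random.Random(2))
corrupted = add_noise(utterance, p_drop=0.3, k=3, rng=random.Random(1))
```

Setting both parameters to zero returns the utterance unchanged, which makes it easy to switch the corruption off when comparing against the plain DST loss.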
With a slight abuse of notation, we use $p_B(x_{t-1})$ to denote $p_B(u_t, x_{t-1}, b_{t-1})$. The training loss $\mathcal{L}_{mem}$ for the self-supervised task combines a Kullback-Leibler (KL) divergence term between $p_B(x_{t-1})$ and $p_B(l(x_{t-1}))$ with a term that fits the annotated state $b_t$ from $l(x_{t-1})$, where $l(x_{t-1})$ is the natural language action obtained via the memory component. This loss enforces the learned action representations to restore both the ground truth and predicted state transitions. Note that the natural language actions are sampled from categorical distributions, which is non-differentiable. To obtain gradients for the memory component during back-propagation, we apply a continuous approximation, i.e., we use the Gumbel-softmax trick to conduct sampling (Jang et al., 2016), which enables end-to-end differentiability.
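The Gumbel-softmax relaxation can be illustrated with a small standard-library sketch. The temperature and the example distribution are hypothetical; in practice the relaxed sample would be produced inside an automatic differentiation framework so that gradients reach the memory parameters.

```python
import math
import random

def gumbel_softmax(probs, tau, rng):
    """Relaxed one-hot sample from a categorical distribution."""
    perturbed = []
    for p in probs:
        u = rng.random()
        g = -math.log(-math.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
        perturbed.append((math.log(p + 1e-20) + g) / tau)
    m = max(perturbed)
    exps = [math.exp(v - m) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
z = [0.7, 0.2, 0.1]  # hypothetical retrieval distribution over three slots
y = gumbel_softmax(z, tau=0.5, rng=rng)

# The argmax of the relaxed samples follows the original distribution.
counts = [0, 0, 0]
for _ in range(2000):
    s = gumbel_softmax(z, tau=0.5, rng=rng)
    counts[s.index(max(s))] += 1
```

Lower temperatures push each sample closer to a one-hot vector, while higher temperatures give smoother samples with lower-variance gradients.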

Learning with Pseudo Parallel Corpus
Recall that we aim to learn natural language actions that are not only expressive but also compact, i.e., that only include words that encode system intentions. Although the memory-based approach could identify salient words from a broader vocabulary, the identified words might degenerate to the words making up most of the original utterances, which introduces redundant words into action representations. To avoid such suboptimal scenarios, we propose a supervised auxiliary task to further regularize the memory component. We use the encoded context $h_{ctx}$ given by $f_{ctx}(b_t)$ from the dialogue state tracking model as the query vector, and attempt to recover the dialogue state from the memory component. Here, we consider word-based dialogue state representations instead of multi-hot representations $b \in \{0, 1\}^{N_b}$. For example, the dialogue state "food=european, price-range=moderate" is transformed to the text span ['food', 'european', 'price-range', 'moderate']. We then form a pseudo parallel corpus $P$ by pairing each encoded state $f_{ctx}(b_t)$ with the corresponding word-based dialogue state $b_t^{text}$, where $b_t^{text}$ is the text span for $b_t$. We train the memory component on the pseudo parallel corpus with a loss $\mathcal{L}_{par}$ defined over the retrieval hops, where $k(b)$ is the length of the text span $b_t^{text}$, and $g_i \in \{0, 1\}$ indicates whether the multi-hop operation should end at step $i$, taking value one only when $i$ equals $k(b)$. For each pair in $P$, the loss consists of two terms: the first further guides the memory component to identify salient words, while the second enforces the memory component to pick only salient words and encourages action representations to remain compact.
The overall training objective of natural language action learning is a weighted combination of $\mathcal{L}_{mem}$, $\mathcal{L}_{par}$, and $\mathcal{L}_{dst}$, where $\alpha$ and $\beta$ are the weighting hyper-parameters. We include the term $\mathcal{L}_{dst}$ during action learning to ensure that the DST model provides sufficient supervision. Some components of the DST model (i.e., $f_{utt}$ and $f_{ctx}$) are updated via $\mathcal{L}_{mem}$ and $\mathcal{L}_{par}$, and by including $\mathcal{L}_{dst}$, we avoid a divergence from the state tracking task.
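The construction of the word-based dialogue state and its gate targets can be sketched as follows. The helper names are hypothetical, and the sketch assumes single-token slot names and values, as in the example from the text.

```python
def state_to_span(state):
    """Flatten slot-value pairs into a word-based dialogue state."""
    span = []
    for slot, value in state.items():
        span.extend([slot, value])
    return span

def gate_targets(span):
    """g_i is 1 only at the final hop i = k(b), and 0 for every earlier hop."""
    return [1 if i == len(span) else 0 for i in range(1, len(span) + 1)]

state = {"food": "european", "price-range": "moderate"}
span = state_to_span(state)    # target words, one per retrieval hop
targets = gate_targets(span)   # stop signal for the action gate
```

Each word in `span` supervises the retrieval distribution at one hop, and `targets` tells the action gate to stop exactly when the span has been fully recovered.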

Conditional Response Generation
After obtaining natural language actions, we enrich the dialogues as $\{(c_t, l(x_t), x_t) \mid 1 \le t \le n_d\}$, where $l(x_t)$ is the natural language action of utterance $x_t$. We can then run conditioned response generation to train the content planning and language generation models as in Eqn. 1-3. The learning efficiency can be improved by the more compact and noise-free action space. Moreover, the natural language actions present abundant information about correlations among actions, which allows for better generalization over actions (Chandak et al., 2019). To further enhance the generalization capability and boost the learning efficiency, we reuse the memory component for conditioned response generation. Specifically, we focus on the content planning model $p_l(a \mid c)$, which aims to decide one natural language action from the action set for response generation. We implement the content planning model as a network that encodes the dialogue context $c$ into a hidden state of the same dimension as the query vectors in the memory component. By using the encoded result as the query for memory retrieval, we obtain a distribution given by the retrieval results. We then select the action with the highest probability under the obtained distribution as the model output. This fine-tuning approach not only reduces the model complexity for content planning, but also better exploits the knowledge gained in the action learning phase.

Experiments
To show the effectiveness of the proposed approach, memory-augmented saliency with parallel corpus (MASP), we experiment on two dialogue generation settings (Sec. 4.1). We compare against state-of-the-art approaches in both settings (Sec. 4.2). We analyze the effectiveness of MASP components under different supervision ratios, and discuss how explainable generation is achieved (Sec. 4.3).

Settings
We use MultiWOZ (Budzianowski et al., 2018), a multi-domain human-human conversational dataset, in our experiments. It contains 8438 dialogues in total, spanning seven domains, and each dialogue has 13.7 turns on average. We use the same training, validation, and test split as the original MultiWOZ dataset. Following Budzianowski et al. (2018), we measure dialogue task completion by how often the system provides a correct entity (Inform) and answers all the requested information (Success). We use BLEU (Papineni et al., 2002) to measure the language quality of generated responses.
We use a three-layer transformer (Vaswani et al., 2017) with a hidden size of 128 and 4 heads as our base model for content planning and response generation, i.e., $p_l(a \mid c)$ and $p_r(x \mid a, c)$, respectively. We use grid search to find the best hyper-parameters for the models based on validation performance, which we measure with a combination of Inform, Success, and BLEU scores. We choose the embedding dimensionality $d$ among {50, 75, 100, 150, 200}, and the hyper-parameters $\alpha$ and $\beta$ in [0.01, 1.0].
We consider two settings to thoroughly evaluate conditioned response generation: multi-domain joint training and cross-domain response generation. In the first setting, we train MASP and the baselines using different sizes of the training dialogues (20%/50%/full); for the tasks using 20% or 50% of the data, the distribution of dialogues across domains is kept the same as in the full training set. In the cross-domain setting, we adopt a leave-one-out approach to evaluate the generalization ability via a more challenging few-shot learning task. Specifically, we use one domain as the low-resource target domain (with only 1% of its dialogues available for training) and the others as source domains.
We compare with the following baselines that do not consider conditioned generation: (1) Seq-to-Seq (Budzianowski et al., 2018), implemented based on the transformer (Vaswani et al., 2017); (2) TSCP (Lei et al., 2018); and two baselines that adopt latent action learning for conditioned generation: (3) LaRL (Zhao et al., 2019); (4) MALA (Huang et al., 2020a). Note that for these two approaches, we experiment with both discrete and continuous latent action representations. We also compare the full model MASP with its two variants: (1) Post-hoc Saliency obtains action representations via the importance attribution technique of Jin et al. (2020); (2) Memory-based Saliency employs the same memory component as MASP but is trained without the pseudo parallel corpus.

Overall Results
Table 2 shows that MASP outperforms the baselines in the multi-domain joint training setting. MASP achieves better dialogue task completion (measured by Inform and Success) and language quality (measured by BLEU), especially in low-resource scenarios. For example, MASP (70.2) outperforms MALA (63.5) by 10.5% under Inform with 20% of the training data. Meanwhile, we also find that the memory component and the pseudo parallel enhanced training are essential for obtaining effective action representations. For example, Post-hoc Saliency (57.9) is outperformed by MALA (65.0) by a large margin under Success with 50% of the training data, while MASP (71.5) achieves a 10% gain over MALA. This validates that the unified and compact characteristics are required for natural language actions to boost conditioned generation. We further find that the contributions of the memory component and the pseudo parallel corpus vary with the ratio of training data. For example, the memory component brings 11.9% and 3.0% improvements over Post-hoc Saliency under Inform when the ratio is 50% and 20%, respectively, while the pseudo parallel corpus brings 3.4% and 8.5% improvements over Memory-based Saliency. This is largely because the memory component could easily degenerate to utterance restoration when less training data is available, and thus the regularization provided by the pseudo parallel corpus becomes more valuable.
For the cross-domain setting, Table 3 includes three representative domains (hotel, attraction, and train), and the observations on the other domains are consistent. The results show that MASP significantly outperforms the baselines in each configuration. For example, MASP (39.2) outperforms MALA (33.9) by 15.6% under Inform in the hotel domain. By comparing the results of Memory-based Saliency and MALA in attraction and train, we find that without the pseudo parallel corpus, natural language actions could still be competitive occasionally. We conduct a detailed analysis in the next section. We also find that continuous latent action approaches achieve results comparable to their discrete counterparts, which is not the case in the joint training setting. For example, MALA with continuous actions (41.9) is only slightly outperformed by its discrete counterpart (42.2) under Success using attraction as the target. This is largely because the challenging cross-domain task could result in many mis-assigned action labels, and continuous action representations can still preserve certain knowledge of similarities among actions.
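The percentage gains quoted in this section are consistent with relative improvements over the baseline score, as the following arithmetic check on three of the reported pairs suggests. The function name is hypothetical; the scores are the ones quoted in the text.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

inform_20 = relative_gain(70.2, 63.5)   # MASP vs. MALA, Inform, 20% of training data
success_50 = relative_gain(71.5, 65.0)  # MASP vs. MALA, Success, 50% of training data
hotel = relative_gain(39.2, 33.9)       # MASP vs. MALA, Inform, hotel as target domain
```

The three values come out at roughly 10.5%, 10%, and 15.6%, matching the figures reported for these comparisons.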

Discussions
We first study the effects of different components of MASP in the cross-domain setting. We compare MASP and its two variants with MALA (discrete action) under different dialogue ratios in target domains. The results are shown in Fig. 1. We can see that Memory-based Saliency is more comparable to MALA when using train as the target domain, especially when the dialogue ratio is low. This is largely because there is much shared knowledge of system intentions and state transitions between the taxi and train domains, and the memory component could benefit from such knowledge via the dialogue state tracking model. On the other hand, for target domains that do not have much of this advantage (e.g., attraction), the pseudo parallel corpus might contribute more to action learning. This conclusion is also consistent with what we observe in multi-domain joint training.
Last, we study the effects of the content planning model design. We mainly consider two types of content planning models that work on natural language actions: an action decoder and an action classifier, denoted as Act-DEC and Act-CLS, respectively. Specifically, the action decoder generates a text span and feeds it to the language generation model, while the action classifier conducts classification to select one action from the action set given by the training set. We also consider enhancing the planning model with (1) action embeddings computed by summing the word embeddings of the words in actions; (2) the memory component as discussed in Sec. 3.4. From the results shown in Table 4, we can see that reusing the memory component effectively improves the performance of conditioned response generation. We also find that the action classifier generally performs better than the action decoder, while the latter is more flexible in manipulating the content to generate. This is aligned with our intuition, since more specific and task-relevant intentions are more favorable for task-oriented dialogues.
Moreover, through natural language actions, we obtain a transparent response generation process, where the decided intermediate action is human-understandable. Such transparency could help alleviate the credit assignment issue by identifying the effectiveness of dialogue planning and surface realization. Table 5 shows that the proposed approach can obtain interpretable action representations (e.g., "request-departure") for utterances that have the same intention but different wording. This table also shows an error that our approach made in action learning, where the sentence highlighted in bold expresses "inform-address" instead of "inform-area". This may be because the utterance contains multiple intentions and is thus more challenging for action learning. Table 6 shows that, with the learned natural language actions, we can better identify the source of errors in conditioned response generation. The two generated responses read naturally but express inappropriate intentions. The upper and lower examples showcase an action decision error and a language generation error, respectively. These help recognize the cause of errors and guide further optimization of the relevant components (content planning model or surface realization model).

Related Work
Early studies of conditioned response generation focus on enriching the meaning representations in task-oriented dialogues, e.g., utilizing graph structures and hierarchies among actions (Chen et al., 2019; Yang et al., 2020), decomposing into fine-grained actions (Shu et al., 2019), or encoding syntax attributes (Balakrishnan et al., 2019). Since these approaches often assume expensive action annotations, recent years have seen a growing interest in learning latent actions in an unsupervised way (Zhao et al., 2019; Huang et al., 2020a). These approaches build on either adversarial learning (Hu et al., 2017; Wang et al., 2018; Yang et al., 2018) or variational inference (Kingma and Welling, 2014) and encode all system utterances via a self-reconstruction task or distant supervision (Yarats and Lewis, 2018). Due to their implicit nature, latent actions are difficult to generalize, and we aim to overcome this limitation by learning explicit action representations.

Table 5
System utterances sharing the same natural language action:
i am sorry , to help narrow down the results please reply with where you will be departing from
i am going to need a little more information from you . where will you be leaving from ?
System utterances with the action {'inform', 'price', 'area', 'offer', 'reservation'}:
the address is 169 high street chesterton and the price range is fairly expensive . would you like to make reservations ?
i have one that is called saigon city . it 's more expensive and located in the north . can i make a reservation for you ?
we have 14 indian restaurants in the expensive category . do you have any more information to narrow down the search ?
* The parts highlighted in bold are the information missing from the action representation.
Our study is also related to attribution approaches, which aim to find features or regions of the input that are important for tasks. Different types of techniques, including gradient-based (Selvaraju et al., 2017) and post-hoc (Ribeiro et al., 2018), are applied to reinforcement learning (Mott et al., 2019), computer vision (Adebayo et al., 2018), and text classification (Jin et al., 2020). While these works focus on interpreting model behaviors, we aim to find salient words beyond the input and utilize them as action representations.

Conclusions
We propose explicit action learning to achieve generalizable and interpretable dialogue generation. Our proposed model MASP learns unified and compact action representations. We propose a memory component that summarizes system utterances into natural language actions, i.e., spans of words from a unified vocabulary. We further introduce an auxiliary task to encourage natural language actions to only preserve task-relevant information. Experimental results confirm that MASP achieves better performance compared with the state-of-the-art in different settings, especially when supervision is limited. We plan to consider structural action representation learning that could convey more information as future work.