Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention

Semantically controlled neural response generation on limited domains has achieved great performance. However, moving towards multi-domain, large-scale scenarios has been shown to be difficult because the possible combinations of semantic inputs grow exponentially with the number of domains. To alleviate this scalability issue, we exploit the structure of dialog acts to build a multi-layer hierarchical graph, where each act is represented as a root-to-leaf route on the graph. Then, we incorporate this graph structure prior as an inductive bias to build a hierarchical disentangled self-attention network, where we disentangle attention heads to model designated nodes on the dialog act graph. By activating different (disentangled) heads at each layer, combinatorially many dialog act semantics can be modeled to control the neural response generation. On the large-scale Multi-Domain-WOZ dataset, our model yields a significant improvement over the baselines on various automatic and human evaluation metrics.


Introduction
Conversational artificial intelligence (Young et al., 2013) is one of the critical milestones in artificial intelligence. Recently, there has been increasing interest among industrial companies in building task-oriented conversational agents (Li et al., 2017; Rojas-Barahona et al., 2017) to solve pre-defined tasks such as restaurant or flight bookings (see Figure 1 for an example dialog from MultiWOZ (Budzianowski et al., 2018)). Traditional agents are built on slot-filling techniques, which require significant manual engineering effort, and it is hard for them to generate natural-sounding utterances in a generalizable and scalable manner. Therefore, different semantically controlled neural language generation models have been developed (Wen et al., 2015, 2016a; Dusek and Jurcícek, 2016) to replace the traditional systems, where an explicit semantic representation (dialog act) is used to influence the RNN generation. The canonical approach, proposed in Wen et al. (2015), encodes each individual dialog act as a unique vector and uses it as an extra input feature to the long short-term memory (LSTM) cell to influence the generation. As pointed out in Wen et al. (2016b), these models, though achieving good performance on limited domains, suffer from a scalability problem because the possible dialog acts grow combinatorially with the number of domains.
To alleviate this issue, we propose a hierarchical graph representation that leverages the structural properties of dialog acts. Specifically, we first build a multi-layer tree to represent the entire dialog act space based on the inter-relationships among acts. Then, we merge the tree nodes with the same semantic meaning to construct an acyclic multi-layered graph, where each dialog act is interpreted as a root-to-leaf route on the graph. Such a graph representation of dialog acts not only captures the inter-relationships between different acts but also reduces the exponential representation cost to almost linear, which also endows it with greater generalization ability. Instead of simply feeding such a vectorized representation as an external feature vector to the neural network, we propose to incorporate this structure as an inductive prior for designing the neural architecture, which we name the hierarchical disentangled self-attention network (HDSA). In Figure 2, we show how the dialog act graph structure is explicitly encoded into the model architecture. Specifically, HDSA consists of multiple layers of disentangled self-attention modules (DSA). Each DSA has multiple switches to set the on/off state of its heads, and each head is bound to model a designated node in the dialog act graph. At the training stage, conditioned on the given dialog acts and the target output sentences, we only activate the heads in HDSA corresponding to the given acts (i.e., the path in the graph), so that the active heads are associated with their designated semantics. At test time, we first predict the dialog acts and then use them to activate the corresponding heads to generate the output sequence, thereby controlling the semantics of the generated responses without handcrafting rules.
As depicted in Figure 2, by gradually activating nodes from domain → action → slot, the model is able to narrow its response down to specifically querying the user about the color and type of the taxi, which provides both strong controllability and interpretability. Experimental results on the large-scale MultiWOZ dataset (Budzianowski et al., 2018) show that our HDSA significantly outperforms other competing algorithms (the code and data are released at https://github.com/wenhuchen/HDSA-Dialog). In particular, the proposed hierarchical dialog act representation effectively improves the generalization ability on unseen test cases and decreases the sample complexity on seen cases. In summary, our contributions include: (i) we propose a hierarchical graph representation of dialog acts to exploit their inter-relationships, which greatly reduces the sample complexity and improves generalization; (ii) we propose to incorporate the structure prior in semantic space to design HDSA, which explicitly models the semantics of neural generation and outperforms the baselines.

Related Work & Background
Canonical task-oriented dialog systems are built as pipelines of separately trained modules: (i) user intention classification (Shi et al., 2016; Goo et al., 2018), which is for understanding human intention; (ii) belief state tracker (Mrksic et al., 2017a,b; Zhong et al., 2018), which is used to track the user's query constraints and formulate a DB query to retrieve entries from a large database; (iii) dialog act prediction, which is applied to classify the system action; (iv) response generation (Rojas-Barahona et al., 2017; Wen et al., 2016b; Li et al., 2017; Lei et al., 2018), which realizes the language surface form given the semantic constraint. In order to handle the massive number of entities in the response, Rojas-Barahona et al. (2017) and Wen et al. (2015, 2016b) suggest breaking response generation into two steps: first generate delexicalized sentences with placeholders like <Res.Name>, and then post-process the sentence by replacing the placeholders with the DB record.
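The two-step generation described above (delexicalized output first, placeholder replacement second) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lexicalize`, the regular expression, and the `record` dictionary are assumptions made for illustration; only the placeholder style `<Res.Name>` is taken from the text.

```python
import re

# Hypothetical placeholder pattern of the form <Domain.Slot>, e.g. <Res.Name>.
PLACEHOLDER = re.compile(r"<([A-Za-z]+\.[A-Za-z]+)>")


def lexicalize(template, record):
    """Replace each placeholder with the matching DB value.

    Placeholders without a matching key in `record` are left untouched.
    """
    def fill(match):
        key = match.group(1)
        return str(record.get(key, match.group(0)))

    return PLACEHOLDER.sub(fill, template)


# Hypothetical DB record, assumed to be keyed by placeholder name.
record = {"Res.Name": "Little Seoul", "Res.Price": "low"}
template = "I recommend <Res.Name>, which has a <Res.Price> price."
print(lexicalize(template, record))
```

Keeping unknown placeholders intact, rather than deleting them, makes missing DB fields visible in the final response; a real system may prefer a different fallback.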

Figure 3: Illustration of the neural dialog system. We decompose it into two parts: the lower part describes the dialog state tracking and DB query, and the upper part denotes the Dialog Action Prediction and Response Generation. In this paper, we are mainly interested in improving the performance of the upper part.
2016), CamRes767 (Rojas-Barahona et al., 2017) and KVRET (Eric et al., 2017), etc. However, the recently introduced multi-domain, large-scale dataset MultiWOZ (Budzianowski et al., 2018) poses great challenges to these approaches due to its large number of slots and complex ontology. Dealing with such a large semantic space remains a challenging research problem.
We follow the nomenclature proposed in Rojas-Barahona et al. (2017) to visualize the overview of the pipeline system in Figure 3, and then decompose it into two parts: the lower part (blue rectangle) contains state tracking and symbolic DB execution, and the upper part consists of dialog act prediction and response generation conditioned on the state tracking and DB results. In this paper, we focus on the upper part (act prediction and response generation), assuming that the ground truth belief state and DB records are available. More specifically, we set out to investigate how to handle the large semantic space of dialog acts and leverage it to control neural response generation. Our approach encodes the history utterances into distributed representations to predict dialog acts and then uses the predicted dialog acts to control neural response generation. The key idea of our model is to devise a more compact structured representation of the dialog acts to mitigate the exponential growth issue, and then to incorporate the structural prior for the semantic space into the neural architecture design. Our proposed HDSA is inspired by linguistically-informed self-attention (Strubell et al., 2018), which combines multi-head self-attention with multi-task learning over NLP tasks to enhance the linguistic awareness of the model. In contrast, our model disentangles different heads to model different semantic conditions in a single task, which provides both better controllability and interpretability.

Dialog Act Representation
Dialog acts are defined as the semantic condition of the language sequence, comprising domains, actions, slots, and values.
Tree Structure Dialog acts are universally hierarchical, which is inherent in their different levels of semantic granularity. Each dialog act can be seen as a root-to-leaf path, as depicted in Figure 4. Such a tree structure captures the kinship between dialog acts, e.g., "restaurant-inform-location" is more similar to "restaurant-inform-name" than to "hotel-request-address". The canonical approach to encoding dialog acts is to concatenate the one-hot representation at each tree level into a flat vector, as in SC-LSTM (Wen et al., 2015; Budzianowski et al., 2018) (details are in the GitHub repository). However, such a representation impedes cross-domain transfer between different slots and cross-slot transfer between different values (e.g., the "recommend" under the restaurant domain is different from the "recommend" under the hospital domain). As a result, the sample complexity can grow combinatorially as the potential dialog act space expands in large-scale real-life dialog systems, where the number of domains and actions can grow dramatically. To address this issue, we propose a more compact graph representation.
Graph Structure The tree-based representation cannot capture cross-branch relationships such as "restaurant-inform-location" vs. "hotel-inform-location", leading to a huge expansion of the tree. Therefore, we propose to merge the cross-branch nodes that share the same semantics to build a compact acyclic graph, shown in the right part of Figure 4. Formally, we let $A$ denote the set of all the original dialog acts. For each act $a \in A$, we use $H(a) = \{b_1, \cdots, b_i, \cdots, b_L\}$ to denote its $L$-layer graph form, where $b_i$ is its one-hot representation in the $i$-th layer of the graph. For example, a dialog act "hotel-inform-name" has a compact graph representation $H(a) = \{b_1 : \ldots\}$. More formally, let $H_1, \cdots, H_L$ denote the number of nodes at layers $1, \cdots, L$, respectively.
Ideally, the total representation cost can be dramatically decreased from $O(\prod_{i=1}^{L} H_i)$ in the tree-based representation to $H_0 = \sum_{i=1}^{L} H_i$ in our graph representation. Due to the page limit, we include the full dialog act graph and its corresponding semantics in the Appendix. When multiple dialog acts $H(a)_1, \cdots, H(a)_k$ are involved in a single response, we propose to aggregate them as $A = \mathrm{BitOR}(H(a)_1, \cdots, H(a)_k)$, the $H_0$-dimensional graph representation, where BitOR denotes the bit-wise OR operator.
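The graph encoding and the BitOR aggregation can be sketched in a few lines. The node inventory below is a toy assumption (three domains, three actions, three slots), not the graph from the Appendix, and the helper names `encode_act` and `bit_or` are hypothetical; the per-layer one-hot vectors are concatenated layer by layer into one flat vector, which is also an assumption about the layout.

```python
import math

# Toy three-layer node inventory (domain, action, slot). This is an assumption
# for illustration; the real graph is listed in the paper's Appendix.
LAYERS = [
    ["restaurant", "hotel", "taxi"],
    ["inform", "request", "recommend"],
    ["name", "location", "price"],
]
H0 = sum(len(layer) for layer in LAYERS)              # graph cost: sum of layer sizes
TREE_COST = math.prod(len(layer) for layer in LAYERS)  # tree cost: product of layer sizes


def encode_act(act):
    """Encode 'domain-action-slot' as a flat multi-hot vector of length H0."""
    vec = [0] * H0
    offset = 0
    for layer, node in zip(LAYERS, act.split("-")):
        vec[offset + layer.index(node)] = 1  # one-hot within this layer
        offset += len(layer)
    return vec


def bit_or(*vectors):
    """Bit-wise OR over several equally long 0/1 vectors."""
    return [int(any(bits)) for bits in zip(*vectors)]


print(H0, TREE_COST)
print(encode_act("hotel-inform-name"))
print(bit_or(encode_act("restaurant-inform-price"),
             encode_act("hotel-inform-location")))
```

Even in this toy inventory the flat graph vector has 9 entries, whereas enumerating every root-to-leaf path of the tree would need 27.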
Generalization Ability Compared to the tree-based representation, the proposed graph representation can greatly lower the sample complexity when there is strong cross-branch overlap. Hence, it offers a great advantage when training instances are sparse. For example, suppose the ex-

Model
Figure 5 gives an overview of our dialog system. We now discuss its components below.

Dialog Act Predictor
We first explain the utterance encoder module, which uses a neural network $f_{ACT}$ to encode the dialog history (i.e., the concatenation of previous utterances from both the user and the system turns, $x_1, \cdots, x_m$) into distributed token-wise representations $u_1, \cdots, u_m$ together with an overall representation $\bar{u}$ as follows:
$$\bar{u}, u_1, \cdots, u_m = f_{ACT}(x_1, \cdots, x_m)$$
where $f_{ACT}$ can be a CNN, LSTM, Transformer, etc., and $\bar{u}, u_1, \cdots, u_m \in \mathbb{R}^D$ are the representations. The overall feature $\bar{u}$ is used to predict the hierarchical representation of the dialog act. That is, we output a vector $P_\theta(A) \in \mathbb{R}^{H_0}$, whose $i$-th component gives the probability of the $i$-th node in the dialog act graph being activated. In this predictor, $V_a \in \mathbb{R}^{D \times H_0}$ is the attention matrix, the weights $W_u$, $W_b$, $b$ are the learnable parameters that project the input to $\mathbb{R}^D$ space, and $\sigma$ is the sigmoid function. Here, we follow Budzianowski et al. (2018). For convenience, we use $\theta$ to collect all the parameters of the utterance encoder and action predictor. At training time, we propose to maximize the cross-entropy objective $L(\theta)$ as follows:
$$L(\theta) = A \cdot \log P_\theta(A) + (1 - A) \cdot \log(1 - P_\theta(A))$$
where $\cdot$ denotes the inner product between two vectors. At test time, we predict the dialog acts
$$\hat{A} = \mathbb{I}(P_\theta(A) > T)$$
where $T$ is the threshold and $\mathbb{I}$ is the indicator function.
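The training objective and the thresholded test-time prediction can be made concrete with a small sketch. It assumes the objective is the standard binary cross-entropy log-likelihood taken as an inner product over graph nodes; the logits, the threshold value 0.4, and the function names are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(target, probs):
    """A . log P + (1 - A) . log(1 - P), summed over graph nodes (to be maximized)."""
    return sum(a * math.log(p) + (1 - a) * math.log(1.0 - p)
               for a, p in zip(target, probs))


def predict_acts(probs, threshold):
    """Indicator of P > T, applied to each node independently."""
    return [1 if p > threshold else 0 for p in probs]


# Hypothetical pre-sigmoid scores for a graph with four nodes.
logits = [2.0, -1.0, 0.3, -3.0]
probs = [sigmoid(z) for z in logits]

predicted = predict_acts(probs, threshold=0.4)  # the threshold value is an assumption
print(predicted)
print(log_likelihood(predicted, probs))
```

Because every node is thresholded independently, the predictor can activate several nodes per layer, which is what allows one response to carry multiple dialog acts.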
Disentangled Self-Attention Recently, the self-attention-based Transformer model has achieved state-of-the-art performance on various NLP tasks such as machine translation (Vaswani et al., 2017) and language understanding (Devlin et al., 2018; Radford et al., 2018). The success of the Transformer is partly attributed to the multi-view representation provided by its multi-head attention architecture. Unlike the standard Transformer, which concatenates vectors from different heads into one vector, we propose to use a switch to activate certain heads and pass only their information to the next level (depicted on the right of Figure 5). Hence, we are able to disentangle the $H$ attention heads to model $H$ different semantic functionalities, and we refer to such a module as disentangled self-attention (DSA). Formally, we follow the canonical Transformer (Vaswani et al., 2017) to define the Scaled Dot-Product Attention function given the input query/key/value features $Q, K, V \in \mathbb{R}^{n \times D}$ as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{D}}\right) V$$
where $n$ denotes the sequence length of the input and $Q, K, V$ denote the query, key and value. Here, we use $H$ different self-attention functions with independent parameterization to compute the multi-head representation $G_i$ for $i = 1, \cdots, H$, where the input matrices $Q, K, V$ are computed from the input token embedding $x_{1:n} \in \mathbb{R}^{n \times D}$, and $D$ denotes the dimension of the embedding. The $i$-th head adopts its own parameters $W_i^Q$, $W_i^K$, $W_i^V$. We shrink the dimension at each head to $D/H$ to reduce the computation cost, as suggested in Vaswani et al. (2017). We first use the cross-attention network $f_{ATT}$ to incorporate the encoded dialog history $u_{1:m}$, and then we apply a position-wise feed-forward neural network $f_{PFF}$, a layer normalization $f_{LM}$, and a linear projection layer $f_{MLP}$ to obtain $G_i \in \mathbb{R}^{n \times D}$. These layers are shared across different heads. The main innovation of our architecture lies in disentangling the heads.
That is, instead of concatenating $G_i$ to obtain the layer output like the standard Transformer, we employ a binary switch vector $s = (\alpha_1, \ldots, \alpha_H) \in \{0, 1\}^H$ to control the $H$ different heads and aggregate them as an $n \times D$ output matrix $G = \sum_{i=1}^{H} \alpha_i G_i$. Specifically, the $j$-th row of $G$, denoted as $C_j \in \mathbb{R}^D$, can be understood as the output corresponding to the $j$-th input token $y_j$ in the response. This approach is similar to a gating function that selectively passes the desired information. By manipulating the attention-head switch $s$, we can better control the information flow inside the self-attention module. We illustrate the gated summation over multiple heads in Figure 6.
Hierarchical DSA When the dialog system involves a more complex ontology, the semantic space can grow rapidly. As a consequence, a single-layer disentangled self-attention with a large number of heads struggles to handle the complexity. Therefore, we further propose to stack multiple DSA layers to better model the huge semantic space with strong compositionality. As depicted in Figure 3, the lower layers are responsible for grasping coarse-level semantics and the upper layers are responsible for capturing fine-level semantics. Such progressive generation bears a strong similarity to the way humans construct precise responses. In each DSA layer, we feed the utterance encoding $u_{1:m}$ and the previous layer's output $C_{1:n}$ as the input to obtain the new output matrix $G$. We collect the output $O_{1:n} = C_{1:n}$ from the last DSA layer to compute the joint probability over an observed sequence $y_{1:n}$, which can be decomposed as a product of probabilities:
$$P_\beta(y_{1:n} \mid u_{1:m}, s_{1:L}) = \prod_{l=1}^{n} P_\beta(y_l \mid y_{<l}, u_{1:m}, s_{1:L})$$
where each factor is obtained by applying the softmax to the projected decoder output, $\mathrm{softmax}(W_v^\top O + b_v)$, at the corresponding position; $W_v \in \mathbb{R}^{D \times V}$ and $b_v \in \mathbb{R}^V$ are the projection weight and bias onto a vocabulary of size $V$, $l \in \{1, \cdots, n\}$ is the index, softmax denotes the softmax operation, $s_{1:L}$ denotes the set of attention switches $s_1, \cdots, s_L$ over the $L$ layers, and $\beta$ denotes all the decoder parameters.
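The gated summation over heads is the core operation of a DSA layer, and a minimal sketch may help. Matrices are plain nested lists here, the head outputs are made-up numbers, and the function name `gated_sum` is hypothetical; a real implementation would use a tensor library.

```python
def gated_sum(head_outputs, switch):
    """Compute G = sum_i alpha_i * G_i.

    head_outputs: list of H matrices, each n x D (nested lists).
    switch: list of H binary values (alpha_1, ..., alpha_H).
    """
    if len(head_outputs) != len(switch):
        raise ValueError("need exactly one switch value per head")
    n, d = len(head_outputs[0]), len(head_outputs[0][0])
    out = [[0.0] * d for _ in range(n)]
    for alpha, g in zip(switch, head_outputs):
        if alpha:  # heads that are switched off contribute nothing
            for j in range(n):
                for k in range(d):
                    out[j][k] += g[j][k]
    return out


# Three hypothetical head outputs for a sequence of n = 2 tokens with D = 2.
heads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[100.0, 200.0], [300.0, 400.0]],
]
print(gated_sum(heads, [1, 0, 1]))
```

Unlike concatenation, the sum keeps the output at n x D regardless of how many heads are active, so the same downstream layers work for any switch setting.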
Recall that the graph structure of dialog acts is explicitly encoded into HDSA as a prior, where each head in HDSA is set to model a designated semantic node on the graph. Consequently, the hierarchical representation $A$ can be used to control the head switches $s_{1:L}$. At training time, the model parameters $\beta$ are optimized from the training data triples $(y_{1:n}, u_{1:m}, A)$ to maximize the likelihood of the ground truth responses given the dialog history and the ground truth acts. Formally, we propose to maximize the following objective function:
$$L(\beta) = \log P_\beta(y_{1:n} \mid u_{1:m}, s_{1:L} = A)$$
At test time, we propose to use the predicted dialog act $\hat{A}$ to control the language generation. The errors can be seen as coming from two sources: inaccurate dialog act prediction and imperfect response generation.
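One simple way to turn the graph representation into per-layer head switches is to slice the flat vector layer by layer. The sketch below assumes that the vector is ordered by layer (all first-layer nodes, then all second-layer nodes, and so on); the function name `split_switches` and the toy layer sizes are hypothetical.

```python
def split_switches(act_vector, layer_sizes):
    """Slice a flat 0/1 act vector into one switch vector per DSA layer.

    Assumes the vector lists the nodes of layer 1 first, then layer 2, etc.
    """
    if len(act_vector) != sum(layer_sizes):
        raise ValueError("act vector length must equal the total number of nodes")
    switches, start = [], 0
    for size in layer_sizes:
        switches.append(act_vector[start:start + size])
        start += size
    return switches


# Toy example: three layers with three nodes each (sizes are an assumption).
act = [1, 1, 0, 1, 0, 0, 0, 1, 1]
for layer, s in enumerate(split_switches(act, [3, 3, 3]), start=1):
    print(f"layer {layer} switch: {s}")
```

At training time the vector would come from the annotated acts, and at test time from the thresholded predictor output; the slicing itself is identical in both cases.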

Experiments
Dataset To evaluate our proposed methods, we use the recently proposed MultiWOZ dataset (Budzianowski et al., 2018) as the benchmark, which was specifically designed to cover challenging multi-domain, large-scale dialog management (see the summary in Table 1). This new benchmark involves a much larger dialog action space due to the inclusion of multiple domains and a complex database backend. We represent the 625 potential dialog acts as a three-layered hierarchical graph with a total of 44 nodes (see Appendix for the complete graph). We follow Budzianowski et al. (2018)
Dialog Act Prediction We first train dialog act predictors using different neural networks to compare their performance. The experimental results (measured in F1 scores) are reported in Table 2.
Experimental results show that fine-tuning the pretrained BERT (Devlin et al., 2018) leads to significantly better performance than the other models. Therefore, we use it as the dialog act prediction model in the following experiments. Instead of jointly training the predictor and the response generator, we simply fix the trained predictor when learning the generator $P_\beta(y)$.

Automatic Evaluation
We follow Budzianowski et al. (2018) in using delexicalized BLEU (Papineni et al., 2002), inform rate and request success as three basic metrics to compare the delexicalized generation against the delexicalized reference. We additionally use Entity F1 (Rojas-Barahona et al., 2017) to evaluate the entity coverage accuracy (including all slot values, days, numbers, references, etc.), and restore-BLEU to compare the restored generation against the raw reference. The evaluation metrics are detailed in the supplementary material. Before diving into the experiments, we first list the models we experiment with: 1. Without Dialog Act, we use the official code: (i) LSTM (Budzianowski et al., 2018): it uses the history as the attention context and applies the belief state and KB results as side inputs. (ii) Transformer (Vaswani et al., 2017): it uses a stacked Transformer architecture with the dialog history as the source attention context. To make these models comparable, we choose their hidden dimensions so that their total parameter sizes are comparable. We report the performance of the different models in Table 3 and briefly summarize the following points: (i) when the sparse tree representation is fed to the input/output layer (Transformer-in/out), the model is not able to capture the large semantic space of dialog acts from sparse training instances, which unsurprisingly leads to only a limited performance gain over models without dialog act input. (ii) the graph dialog act representation is essential in reducing the sample complexity; replacing the tree representation with it leads to significant and consistent improvements across different models. (iii) the hierarchical graph structure prior is an efficient inductive bias; the structure-aware HDSA can better model the compositional semantic space of dialog acts and yields a decent gain over Transformer-in/out with a flattened input vector.
(iv) our approaches yield a significant gain (10+%) on the Inform/Request success rate, which shows that the explicit structured representation of dialog acts is very effective in guiding the dialog response towards accomplishing the desired tasks. (v) the generator is greatly hindered by the predictor accuracy: by feeding the ground truth acts, the proposed HDSA achieves an additional gain of 7.0 in BLEU and 21% in Entity F1.

Generalization Ability To better understand the performance gain of the hierarchical graph-based representation, we design synthetic tests to examine its generalization ability. Specifically, we divide the dialog acts into five categories based on their frequency of appearance in the training data: very few shot (1-100 times), few shot (100-500 times), medium shot (500-2K times), many shot (2K-5K times), and very many shot (5K+ times). We compute the average BLEU score of the turns within each frequency category and plot the results in Figure 7. First, by comparing Transformer-in with the compact Graph-Act against Transformer-in with the sparse Tree-Act, we observe that in the few-shot categories the graph act significantly boosts the performance, which supports our conjecture that the graph representation lowers the sample complexity and generalizes better to unseen (or less frequent) cases. Furthermore, by comparing Graph-Act Transformer-in with HDSA, we observe that HDSA achieves better results by exploiting the hierarchical structure in the dialog act space.
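The frequency-based grouping can be sketched as follows. Since the stated ranges share their boundaries (e.g. 100 appears in two ranges), the sketch assumes that upper bounds are exclusive; it also assumes one dialog act per turn for simplicity. The function names, act names, and BLEU values are hypothetical.

```python
from collections import defaultdict

# (exclusive upper bound, category name); treating bounds as exclusive is an assumption.
BUCKETS = [
    (100, "very few shot"),
    (500, "few shot"),
    (2000, "medium shot"),
    (5000, "many shot"),
]


def bucket_of(count):
    """Map a training-set frequency to its category name."""
    for upper, name in BUCKETS:
        if count < upper:
            return name
    return "very many shot"


def average_by_bucket(turns, act_counts):
    """Average a per-turn score within each frequency category.

    turns: list of (act, score) pairs, one act per turn (a simplification).
    act_counts: dict mapping each act to its training-set frequency.
    """
    grouped = defaultdict(list)
    for act, score in turns:
        grouped[bucket_of(act_counts[act])].append(score)
    return {name: sum(scores) / len(scores) for name, scores in grouped.items()}


# Made-up counts and scores, purely for illustration.
act_counts = {"act_a": 50, "act_b": 3000}
turns = [("act_a", 10.0), ("act_a", 20.0), ("act_b", 30.0)]
print(average_by_bucket(turns, act_counts))
```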

Human Evaluation
Response Quality Owing to the low consistency between automatic metrics and human perception on conversational tasks, we also recruit trusted judges from Amazon Mechanical Turk (AMT) (with prior approval rate >95%) 8 to perform a human comparison between the responses generated by HDSA and SC-LSTM. Three criteria are adopted: (i) relevance: the response correctly answers the most recent user query; (ii) coherence: the response is coherent with the dialog history; (iii) consistency: the generated sentence is semantically aligned with the ground truth. During the evaluation, each AMT worker is presented with two responses separately generated by HDSA and SC-LSTM, as well as the ground truth dialog history. Each HIT assignment has 5 comparison problems, and we have a total of 200 HIT assignments to distribute. In the end, we perform statistical analysis on the harvested results after rejecting the failure cases and display the statistics in Table 4. From the results, we can observe that our model significantly outperforms SC-LSTM in coherence, i.e., our model can better control the generation to maintain its coherence with the dialog history.
Semantic Controllability In order to quantitatively compare the controllability of HDSA, Graph-Act Transformer-in, and SC-LSTM, we further design a synthetic NLG experiment, where we randomly pick 50 dialog histories from the test set as contexts, and then randomly select 3 dialog acts and their combinations as the semantic condition to control the model's response generation. We provide an example in the supplementary material to visualize the evaluation procedure. Quantitatively, we hire human workers to rate (as match, partial match, or total mismatch) whether the model follows the given semantic condition to generate coherent sentences. The experimental results are reported in the bottom half of Table 4, which demonstrate that both the compact dialog act representation and the hierarchical structure prior are essential for controllability.
8 https://www.mturk.com/

Discussion
Graph Representation as Transfer Learning The proposed graph representation works well in cases where the sets of domain slot-value pairs have significant overlaps, as with Restaurant and Hotel, where the knowledge is easy to transfer. In cases where such exact overlap is scarce, we propose to group similar concepts together as a hypernym and use one switch to control the hypernym, which can generalize the proposed method to broader domains.
Compression vs. Expressiveness A trade-off that we found in our structure-based encoding scheme is that when multiple dialog acts with overlaps in the action layer are involved, ambiguity arises under the graph representation. For example, when the two dialog acts "restaurant-inform-price" and "hotel-inform-location" are merged as "[restaurant, hotel] → [inform] → [price, location]", the current compressed representation is unable to distinguish them from "hotel-inform-price" or "restaurant-inform-location". Though these unnatural cases are very rare in the given dataset and do not hurt the performance per se, we plan to address this expressiveness problem in future research.

Conclusion and Future Work
In this paper, we propose a new semantically controlled neural generation framework to resolve the scalability and generalization problems of existing models. Currently, our proposed method only considers the supervised setting where annotated dialog acts are available, and we have not investigated the situation where such annotation is not available. In the future, we intend to infer the dialog acts from the annotated responses and use such noisy data to guide the response generation.

A Details of Model Implementation
Here we explain in detail the implementation of the baselines and our proposed HDSA model. On the encoder side, we use a three-layered Transformer with an input embedding size of 64 and 4 heads; the dimensions of the query, key and value are all set to 16. In the output layer, the results of the 4 heads are concatenated to obtain a 64-dimensional vector, which is first projected up to 256 dimensions and then projected back to 64 dimensions. By stacking three layers of this architecture, we obtain a series of 64-dimensional vectors. Following BERT, we use the first symbol as the sentence-wise representation $\bar{u}$ and compute its matching score against all the tree nodes to predict the representation of dialog acts $\hat{A}$. The decoder takes as input features $x_1, \cdots, x_n$ of any length, each with a dimension of 64. In the first layer, since we have 10 heads, the dimension for each head is 6, so the key and query feature dimensions are fixed to 6; the second layer uses a dimension of 9 and the third a dimension of 2. The value feature dimension is fixed to 16 in all layers, the same as on the encoder side. After self-attention, the position-wise feed-forward neural network projects each feature back to 64 dimensions, which is further projected to the 3.1K-dimensional vocabulary to model word probability.
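The per-head key/query dimensions quoted above are consistent with integer division of the 64-dimensional embedding by the number of heads in each layer. The short check below illustrates this; only the 10 heads of the first layer are stated in the text, so the head counts 7 and 27 for the second and third layers are an assumption, chosen because they reproduce the stated dimensions 9 and 2 and sum with 10 to the 44 graph nodes.

```python
EMBED_DIM = 64

# Heads (graph nodes) per decoder layer. Only the first value is stated in the
# text; 7 and 27 are assumptions consistent with the reported dimensions.
HEADS_PER_LAYER = (10, 7, 27)


def head_dims(embed_dim, heads_per_layer):
    """Per-head key/query dimension in each layer, using integer division."""
    return [embed_dim // heads for heads in heads_per_layer]


print(head_dims(EMBED_DIM, HEADS_PER_LAYER))
print(sum(HEADS_PER_LAYER))
```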

B Automatic Evaluation
We simply demonstrate an example of our automatic evaluation metrics in Figure 9.

C Baseline Implementation
Here we visualize in Figure 8 how we feed the dialog act input as an embedding into the Transformer to control the sequence generation process.

D Human Evaluation Interface
To better understand the human evaluation procedure, we demonstrate the user interface in Figure 10.

E Controllability Evaluation
To better understand the results, we depict an example in Figure 11, where 3 different dialog acts are picked as the semantic condition to constrain the response generation.

F Enumeration of all the Dialog Acts
Here we first enumerate the node semantics of the graph representation as follows: