Hierarchical Transformer for Task Oriented Dialog Systems

Generative models for dialog systems have gained much interest because of the recent success of RNN and Transformer based models in tasks like question answering and summarization. Although the task of dialog response generation is generally seen as a sequence to sequence (Seq2Seq) problem, researchers in the past have found it challenging to train dialog systems using the standard Seq2Seq models. Therefore, to help the model learn meaningful utterance and conversation level features, Sordoni et al. (2015b), Serban et al. (2016) proposed Hierarchical RNN architecture, which was later adopted by several other RNN based dialog systems. With the transformer-based models dominating the seq2seq problems lately, the natural question to ask is the applicability of the notion of hierarchy in transformer-based dialog systems. In this paper, we propose a generalized framework for Hierarchical Transformer Encoders and show how a standard transformer can be morphed into any hierarchical encoder, including HRED and HIBERT like models, by using specially designed attention masks and positional encodings. We demonstrate that Hierarchical Encoding helps achieve better natural language understanding of the contexts in transformer-based models for task-oriented dialog systems through a wide range of experiments.


Introduction
Dialog systems are concerned with replicating the human ability to make conversation. In a generative dialog system, the model aims at generating coherent and informative responses given a dialog context and, optionally, some external information through knowledge bases (Wen et al., 2017) or annotations, e.g., belief states, dialog acts, etc. (Zhao et al., 2017).
* Equal contributions.
[1] Experiments in this paper: https://github.com/bsantraigi/HIER
[2] PyTorch implementation of Hierarchical Transformer Encoder: https://github.com/bsantraigi/hier-transformer-pytorch
A dialog is usually represented as a series of utterances. However, it is not sufficient to view each utterance independently for engaging in a conversation. In a dialogue between humans, the speakers communicate both utterance level and dialog level information. E.g., dialog intent often cannot be detected by looking at a single utterance, whereas dialog acts are specific to each utterance and change throughout a conversation. Intuitively, we can instruct the model to achieve both utterance level and dialog level understanding separately through a hierarchical encoder (Serban et al., 2016).
There has been a lot of interest in the past in using the Hierarchical Encoder-Decoder (HRED) model for encoding utterances in many RNN based dialog systems. However, since the rise of Transformers and self-attention (Vaswani et al., 2017), the use of hierarchy has not been explored further for transformer-based dialog models. Past research and user studies have also shown that hierarchy is an important aspect of human conversation (Jurafsky, 2000). But most previous transformer-based works have focused on training models either as language models (Budzianowski and Vulić, 2019; Zhang et al., 2020b) or as standard (non-hierarchical) Seq2Seq models (Zhang et al., 2020a; Wang et al., 2020) with certain task-specific extensions. This paper bridges these two popular approaches of transformers and hierarchical encoding for dialog systems to propose a family of Hierarchical Transformer Encoders. Although, arguably, the self-attention mechanism of standard encoders might automatically learn such a scheme during the training process, our empirical results show that forcing this inductive bias by manual design, as proposed here, leads to better performing models.
Our contributions in this paper include:
• We propose a generalized framework for hierarchical encoders in transformer based models that covers a broader range of architectures, including existing encoding schemes like HRED/HIBERT and possibly other novel variants. We call each member of this family of hierarchical transformer encoders an HT-Encoder.
• Then, we formulate a straightforward algorithm for converting an implementation of a standard transformer encoder into an HT-Encoder by changing the attention mask and the positional encoding.
• Building upon that, we show how an HRED/HIBERT like hierarchical encoder (HIER-CLS) can be implemented using our HT-Encoder framework.
• We also showcase a novel HT-Encoder based model, called HIER, with a context encoding mechanism different from HRED. We show that these simple HT-Encoder based baselines achieve performance on par with or better than many recent models with more sophisticated architectures or training procedures. We make a thorough comparison with many recently proposed models in four different experimental settings for the dialog response generation task.
• We further apply HT-Encoder to a state-of-the-art model, Marco (Wang et al., 2020), for task-oriented dialog systems and obtain improved results.

Models
Formally, the task of a dialog system is to predict a coherent response, r, given a dialog context c. In the case of a goal-oriented dialog system, context c might consist of the dialog history, C_t = [U_1, S_1, ..., U_t], and optionally a belief state (dialog act, slot values, intent, etc.) b_t, when available. Here, U_i and S_i represent the user and system utterances at turn i, respectively. The actual target response following C_t is the system utterance S_t.
Figure 1: Detailed architecture for a Hierarchical Transformer Encoder or HT-Encoder. The main inductive bias incorporated in this model is to encode the full dialog context hierarchically in two stages. This is done by the two encoders, 1) the Shared Utterance Encoder (M layers) and 2) the Context Encoder (N layers), as shown in the figure. The Shared Encoder first encodes each utterance (u_1, u_2, ..., u_t) individually to extract the utterance level features. The same parameterized Shared Encoder is used for encoding all utterances in the context. In the second stage, the Context Encoder encodes the full context using a single transformer encoder to extract dialog level features. The attention mask in the Context Encoder decides how the context encoding is done and is a choice of the user. The one depicted in the figure is for the HIER model described in Section 2.3. Only the final utterance in the Context Encoder gets to attend over all the previous utterances, as shown. This allows the model to have access to both utterance level features and dialog level features until the last layer of the encoding process. Notation: utterance i is u_i = [w_i1, ..., w_i|u_i|], where w_ij is the word embedding for the j-th word in the i-th utterance.

Hierarchical Transformer Encoders (HT-Encoder)
Like the original HRED architecture, HT-Encoder also has two basic components, a shared utterance encoder and a context encoder. The shared utterance encoder, or Shared Encoder for short, is the first phase of the encoding process, where each utterance is processed independently to obtain utterance level representations. In the second phase, the Context Encoder is used to process the full context together. These context level representations are then used for tasks like dialog state tracking or response generation. We propose two different types of Hierarchical Encoding schemes for the transformer model.
1. HIER-CLS: HRED passes a single embedding per utterance to its context encoder. Similarly, in HIER-CLS, the context encoder utilizes only a single utterance embedding for each utterance. We do this by taking the contextual embedding of the first token (often termed the "CLS" token in transformer based models) of each utterance.
2. HIER: Recent works have shown the importance of contextual word embeddings. In HIER, we consider contextual embedding of all utterance tokens as input to the context encoder. We simply concatenate the whole sequence of contextual embeddings and forward it to the context encoder.
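The difference between the two schemes comes down to which shared-encoder outputs the context encoder sees. The following is a minimal sketch in plain Python (lists instead of PyTorch tensors, to stay dependency-free). The function name `context_inputs` and the toy embeddings are assumptions made for illustration; in the HT-Encoder itself this selection is realized through attention masks rather than explicit slicing.

```python
from typing import List

Vector = List[float]


def context_inputs(utterance_states: List[List[Vector]], scheme: str) -> List[Vector]:
    """Select what the context encoder receives from the shared-encoder outputs.

    utterance_states[i][j] is the contextual embedding of token j in utterance i.
    """
    if scheme == "HIER-CLS":
        # One vector per utterance: the embedding of its first ("CLS") token.
        return [tokens[0] for tokens in utterance_states]
    if scheme == "HIER":
        # Every token of every utterance, concatenated in order.
        return [vec for tokens in utterance_states for vec in tokens]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical shared-encoder outputs with embedding size 2.
states = [
    [[0.1, 0.2], [0.3, 0.4]],              # utterance 1: 2 tokens
    [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]],  # utterance 2: 3 tokens
]
cls_inputs = context_inputs(states, "HIER-CLS")
hier_inputs = context_inputs(states, "HIER")
```

With these toy inputs, HIER-CLS hands two vectors to the context encoder (one per utterance), while HIER hands over all five token embeddings.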

Conversion Algorithm: Standard Encoder to HT-Encoder
In this section, we show how the two-step process of hierarchical encoding can be achieved using a single standard transformer encoder. If we want an M-layer utterance encoder followed by an N-layer context encoder, we start with an (M + N)-layer standard encoder. Then, by applying two separate masks as designed below, we convert the standard encoder into an HT-Encoder. First, we need to encode the utterances independently. Within the self-attention mechanism of a transformer encoder, the attention mask controls which tokens each token gets to attend to. If we apply a block-diagonal mask, with each block sized to the length of the corresponding utterance (see Figure 2), to the concatenated sequence of tokenized utterances, we effectively achieve the same process of utterance encoding. We call this block-diagonal mask for utterance encoding the UT-Mask. Similarly, another attention mask (the CT-Mask) can implement the context encoding phase by allowing tokens to attend beyond their respective utterance boundaries. See the two mask matrices in the accompanying figure.
Figure 2: In this example, the context comprises three utterances of lengths 0, 1 and 2, respectively. C_I indicates which utterance each token belongs to. The entries in P_I denote the position of each token relative to its utterance.
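To make the conversion concrete, the following minimal Python sketch assigns a mask to each layer of an (M + N)-layer encoder: the first M layers receive the UT-Mask and the remaining N layers receive a CT-Mask. It uses plain lists instead of PyTorch tensors to stay dependency-free. The function names (`block_diagonal_mask`, `mask_schedule`), the utterance lengths, and the all-ones CT-Mask used as a placeholder are illustrative assumptions, not part of the released implementation.

```python
from typing import List

Mask = List[List[int]]  # mask[r][c] == 1 means token r may attend to token c


def block_diagonal_mask(lengths: List[int]) -> Mask:
    """UT-Mask: one all-ones block per utterance along the diagonal."""
    total = sum(lengths)
    mask = [[0] * total for _ in range(total)]
    start = 0
    for n in lengths:
        for r in range(start, start + n):
            for c in range(start, start + n):
                mask[r][c] = 1
        start += n
    return mask


def mask_schedule(m_layers: int, n_layers: int, ut: Mask, ct: Mask) -> List[Mask]:
    """Per-layer masks: UT-Mask for the first M layers, CT-Mask for the last N."""
    return [ut] * m_layers + [ct] * n_layers


lengths = [2, 3]  # hypothetical context with two utterances
ut = block_diagonal_mask(lengths)
# Placeholder CT-Mask (assumption): every token may attend to every token.
ct = [[1] * sum(lengths) for _ in range(sum(lengths))]
schedule = mask_schedule(2, 3, ut, ct)  # M = 2, N = 3
```

Because only the masks differ between the two phases, the same stack of encoder layers can serve as both the utterance encoder and the context encoder.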

UT-Mask and Local Positional Encoding
The steps for obtaining the UT-Mask and positional encoding for the utterance encoder are given below and are illustrated with an example in Figure 2. C is the dialog context to be encoded, and w_ij is the j-th token of the i-th utterance. In C_I, each utterance index i is repeated |u_i| (the length of u_i) times. C_IR is a square matrix created by repeating C_I. P_I has the same dimensions as C_I, and it stores the position of each token w_ij in context C relative to utterance u_i. P : I → R^d is the positional encoding function that takes an index (or indices) and returns the corresponding d-dimensional positional embedding. A is the UT-Mask for the given context C and its utterance indices C_I. 1(.) is an indicator function that returns true when the input condition holds, and it is applied to a matrix or vector element-wise.
C_IR = repeat(C_I, len(C_I), 0)
A = 1(C_IR == C_IR^T)
CT-Masks for Models: The attention masks for context encoding depend on the choice of model architecture. We provide the details of the architectures and the attention masks used in our experiments in the subsequent section. Other masks are possible, but these are the ones we found to work best in their respective settings.
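The UT-Mask and local position steps described at the start of this section can be sketched in a few lines of plain Python. This is a sketch under stated assumptions: utterances are indexed from 0, the mask is obtained by comparing C_IR with its transpose (one reading of the definitions of C_IR, A and the indicator function), and P is taken to be the sinusoidal encoding of Vaswani et al. (2017), which the text does not specify. All function names and the utterance lengths are hypothetical.

```python
import math
from typing import List, Tuple


def context_indices(lengths: List[int]) -> Tuple[List[int], List[int]]:
    """Return (C_I, P_I) for utterances with the given token counts."""
    c_i, p_i = [], []
    for utt, n in enumerate(lengths):
        c_i.extend([utt] * n)  # utterance index repeated |u_i| times
        p_i.extend(range(n))   # position restarts at 0 in every utterance
    return c_i, p_i


def ut_mask(c_i: List[int]) -> List[List[int]]:
    """A = 1(C_IR == C_IR^T), with C_IR built by stacking copies of C_I."""
    c_ir = [list(c_i) for _ in c_i]  # C_IR[r][c] == C_I[c]
    size = len(c_i)
    return [[int(c_ir[r][c] == c_ir[c][r]) for c in range(size)] for r in range(size)]


def sinusoidal(position: int, d: int) -> List[float]:
    """Assumed form of P: the sinusoidal encoding of Vaswani et al. (2017)."""
    enc = []
    for k in range(d):
        angle = position / (10000 ** (2 * (k // 2) / d))
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


c_i, p_i = context_indices([2, 3, 1])  # hypothetical utterance lengths
mask = ut_mask(c_i)
local_pe = [sinusoidal(p, 4) for p in p_i]  # P[P_I] with d = 4
```

Since P_I restarts at every utterance boundary, the first tokens of all utterances receive the same positional embedding, which is what makes the encoding local to each utterance.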

Model Architectures
We propose several model architectures to test the effectiveness of the proposed HT-Encoder in various experimental settings. These architectures are designed to fit well with the four experimental settings (see Section 3.1) of the response generation task of the MultiWOZ dataset in terms of input and output.
The tested model architectures are as follows. Using the HIER encoding scheme described in Section 2.1, we test two model architectures for response generation, namely HIER and HIER++.
HIER: HIER is the most straightforward model architecture with an HT-Encoder replacing the encoder in a Transformer Seq2Seq. The working of the model is shown in Figure 3a. First, in the utterance encoding phase, each utterance is encoded independently with the help of the UT-Mask. In the second half of the encoder, we apply a CT-Mask as depicted by the figure's block attention matrix.
Block B ij is a matrix which, if all ones, means that utterance i can attend to utterance j's contextual token embeddings. The local and global positional encodings are applied, as explained in Section 2.2. A standard transformer decoder follows the HT-Encoder for generating the response.
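A token-level version of the block attention matrix for HIER might look like the sketch below. It assumes that every utterance keeps attending to its own tokens (diagonal blocks of ones) and that only the final utterance additionally attends to all earlier utterances, following the description in Figure 1. The exact pattern of the figure is not reproduced in the text, so this layout, the function name `hier_ct_mask`, and the utterance lengths are assumptions.

```python
from typing import List


def hier_ct_mask(lengths: List[int]) -> List[List[int]]:
    """Token-level CT-Mask: mask[r][c] == 1 lets token r attend to token c."""
    # Utterance index of every token in the concatenated context.
    c_i = [utt for utt, n in enumerate(lengths) for _ in range(n)]
    last = len(lengths) - 1
    return [
        # Allowed if both tokens share an utterance (block B_ii is all ones)
        # or if the attending token belongs to the final utterance.
        [int(row_utt == col_utt or row_utt == last) for col_utt in c_i]
        for row_utt in c_i
    ]


mask = hier_ct_mask([2, 1, 2])  # hypothetical utterance lengths
```

In this example the rows of the last utterance are all ones, while the rows of earlier utterances remain restricted to their own block.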
The CT-Mask for HIER was experimentally obtained after trying a few other variants. The intuition behind this mask was that the model should reply to the last user utterance in the context. Hence, we design the attention mask to apply cross attention between all the utterances and the last utterance (see Figure 3a).
HIER++: HIER++ is the extended version of the HIER model, as shown in Figure 3b, that also takes the dialog act label as input. The dialog act representation consists of the domain, act, and slot values. A linear feedforward layer (FFN) acts as the embedding layer for converting the 44-dimension multi-hot dialog act representation. The output embedding is added to the input token embeddings of the decoder in the HIER++ model. Similar to HDSA, we also use ground truth dialog acts during training, and predictions from a fine-tuned BERT model during validation and testing. HIER++ is applied to the Context-to-Response generation task of the MultiWOZ dataset.
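The dialog act input of HIER++ can be illustrated with a small sketch: a linear layer maps the 44-dimensional multi-hot act vector to the model dimension, and the result is added to every decoder input token embedding. Only the size 44 comes from the text. The model dimension, the random weights, the active act entries, and the helper names below are hypothetical, and plain Python lists stand in for PyTorch tensors.

```python
import random
from typing import List

ACT_DIM = 44  # size of the multi-hot dialog act vector (from the text)
D_MODEL = 4   # hypothetical embedding size, kept tiny for readability


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """y = W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def add_act_embedding(token_embeddings: List[List[float]],
                      act_embedding: List[float]) -> List[List[float]]:
    """Add the same act embedding to every decoder input token embedding."""
    return [[t + a for t, a in zip(tok, act_embedding)] for tok in token_embeddings]


rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(ACT_DIM)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

act = [0.0] * ACT_DIM
act[3] = act[10] = 1.0  # two active entries, chosen arbitrarily

act_embedding = linear(act, weight, bias)
decoder_tokens = [[0.0] * D_MODEL for _ in range(3)]  # three placeholder token embeddings
decoder_inputs = add_act_embedding(decoder_tokens, act_embedding)
```

With a multi-hot input, the linear layer simply sums the weight columns of the active entries, so it behaves like an embedding lookup over several act labels at once.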
HIER-CLS: As described in Section 2.1, the encoding scheme of HIER-CLS is more akin to the HRED and HIBERT models. It differs from HIER++ only with respect to the CT-Mask.
Ablations To understand the individual impact of UT-Mask and CT-Mask, we ran the same experiments with the following model ablations.
3. SET++: An alternative version of SET with dialog-act input to the decoder similar to HIER++.
HIER-Joint: Finally, we propose the HIER-Joint model, suitable for the end-to-end response generation task of the MultiWOZ dataset. The HIER-Joint model comprises an HT-Encoder and three transformer decoders for decoding the belief state sequence, the dialog act sequence, and the response. It is jointly trained to predict all three sequences simultaneously. As belief state labels can help dialog act generation, and similarly, both belief and act labels can assist response generation, we pass the token embeddings from the belief decoder and the act decoder to the response decoder. The act decoder also receives the mean token embedding from the belief decoder.

Experimental Framework
Our implementation is based on the PyTorch library. All the models use a vocabulary of size 1,505. We generate responses using beam search with beam width 5. The model optimizes a cross entropy loss. Full details of the model parameters are given in the supplementary material.
Dataset We use MultiWOZ (Budzianowski et al., 2018), a multi-domain task-oriented dataset. It contains a total of 10,400 English dialogs divided into training (8,400), validation (1,000) and test (1,000). Each turn in the dialog is considered as a prediction problem with all utterances up to that turn as the context.
Among the baselines, the vanilla Transformer (Vaswani et al., 2017) concatenates the utterances in the dialog context to obtain a single source sequence and treats the task as a sequence transduction problem. HDSA (Chen et al., 2019) uses a dialog act graph to control the state of the attention heads of a Seq2Seq transformer model. Zhang et al. (2020a) propose to augment the training dataset by building up a one-to-many state-to-action map, so that the system can learn a more balanced distribution for the action prediction task. Using this method, they train a domain-aware multi-decoder (DAMD) network for jointly predicting belief, action and response. As each agent response may cover multiple domains, acts or slots at the same time, Marco (Wang et al., 2020) learns to generate the response by attending over the predicted dialog act sequence at every step of decoding. SimpleTOD (Hosseini-Asl et al., 2020) and SOLOIST (Peng et al., 2020a) are both based on the GPT-2 (Radford et al., 2019) architecture. The main difference between these two architectures is that SOLOIST further pretrains the GPT-2 model on two more dialog corpora before fine-tuning on the MultiWOZ dataset.

Task Settings:
Following the literature (Zhang et al., 2020a;Peng et al., 2020a), we now consider four different settings for evaluating the strength of hierarchical encoding.

No Annotations
First, to simply gauge the benefit of using a Hierarchical encoder in a Transformer Seq2Seq model, we compare the performance of HIER to other baselines including HRED and vanilla Transformer without any belief states and dialog act annotations.

Oracle Policy
In this setting, several recently proposed model architectures for the response generation task of MultiWOZ are compared against each other in the presence of ground truth belief state and dialog act annotations. This experiment helps us understand the models' capabilities towards generating good responses (BLEU score) when true belief states and/or dialog acts are available to them.

Context-to-Response
The model is given true belief states and DB search results in this experiment, but it needs to generate the dialog act and response during inference. Some of the baselines generate the dialog act as an intermediate step in their architecture, whereas others use a fine-tuned BERT model.

End-to-End
This is the most realistic evaluation scheme, where a model has to predict both belief states and dialog acts (or one of these, as per the model's input requirement) for searching the DB or generating the response.

Evaluation Metrics
We used the official evaluation metrics [7] released by the authors of the MultiWOZ dataset (Budzianowski et al., 2018): Delexicalized-BLEU score, INFORM rate (measures how often the entities provided by the system are correct), SUCCESS rate (reflects how often the system is able to answer all the requested attributes), Entity-F1 score (Wen et al., 2017) (measures the entity coverage accuracy), and Combined Score (S = BLEU + 0.5 × (Inform + Success)) to measure the overall quality.
Training Cross-entropy losses over the ground truth response and/or belief and act sequences are used for training the models. We did a hyperparameter search using the Optuna library (Akiba et al., 2019) by training the model for up to 5 epochs. Final models were trained [8] for up to 30 epochs with early stopping.
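As a quick worked instance of the Combined Score formula, the snippet below evaluates S = BLEU + 0.5 × (Inform + Success) in Python. The input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this paper.

```python
def combined_score(bleu: float, inform: float, success: float) -> float:
    """Combined Score: S = BLEU + 0.5 * (Inform + Success)."""
    return bleu + 0.5 * (inform + success)


# Hypothetical metric values, for illustration only.
score = combined_score(bleu=20.0, inform=80.0, success=70.0)
print(score)  # 95.0
```

Because Inform and Success are averaged while BLEU enters at full weight, a one-point change in BLEU moves the Combined Score as much as a two-point change in either of the other metrics.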

Results
For the four experimental settings discussed in Section 3.1, we showcase results in Tables 2 through 5. Table 2 shows the results from our experiments when no oracle is present. By comparing the performance of the Transformer, SET and MAT baselines against that of HIER, we can see that in each case HIER improves in terms of BLEU, Success and overall Score. HIER being better than SET and MAT implies that the UT-Mask or the CT-Mask alone is not sufficient; the full scheme of the HT-Encoder is necessary for the improvement. The exception is the SET model, which has the highest Inform score of 76.80. However, we observe that it is the combination of the BLEU and Inform scores that depicts the real quality of the responses. As BLEU measures the precision of n-grams and Inform measures the recall of task-related entities, we get a better performing model only when both metrics increase. This is reflected to some extent in the Entity-F1 score (the harmonic mean of entity recall and precision), but it too ignores tokens other than task-related entities. So SET having only a higher Inform score may mean that it is over-predicting some entities, leading to improved recall.
[7] https://github.com/budzianowski/multiwoz
[8] A system with two Tesla P100 GPUs was used for training.
In the Context-to-Response generation task with oracle policy (Table 3), our HIER++ and HIER-CLS models show very strong performance and beat the HDSA model (in terms of Inform and Success rates) and even the GPT-2 based baseline SimpleTOD (in terms of BLEU and Success rate). This shows that, without the intricacies of the baselines, just by applying a hierarchical encoder based model we are able to perform almost at the level of the state-of-the-art model. Compared to HIER, SimpleTOD utilizes GPT-2's pretraining, and DAMD uses attention over previous belief states and action sequences, whereas HIER's access to the oracle policy is only through the average embedding of its tokens.
Further, in Table 5, we compare the end-to-end generation performance of HIER-Joint with baseline models that can perform belief-state and/or dialog act generation. In terms of BLEU and combined score, HIER-Joint performs better than the baselines. With respect to Inform and Success, the model outperforms the DAMD baseline.
While the above experiments focus on establishing the base performance of the proposed response generation models (HIER, HIER++, HIER-CLS, and ablations), the HT-Encoder can be applied to any model that uses a standard transformer encoder. Hence, in a final experiment, we equip Marco with an HT-Encoder and rerun the context-to-response generation experiment. Introducing the HT-Encoder into Marco improves Inform (marginally), Success and the combined score. These results suggest that the HT-Encoder can also benefit existing architectures built on a standard transformer encoder.
Overall, our experiments show how useful the proposed HT-Encoder module can be for dialog systems built upon a transformer encoder-decoder architecture. It is also applicable to tasks where the input sequence can be split into an abstract set of subunits (e.g., the search history in the query suggestion application of Sordoni et al. (2015a)). We believe that our proposed approach for hierarchical encoding in transformers, together with the algorithm for converting the standard transformer encoder, makes it a valuable and accessible resource for future researchers working on dialog systems or similar problem statements with transformer-based architectures.

Related Works
Task Oriented Dialog Systems Researchers identify four different subtasks for any task-oriented dialog system (Wen et al., 2017): natural language understanding (NLU), dialog state tracking (DST), dialog act or policy generation, and natural language generation (NLG). Before the advent of large scale Seq2Seq models, researchers focused on building feature-rich models with rule-based pipelines for both natural language understanding and generation. These usually required separate utterance-level and dialog-level NLU feature extraction modules. The NLU features decide the next dialog act that the system should follow. This act is then converted into a natural language response using the NLG module. Young et al. (2013) modeled this problem as a Markov Decision Process whose state comprised various utterance and dialog features detected by an NLU module. However, such models had the usual drawback of any pipelined approach: error propagation. Wen et al. (2017) proposed using neural networks for extracting features like intent, belief states, etc. and training the NLU and NLG modules end-to-end using a single loss function. Marco (Wang et al., 2020) and HDSA (Chen et al., 2019) used a fine-tuned BERT model as their act predictor, as it often outperforms other ways of training the dialog policy network (even joint learning). HDSA is a transformer Seq2Seq model with act-controllable self-attention heads (in the decoder) to disentangle the individual tasks and domains within the network. Marco uses soft attention over the act sequence during the response generation process.

Hierarchical Encoders
The concept of Hierarchical Encoders has been used in many different contexts in the past. It is best known in the area of dialog response generation as the HRED model. Many open domain dialog systems have used the hierarchical recurrent encoding scheme of HRED for various tasks and architectures. The Hierarchical Encoder was first proposed by Sordoni et al. (2015a) for use in a query suggestion system. They used it to encode the user history comprising multiple queries using a Hierarchical LSTM network. Serban et al. (2016) extended this work to open domain dialog generation problems and proposed the HRED network. HRED captures the high level features of the conversation in a context RNN. Several models have adopted this approach later on, e.g., VHRED (Serban et al., 2017), CVAE (Zhao et al., 2017), DialogWAE (Gu et al., 2018), etc. Another area in which researchers have proposed the use of hierarchical encoders is the processing of paragraphs or long documents. Li et al. (2015) used a hierarchical LSTM network for training an autoencoder that can encode and decode long paragraphs and documents. HIBERT introduced hierarchy into the BERT architecture to remove the limitation on the length of the input sequence. HIBERT samples a single vector for each sentence or document segment (usually the contextual embedding of the CLS or EOS token) from the sentence encoder to be passed on to the higher level transformer encoder. Liu and Lapata (2019) apply a similar approach for encoding documents in a multi-document summarization task.

Conclusion
This paper explored the use of hierarchy in transformer-based models for task-oriented dialog systems. We started by proposing a generalized framework for Hierarchical Transformer Encoders (HT-Encoders). Using that, we implemented two models: a new model called HIER, and the HIER-CLS model, obtained by adapting the existing HIBERT architecture into our framework. We thoroughly experimented with these models in four different response generation tasks of the MultiWOZ dataset. We compared the proposed models with an extensive set of recent state-of-the-art models to analyze the effectiveness of HT-Encoders. We empirically show that the basic transformer seq2seq architecture, when equipped with an HT-Encoder, outperforms many of the state-of-the-art models in each experiment. We further demonstrate its usefulness by applying it to an existing model, Marco. This work opens up a new direction on hierarchical transformers in dialog systems where complex dependencies exist between the utterances. It would also be beneficial to explore the effectiveness of the proposed HT-Encoder when applied to various other tasks.