Deep Learning for Dialogue Systems

In the past decade, goal-oriented spoken dialogue systems have been the most prominent component in today’s virtual personal assistants. Classic dialogue systems have rather complex and/or modular pipelines. Advances in deep learning have recently given rise to applications of neural models to dialogue modeling. However, successfully applying deep learning based approaches to a dialogue system remains challenging. Hence, this tutorial gives an overview of dialogue system development, describes the most recent research on building dialogue systems, and summarizes the open challenges, so that researchers can study potential improvements to state-of-the-art dialogue systems. The tutorial material is available at http://deepdialogue.miulab.tw.


Tutorial Overview
With the rising trend of artificial intelligence, more and more devices have incorporated goal-oriented spoken dialogue systems. Among popular virtual personal assistants, Microsoft's Cortana, Apple's Siri, Amazon Alexa, and Google Assistant have incorporated dialogue system modules in various devices, which allow users to speak naturally in order to finish tasks more efficiently.
Traditional conversational systems have rather complex and/or modular pipelines. Advances in deep learning have recently given rise to applications of neural models to dialogue modeling. Nevertheless, applying deep learning to build robust and scalable dialogue systems is still a challenging task and an open research area, as it requires a deep understanding of the classic pipelines as well as detailed knowledge of how the models in prior work and in recent state-of-the-art work perform on benchmarks.
The goal of this tutorial is to present the developing trends in dialogue systems and to give the audience a roadmap for getting started with the related work. The first section motivates the work on conversation-based intelligent agents, in which the core underlying system is a task-oriented dialogue system. The following section describes different approaches using deep learning for each component in the dialogue system and how each is evaluated. The last two sections discuss the recent trends and current challenges in dialogue system technology and present the conclusions. The detailed content is described as follows.

Dialogue System Basics
This section will motivate the work on conversation-based intelligent agents, in which the core underlying system is a task-oriented spoken dialogue system.
The section starts with an overview of the standard pipeline framework for dialogue systems illustrated in Figure 1 (Tur and De Mori, 2011). The basic components of a dialogue system are automatic speech recognition (ASR), language understanding (LU), dialogue management (DM), and natural language generation (NLG) (Rudnicky et al., 1999). This tutorial will mainly focus on the LU, DM, and NLG parts.
Language Understanding Traditionally, domain identification and intent prediction are framed as utterance classification problems, where several classifiers such as support vector machines and maximum entropy have been employed (Haffner et al., 2003; Chelba et al., 2003). Slot filling is then framed as a word sequence tagging task, where the IOB (in-out-begin) format is applied for representing slot tags, and hidden Markov models (HMM) or conditional random fields (CRF) have been employed for slot tagging (Pieraccini et al., 1992; Wang et al., 2005; Raymond and Riccardi, 2007).
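To make the IOB format concrete, the following minimal sketch converts slot annotations into IOB tags. The utterance, the slot names (`from_city`, `to_city`), and the helper `to_iob` are illustrative assumptions and are not taken from any of the cited systems.

```python
from typing import List, Tuple


def to_iob(tokens: List[str], slots: List[Tuple[int, int, str]]) -> List[str]:
    """Convert slot spans (start, end_exclusive, slot_name) into IOB tags."""
    tags = ["O"] * len(tokens)          # every token starts outside any slot
    for start, end, name in slots:
        tags[start] = f"B-{name}"       # first token of the slot
        for i in range(start + 1, end):
            tags[i] = f"I-{name}"       # remaining tokens inside the slot
    return tags


# Hypothetical flight-booking utterance with two annotated slots.
tokens = "show flights from new york to boston".split()
slots = [(3, 5, "from_city"), (6, 7, "to_city")]
tags = to_iob(tokens, slots)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence tagger such as an HMM or CRF would then be trained to predict one such tag per word.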
Dialogue Management A partially observable Markov decision process (POMDP) has been shown to be beneficial by allowing the dialogue manager to be optimized to plan and act under the uncertainty created by noisy speech recognition and semantic decoding (Williams and Young, 2007). The POMDP policy controlling the actions taken by the system is trained in an episodic reinforcement learning (RL) framework whereby the agent receives a reinforcement signal after each dialogue (episode) reflecting how well it performed (Sutton and Barto, 1998). In addition, the dialogue states should be tracked in order to measure the belief of the current situation during the whole interaction (Sun et al., 2014).
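The idea of maintaining a belief under noisy input can be sketched as a single Bayesian update over candidate user goals. This is a deliberately simplified illustration, not the method of the cited work: it assumes the user goal does not change between turns, and the observation model, the goal names, and the function `update_belief` are all hypothetical.

```python
from typing import Dict


def update_belief(belief: Dict[str, float], observed: str, confidence: float) -> Dict[str, float]:
    """Bayesian update of a belief over user goals given one noisy observation.

    Assumes the recognizer reports the true goal with probability
    `confidence` and spreads the remaining mass uniformly over the others.
    """
    n_other = len(belief) - 1
    unnormalized = {}
    for goal, prior in belief.items():
        likelihood = confidence if goal == observed else (1.0 - confidence) / n_other
        unnormalized[goal] = likelihood * prior
    total = sum(unnormalized.values())
    return {goal: p / total for goal, p in unnormalized.items()}


# Uniform prior over three hypothetical food types, then two noisy observations.
belief = {"italian": 1 / 3, "chinese": 1 / 3, "thai": 1 / 3}
belief = update_belief(belief, observed="italian", confidence=0.6)
belief = update_belief(belief, observed="italian", confidence=0.6)
print(belief)
```

Repeated consistent observations concentrate the belief on one goal even when each individual observation is unreliable, which is what lets the policy act under uncertainty.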
Natural Language Generation There are two NLG approaches: one generates text using templates or rule-based (linguistic) methods, and the other uses corpus-based statistical techniques (Oh and Rudnicky, 2002). Oh and Rudnicky showed that stochastic generation benefits from two factors: 1) it takes advantage of the practical language of a domain expert instead of the developer, and 2) it restates the problem in terms of classification and labeling, which does not require the expertise needed to develop a rule-based generation system.
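The template-based approach can be sketched in a few lines. The dialogue acts, templates, and slot values below are hypothetical examples written for illustration; a real system would need many more templates, which is the manual effort the corpus-based approach tries to avoid.

```python
from typing import Dict

# Hypothetical hand-written templates keyed by dialogue act.
TEMPLATES = {
    "inform": "{name} is a {food} restaurant in the {area} part of town.",
    "request": "What {slot} are you looking for?",
    "confirm": "You are looking for a {food} restaurant, right?",
}


def generate(act: str, slots: Dict[str, str]) -> str:
    """Fill the template for a dialogue act with the given slot values."""
    return TEMPLATES[act].format(**slots)


response = generate("inform", {"name": "Seven Days", "food": "Chinese", "area": "north"})
print(response)
```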

Deep Learning Based Dialogue System
With the power of deep learning, a growing body of research applies deep learning to each component.
Language Understanding With the advances in deep learning, neural models have been applied to domain and intent classification tasks (Sarikaya et al., 2011; Tur et al., 2012; Sarikaya et al., 2014). Ravuri and Stolcke (2015) first proposed an RNN architecture for intent determination. For slot filling, deep learning has been viewed as a feature generator, and the neural architecture can be merged with CRFs (Xu and Sarikaya, 2013). Yao et al. (2013) and Mesnil et al. (2015) later employed RNNs for sequence labeling in order to perform slot filling. Such architectures have later been extended to jointly model intent detection and slot filling in multiple domains (Hakkani-Tür et al., 2016; Jaech et al., 2016). Recently, Zhai et al. (2017) proposed to tag the semantic labels together with segmentation and achieved state-of-the-art performance.
In addition, how to leverage contextual information and prior linguistic knowledge to achieve better understanding is an important issue. End-to-end memory networks have been shown to provide a good mechanism for integrating longer term knowledge context and shorter term dialogue context into these models (Chen et al., 2016b; Chen et al., 2016c). In addition, the importance of the LU module is investigated in Li et al. (2017a), where different types of errors from LU may degrade the whole system performance in a reinforcement learning setting.
Dialogue Management - Dialogue State Tracking State-of-the-art dialogue managers monitor the dialogue progress with neural dialogue state trackers. Among the initial models are the RNN-based dialogue state tracking approaches (Henderson et al., 2013), which have been shown to outperform Bayesian networks. More recent work that provided conjoint representations between the utterances, slot-value pairs, and knowledge graph representations (Mrkšić et al., 2016) demonstrated that neural dialogue models can overcome current obstacles to deploying dialogue systems in larger dialogue domains. Rastogi et al. (2017) also proposed a multi-domain dialogue state tracker to achieve effective and efficient domain adaptation.
Dialogue Management - Dialogue Policy Optimization The dialogue policy can be learned in either a supervised or a reinforcement learning manner. Reinforcement learning based dialogue agents have recently been developed for different tasks and shown to be applicable to interactive scenarios (Li et al., 2017b; Dhingra et al., 2017; Shah et al., 2016). In order to enable reinforcement learning, a simulated environment is required. Several approaches have been proposed for building user simulators as the interactive environment (El Asri et al., 2016; Crook and Marin, 2017), so that the dialogue policy can be trained in a reinforcement learning framework.
Natural Language Generation RNN-based models have been applied to language generation for both chit-chat and task-oriented dialogue systems (Vinyals and Le, 2015; Wen et al., 2015b). RNN-based NLG can learn from unaligned data by jointly optimizing sentence planning and surface realization, and language variation can be easily achieved by sampling from output candidates (Wen et al., 2015a). Moreover, Wen et al. (2015b) improved the prior work by adding a gating mechanism for controlling the dialogue act during generation in order to avoid semantic repetition, showing promising results. Several further improvements have been achieved using contextual and structured information (Dušek and Jurcicek, 2016; Nayak et al., 2017; Su et al., 2018).
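The interplay between a user simulator and a reinforcement learning policy can be illustrated with a toy example. The sketch below uses tabular Q-learning, which is a stand-in chosen for brevity rather than the neural methods of the cited work; the two-slot booking task, the reward values, the 0.8 answer probability, and the class `UserSimulator` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ["request_date", "request_city", "book"]


class UserSimulator:
    """Toy simulated user: a request fills its slot with probability 0.8."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.filled = (False, False)

    def reset(self):
        self.filled = (False, False)     # (date known, city known)
        return self.filled

    def step(self, action: str):
        """Return (next_state, reward, done) for a system action."""
        if action == "book":
            # Booking ends the dialogue; it succeeds only if both slots are filled.
            return self.filled, (10.0 if all(self.filled) else -10.0), True
        idx = 0 if action == "request_date" else 1
        if self.rng.random() < 0.8:      # noisy channel: the answer may be lost
            filled = list(self.filled)
            filled[idx] = True
            self.filled = tuple(filled)
        return self.filled, -1.0, False  # per-turn cost favors short dialogues


def train(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    user = UserSimulator(rng)
    q = defaultdict(float)               # Q-values keyed by (state, action)
    for _ in range(episodes):
        state, done = user.reset(), False
        while not done:
            if rng.random() < epsilon:   # epsilon-greedy exploration
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = user.step(action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = train()
states = [(False, False), (True, False), (False, True), (True, True)]
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in states}
print(policy)
```

Because the simulator supplies unlimited dialogues, the agent can explore actions such as premature booking that would be costly to try with real users.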

Recent Trends and Challenges on Learning Dialogues
This part will focus on discussing the recent trends and current challenges on dialogue system technology.
End-to-End Learning for Dialogue Systems With the power of neural networks, there are more and more attempts at learning dialogue systems in an end-to-end fashion. Different learning frameworks are applied, including supervised learning and reinforcement learning. This part will discuss the work on end-to-end learning for dialogues (Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Li et al., 2017b).
Recent advances in deep learning have inspired many applications of neural models to dialogue systems. Wen et al. (2016) and Bordes and Weston (2016) introduced a network-based end-to-end trainable task-oriented dialogue system, which treated dialogue system learning as the problem of learning a mapping from dialogue histories to system responses, and applied an encoder-decoder model to train the whole system. However, the system is trained in a supervised fashion, and thus requires a lot of training data and may not be able to explore the unknown space that does not exist in the training data for an optimal and robust policy. Zhao and Eskenazi (2016) first presented an end-to-end reinforcement learning (RL) approach to dialogue state tracking and policy learning in the DM. This approach was shown to be promising when applied to a task-oriented system whose goal is to guess the famous person a user is thinking of. In the conversation, the agent asks the user a series of Yes/No questions to find the correct answer. Dhingra et al. (2017) proposed an end-to-end differentiable KB-Infobot to improve the flexibility of question types and robustness. Li et al. (2017b) further presented an end-to-end neural dialogue system for completing tasks, which supported flexible question types, allowed user-initiated requests during conversation, and achieved better robustness. Human feedback has also been effectively leveraged in the learning framework for on-line training in an end-to-end manner (Liu et al., 2018).
Dialogue Breadth In order to extend the coverage of the systems, transfer learning has been applied to extend systems toward multi-domain scenarios. Chen et al. (2016a) transferred the dialogue acts across different domains so that the performance of a newly developed domain can be boosted. Kim et al. (2016) proposed to learn domain-specific and domain-independent information in order to transfer the shared knowledge more efficiently and effectively. In addition, Gašić et al. (2015) presented the policy committee in order to boost the performance of policy training in a new domain. All of the above work extends dialogue coverage in different directions.

Dialogue Depth Most current systems focus on knowledge-based understanding, but there are hierarchical levels of understanding that depend on the dialogue complexity. For example, an intent about party scheduling may include restaurant reserving and invitation sending. Sun et al. (2016) learned high-level intentions that span multiple domains in order to achieve common sense understanding. Moreover, a more complex dialogue such as "I feel sad..." requires empathy in order to generate a suitable response. Fung et al. (2016) first attempted to leverage different modalities for emotion detection and built an emotion-aware dialogue system.
Given these two branches of development, the ultimate goal is to build an open-domain dialogue system (coverage) with all levels of understanding (depth).

Tutorial Instructors
Yun-Nung (Vivian) Chen is an assistant professor in the Department of Computer Science and Information Engineering at National Taiwan University. Her research interests focus on spoken dialogue systems, language understanding, natural language processing, deep learning, and multimodality.
• Affiliation: National Taiwan University, Taipei, Taiwan
(ICSI, 2006-2010) and AT&T Labs-Research (2001-2005). She received her BSc degree from Middle East Technical Univ. in 1994, and MSc and PhD degrees from Bilkent Univ., Department of Computer Engineering, in 1996 and 2000, respectively. Her research interests include natural language and speech processing, spoken dialogue systems, and machine learning for language processing. She has been granted over 50 patents and has co-authored more than 200 papers in natural language and speech processing. She is the recipient of three best paper awards for her work on active learning for dialogue systems, from IEEE Signal Processing Society, ISCA and EURASIP. She was an associate editor of IEEE Transactions on Audio, Speech and Language Processing (2005-2008)