PyOpenDial: A Python-based Domain-Independent Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules

We present PyOpenDial, a Python-based domain-independent, open-source toolkit for spoken dialogue systems. Recent advances in core components of dialogue systems, such as speech recognition, language understanding, dialogue management, and language generation, harness deep learning to achieve state-of-the-art performance. The original OpenDial, implemented in Java, provides a plugin architecture to integrate external modules, but lacks Python bindings, making it difficult to interface with popular deep learning frameworks such as TensorFlow or PyTorch. To this end, we re-implemented OpenDial in Python and extended the toolkit with a number of novel functionalities for neural dialogue state tracking and action planning. We describe the overall architecture and its extensions, and illustrate their use on an example where the system response model is implemented with a recurrent neural network.


Introduction
Spoken dialogue systems (SDSs) allow interactions between users and machines through natural language conversations. These systems are composed of a broad range of components such as speech recognition, language understanding, dialogue management, language generation, and speech synthesis. Recent SDS frameworks, such as AT&T Statistical Dialogue Toolkit (Williams et al., 2010), OpenDial (Lison and Kennington, 2016), and PyDial (Ultes et al., 2017) aim to integrate these complex and diverse components through a modular architecture.
Spoken dialogue systems may adopt either symbolic or statistical approaches to perform language understanding, dialogue management and language generation. Statistical approaches, including but not limited to (Young et al., 2013; Ultes et al., 2017), rely on probabilistic models of dialogue interactions and allow these models to be estimated from data. Symbolic approaches, on the other hand, model the dialogue interaction using finite-state automata or logical methods designed by the developer. Of particular interest to us is OpenDial, a Java-based open-source toolkit that combines the benefits of both statistical and symbolic approaches.
* These authors contributed equally.
In recent years, deep learning has been shown to achieve promising results in many dialogue-related tasks, such as speech recognition (Graves et al., 2013), language understanding (Radford et al., 2018), dialogue management (Williams et al., 2017) and language generation. Given that Python is one of the most popular programming languages for deep learning libraries, e.g. TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017), integrating neural conversational models would be straightforward if OpenDial itself were written in Python.
To this end, we developed PyOpenDial, an open-source SDS framework that re-implements OpenDial in Python and integrates a range of novel functionalities, such as neural dialogue state tracking and the use of Monte Carlo Tree Search for forward planning. PyOpenDial inherits the original architectural design of OpenDial, and extends the XML domain specification so that various deep learning models can be used directly as SDS components. We include the negotiation dialogue domain of Lewis et al. (2017) as an example to show how to integrate external components trained with deep learning.

Dialogue State
PyOpenDial inherits the information-state framework in OpenDial (Larsson and Traum, 2000): all the components in the toolkit operate on a shared information state that represents the dialogue state arising from the interaction between the user and the machine (see Figure 1). The dialogue state is represented by a Bayesian network (BN), a directed graphical model that encodes the probability distribution of the dialogue state. Hence, the dialogue state consists of a factored representation of the state variables of a dialogue, which are random variables, and the conditional dependencies between them.
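As a rough illustration of such a factored representation, the sketch below encodes a two-variable dialogue state (a user intent and a noisy observation of the user's dialogue act) as conditional probability tables and computes a posterior by enumeration. The variable names and probabilities are invented for this example and are not part of the PyOpenDial API.

```python
# Illustrative only: a two-node Bayesian network, intent -> observed_act.
# Variable names and probabilities are assumptions made for this sketch.

prior_intent = {"book_movie": 0.5, "play_music": 0.5}

# P(observed_act | intent): the recogniser is assumed to be noisy.
p_act_given_intent = {
    "book_movie": {"ask_movie": 0.8, "ask_music": 0.2},
    "play_music": {"ask_movie": 0.1, "ask_music": 0.9},
}


def posterior_intent(observed_act):
    """Return P(intent | observed_act) by enumeration and normalisation."""
    joint = {
        intent: prior_intent[intent] * p_act_given_intent[intent][observed_act]
        for intent in prior_intent
    }
    total = sum(joint.values())
    return {intent: p / total for intent, p in joint.items()}


belief = posterior_intent("ask_movie")
```

After observing `ask_movie`, most of the probability mass moves to `book_movie`, while the alternative intent keeps a non-zero probability.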

Workflow
PyOpenDial adopts the probabilistic rules used in OpenDial (Lison, 2014; Lison and Kennington, 2016) to update the dialogue state, represented by a Bayesian network, during the conversation. These rules follow an if...then...else skeleton that maps logical conditions on a subset of state variables to a probability or utility distribution over another subset of state variables. Two types of rules are provided: probability rules and utility rules. A probability rule defines the probabilistic change of state variables through a probability distribution over effects, each of which is an assignment of values to state variables, given a logical condition on the state variables. A utility rule defines utilities for the values of action variables given a logical condition on the state variables. These probabilistic rules are specified in the domain XML file. Examples are shown in Listing 1, where lines 13-23 contain the probability rule and lines 27-38 contain the utility rule.
The rules are grouped into models, and each model is associated with a subset of state variables, called trigger variables. Each model monitors changes to its trigger variables. When one or more trigger variables are updated during a conversation, the probabilistic rules of the corresponding model are instantiated with the current dialogue state and applied to it. These updates may in turn trigger other models, so the procedure causes a chain of updates on the dialogue state through the probabilistic rules of the models. In summary, the workflow of PyOpenDial is a series of applications of probabilistic rules to the dialogue state.
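The chain of updates driven by trigger variables can be sketched as follows. This is a simplified, deterministic stand-in: the real engine operates on probability distributions in a Bayesian network, and the rule functions, variable names (`u_u`, `a_u`, `a_m`) and the `update` helper are assumptions made for illustration.

```python
from collections import deque

# Hypothetical sketch of trigger-driven updates; not the PyOpenDial engine.
# The state is a plain dict of values rather than a Bayesian network.


def nlu_rule(state):
    # if ... then ... else skeleton conditioned on the user utterance u_u
    if "movie" in state["u_u"]:
        return {"a_u": "RequestMovie"}
    return {"a_u": "Other"}


def policy_rule(state):
    # selects the system action a_m from the user dialogue act a_u
    if state["a_u"] == "RequestMovie":
        return {"a_m": "RecommendMovie"}
    return {"a_m": "AskRepeat"}


# Each model pairs a set of trigger variables with a rule.
MODELS = [
    ({"u_u"}, nlu_rule),
    ({"a_u"}, policy_rule),
]


def update(state, new_values):
    """Apply new_values, then fire every model whose trigger variable changed."""
    state = dict(state)
    pending = deque([new_values])
    while pending:
        changes = pending.popleft()
        state.update(changes)
        for triggers, rule in MODELS:
            if triggers.intersection(changes):
                pending.append(rule(state))
    return state


final_state = update({}, {"u_u": "any good movie tonight?"})
```

Setting `u_u` fires the first model, whose output `a_u` fires the second; the chain stops because no model is triggered by `a_m`.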
Since this workflow is essentially the same as in OpenDial, we refer readers to Lison (2014), Lison and Kennington (2016), and the OpenDial toolkit website (http://www.opendial-toolkit.net) for more details on the XML specification of the domain model.

Extensions to OpenDial
Custom Variable Types. Neural models such as recurrent neural networks (RNNs) have become a popular choice for various dialogue processing tasks, given their capability to be trained end-to-end and to infer complex latent representations of the dialogue state. In these models, the dialogue state is typically represented as a vector-valued prediction computed by a complex mapping from input to output. In contrast, the original OpenDial only supports updates on primitive data types (e.g. boolean, double, string, double array, and set) via human-readable probabilistic rules similar to decision trees. To overcome this limitation, we extend the specification of the dialogue state to include complex variable types and functional values, allowing arbitrarily complex mappings of conditional variables, e.g. latent vectors in neural models that encode the dialogue context. This is achieved by extending the domain XML specification to allow variable types expressed through complex functions that can be integrated into probabilistic rules. In Listing 1, two such custom variables are defined: movie_rnn and music_rnn, which are instances of MovieRNN and MusicRNN respectively (lines 2-7) and represent pre-trained RNN-based generation models. Line 10 binds gen_u_m to the function generate in the module chatbot, which executes the neural model. This function takes two arguments (the generation model and a user utterance) as input and returns the system utterance as output.
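To make the shape of such a function concrete, the following sketch shows what a module like chatbot might expose. The class body and the `respond` method are placeholders invented for this example; a real MovieRNN would wrap a trained network, and nothing here is taken from the PyOpenDial code base.

```python
# Hypothetical stand-in for a module such as `chatbot`.
# MovieRNN.respond is an assumed method name used only in this sketch.


class MovieRNN:
    """Placeholder for a pre-trained generation model."""

    def respond(self, utterance):
        # A real model would decode a reply from its hidden state.
        return "You asked: " + utterance


def generate(model, user_utterance):
    """Two-argument function of the kind a rule could call:
    (generation model, user utterance) -> system utterance."""
    return model.respond(user_utterance)


movie_rnn = MovieRNN()
u_m = generate(movie_rnn, "any thriller tonight?")
```

Because the function receives the model object as an ordinary argument, the same `generate` could serve several custom variables, for example one per domain.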
Predictive Models. Dialogue management is, at its core, a sequential decision-making problem, where the goal of the system is to select actions that fulfill the system objectives while minimising associated costs. One way to achieve this objective is through forward planning, i.e. enabling the system to search for actions that yield the maximum expected utility over a given horizon. Forward planning requires predictive models (such as user simulation models) that predict the consequences of the system actions on the current interaction. PyOpenDial provides a planning-only option for models, which causes a model to be triggered only while forward planning is performed. This allows the system to explicitly differentiate between observed and predicted values in the dialogue state. A specific use case of this feature is described in Section 4 with Listing 2 (line 30).
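The effect of the planning-only option can be pictured with a small filter over models. The `Model` dataclass, its `planning_only` field and the `active_models` helper are hypothetical names chosen for this sketch; they only mirror the behaviour described above.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: models flagged planning_only fire only in simulation.


@dataclass
class Model:
    name: str
    rule: Callable[[dict], dict]
    planning_only: bool = False


def active_models(models, planning):
    """Return the models allowed to fire in the current mode."""
    return [m for m in models if planning or not m.planning_only]


def respond(state):
    return {"u_m": "hello"}


def simulate_user(state):
    # Predicts a user reply; must never overwrite a real observation.
    return {"u_u": "simulated reply to: " + state["u_m"]}


models = [
    Model("system", respond),
    Model("user_sim", simulate_user, planning_only=True),
]

live = [m.name for m in active_models(models, planning=False)]
planned = [m.name for m in active_models(models, planning=True)]
```

During a live turn only the system model is eligible, so the simulated user utterance can never replace the one actually observed.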

Implementation
PyOpenDial is implemented in Python and is released under the MIT open-source license. The toolkit is available from the GitHub code repository at https://github.com/KAIST-AILab/PyOpenDial. The toolkit additionally provides a graphical user interface for fast prototyping and test-driving of the system. The graphical user interface, shown in Figure 2, displays the domain and dialogue history and takes the user's next input (text or speech).
PyOpenDial implements all the core components and modules of OpenDial, including the BN inference algorithm and the probabilistic rule engine. We also introduced a number of new modules, including the Monte-Carlo tree search (MCTS) (Kocsis and Szepesvári, 2006) planner and the basic speech-to-text and text-to-speech modules using Google Speech APIs.

MCTS Planner
In dialogue management, planning algorithms are often used to search for the system action that maximizes the sum of utilities in the long-term horizon so as to optimally react to the user.
The baseline planning algorithm in OpenDial is a lookahead forward planner that fully expands the search tree up to the planning horizon $H$:

$$Q_t(b, a) = U(b, a) + \gamma \sum_{o} P(o \mid b, a) \max_{a'} Q_{t+1}(b^{ao}, a')$$

where $b$ is the dialogue state, $a$ is the value of the action variables, $o$ is a possible observation when taking action $a$ in the state $b$, $P(o \mid b, a)$ is the probability of that observation, $b^{ao}$ is the dialogue state updated from $b$ after action $a$ and observation $o$, $\gamma \in [0, 1)$ is the discount factor, $U(b, a)$ is the instantaneous utility of $a$ at the dialogue state $b$, and $Q_H(b, a) = 0$ for all $b, a$. After computing $Q_0(b, a)$ from the recursive equation, the forward planner chooses the final value of the action variable given by $\arg\max_a Q_0(b, a)$. A major limitation of the forward planner is that the size of the search tree grows exponentially with the planning horizon and with the branching factor (i.e. the number of candidate actions and observations), so the search quickly becomes infeasible.
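The recursion described above can be sketched on a toy problem, assuming the standard finite-horizon backup with $Q_H = 0$. The callables `utility`, `observations` and `update`, and the toy "wait or grab" domain, are assumptions made for illustration rather than PyOpenDial functions.

```python
# Minimal sketch of full-width lookahead planning; not the PyOpenDial planner.


def forward_plan(b, horizon, gamma, actions, utility, observations, update):
    """Return argmax_a Q_0(b, a), expanding every action and observation."""

    def q(state, a, t):
        if t == horizon:
            return 0.0  # Q_H(b, a) = 0
        future = sum(
            prob * max(q(update(state, a, o), a2, t + 1) for a2 in actions)
            for o, prob in observations(state, a)
        )
        return utility(state, a) + gamma * future

    return max(actions, key=lambda a: q(b, a, 0))


# Toy domain: grabbing immediately pays 1, grabbing after waiting pays 5.
# The state counts the turns waited; None marks a finished dialogue.
def utility(b, a):
    if b is None or a == "wait":
        return 0.0
    return 1.0 if b == 0 else 5.0


def observations(b, a):
    return [("ok", 1.0)]  # a single, certain observation


def update(b, a, o):
    return None if b is None or a == "grab" else b + 1


ACTIONS = ["wait", "grab"]
myopic_choice = forward_plan(0, 1, 0.9, ACTIONS, utility, observations, update)
lookahead_choice = forward_plan(0, 2, 0.9, ACTIONS, utility, observations, update)
```

With a horizon of 1 the planner grabs at once, while a horizon of 2 reveals that waiting is worth more. The nested loops over actions and observations at every level are what make the cost exponential in the horizon.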
On the other hand, MCTS combines tree search with Monte-Carlo simulation so that the search effort is invested non-uniformly, favouring promising nodes. One of the most basic MCTS algorithms is UCT (Kocsis and Szepesvári, 2006), which performs iterative simulation on the search tree by following the UCB rule to select actions at intermediate nodes:

$$a^{*} = \arg\max_{a} \left[ Q(b, a) + c \sqrt{\frac{\log N(b)}{N(b, a)}} \right]$$

where $c$ is the exploration constant that balances the trade-off between exploration and exploitation, $Q(b, a)$ is the average of the sampled sums of utilities, $N(b)$ is the number of simulations performed through the dialogue state $b$, and $N(b, a)$ is the number of times action $a$ has been selected in $b$. More recent versions of MCTS algorithms have shown great success in many large sequential decision-making problems such as playing Go (Silver et al., 2016).
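A minimal sketch of the UCB action selection used by UCT is given below, assuming the usual form of the rule (average value plus an exploration bonus of $c\sqrt{\log N(b) / N(b, a)}$). The function name and the dictionary-based statistics are choices made for this example; trying unvisited actions first is a common convention, not something stated in this paper.

```python
import math


def ucb_select(actions, q, n_b, n_ba, c):
    """Pick the action maximising Q(b, a) + c * sqrt(log N(b) / N(b, a)).

    q maps actions to average sampled returns, n_b is the visit count of
    the node, and n_ba maps actions to their selection counts.
    """
    # Untried actions are selected first so that N(b, a) is never zero.
    for a in actions:
        if n_ba.get(a, 0) == 0:
            return a
    return max(
        actions,
        key=lambda a: q[a] + c * math.sqrt(math.log(n_b) / n_ba[a]),
    )


stats_q = {"offer": 1.0, "ask": 0.9}
stats_n = {"offer": 90, "ask": 10}
explore = ucb_select(["offer", "ask"], stats_q, 100, stats_n, c=1.0)
exploit = ucb_select(["offer", "ask"], stats_q, 100, stats_n, c=0.0)
```

With `c=1.0` the rarely tried action receives a large enough bonus to be selected, whereas `c=0.0` reduces the rule to greedy selection on the average value.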
PyOpenDial includes an MCTS planner, and we shall demonstrate its effectiveness in the next section by comparing its performance with the forward planner, using the negotiation dialogue domain that requires long-term planning.
In this domain, two agents (i.e. the user and the system) negotiate over 3 types of items. The negotiation domain has the following characteristics: (1) the simulated utterances of the system and the user are generated by the RNNs of the system and user models, which is done seamlessly thanks to the Python-based implementation of the framework; (2) long-term dialogue planning is required to obtain a high reward in the negotiation, since the utility signal is given only at the very end of a potentially very long dialogue, and thus an MCTS planner is desirable.

Domain Description
In the negotiation dialogue domain, 3 types of items (i.e. books, hats, balls) are divided between two agents through natural language dialogue. There is a finite amount of each item (5 to 7 total items and 1 to 4 individual items), and the agents have different utility functions that represent the agent's preference. The utility function for each agent is defined randomly while satisfying the following constraints: (1) The maximally achievable utility for each agent should be 10; (2) Each item must always have a non-zero utility for at least one agent; (3) At least one item must always have a non-zero utility for both agents. If an agreement is reached at the end of the negotiation, each agent receives a reward equal to the total utility of obtained items. If the decisions are in conflict, both agents receive a reward of 0. Figure 2 shows the negotiation dialogue example between user and system in PyOpenDial, and Listing 2 presents a simplified version of the domain XML specification.
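The constraints and the reward described above can be checked with a few lines of code. The item counts and utility values below are an invented instance that happens to satisfy the constraints, and the function names are hypothetical.

```python
# Illustrative check of the negotiation constraints and reward.
# The concrete counts and values are assumptions made for this sketch.


def max_utility(values, counts):
    """Utility an agent would obtain if it received every item."""
    return sum(values[item] * counts[item] for item in counts)


def satisfies_constraints(counts, values_a, values_b):
    both_reach_ten = (
        max_utility(values_a, counts) == 10
        and max_utility(values_b, counts) == 10
    )
    each_item_wanted = all(values_a[i] > 0 or values_b[i] > 0 for i in counts)
    some_item_shared = any(values_a[i] > 0 and values_b[i] > 0 for i in counts)
    return both_reach_ten and each_item_wanted and some_item_shared


def negotiation_reward(values, allocation, agreed):
    """Total utility of the obtained items, or 0 if the decisions conflict."""
    if not agreed:
        return 0
    return sum(values[item] * n for item, n in allocation.items())


counts = {"books": 1, "hats": 4, "balls": 1}
values_system = {"books": 2, "hats": 1, "balls": 4}
values_user = {"books": 6, "hats": 1, "balls": 0}

valid = satisfies_constraints(counts, values_system, values_user)
system_share = {"books": 0, "hats": 2, "balls": 1}
reward = negotiation_reward(values_system, system_share, agreed=True)
```

In this instance the system values balls highly and the user values books highly, so an agreement in which each side gives up the item it cares less about leaves both with a positive reward.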

RNN-based Natural Language Generation Model
To implement this dialogue system, we first pre-trained an RNN model that imitates negotiation dialogues between two humans, following the supervised training scheme described in (Lewis et al., 2017). This RNN model generates natural language utterances, taking into account the previous dialogue history and the given context (i.e. the value and count of each item). We use this RNN to generate candidate system utterances (lines 19-25 in Listing 2) and to generate user utterances for the user simulation model that is used during multi-horizon planning (line 37 in Listing 2). This RNN model uses PyTorch and is imported into PyOpenDial as described next.

Domain XML Specification
In this section, we briefly explain how the negotiation domain is specified in the XML format shown in Listing 2, which is an abbreviation of the full version distributed with PyOpenDial.
Declaration. We declare the rnn state variable, an instance of the NegotiationRNN class (lines 2-4). The class holds the pre-trained RNN model described in the previous section as a member variable and generates the actual (user or system) utterance through that model. To represent the dialogue history, we also declare a state variable h (line 5), which maintains the user and system utterances up to the current turn. We then declare two functions, gen_u_m and gen_u_u, to generate utterances (lines 9-10). gen_u_m is used to generate the set of candidate system utterances, and gen_u_u is used as the user simulator in the planner to search for the best system action that maximizes the overall utility within the planning horizon. Finally, we declare the function reward, which returns the reward at the end of the negotiation (line 11).
System Utterance Generation. The utility rule specifies the utility associated with each candidate system action. In order to harness the system utterances generated directly by NegotiationRNN and to support dialogue planning, we add 20 effects to the utility rule, each corresponding to a system utterance sampled from NegotiationRNN, and assign each the same immediate utility of 0.001 (lines 13-28). The actual, final utility is decided only when the negotiation has finished, and the planner described next searches for the best system utterance (among the 20 candidates) using long-term planning.
Planning. The planner requires a user simulation model for long-term planning. The user simulation model is given in lines 30-41. Note that we set the planning-only tag for the user simulation model (line 30) in order to prevent it from overwriting the actual user utterance u_u during planning. At the end of a simulated dialogue, the final utility determined by the negotiation is obtained from the Python function reward. The variable current_step is set to "Terminated" to represent the end of a dialogue.

Experiments
Using the negotiation dialogue domain, we compare the performance of two planning algorithms, the forward planner and the MCTS planner, against a naive baseline that only maximizes the immediate utility without planning. The planning horizons were set to 3 for the forward planner and 7 for the MCTS planner, so that both planners take approximately the same amount of search time.
As reported in Table 1, planning (using either Forward or MCTS) improves the negotiation outcome over the baseline in terms of both reward and agreement rate, and the MCTS planner further outperforms the forward planner. This is mainly because the reward signal in the negotiation domain comes only at the very end of the dialogue; in the early stages of the dialogue, no meaningful reward signal can be obtained within the short planning horizon of the forward planner.
In contrast, MCTS performs Monte-Carlo simulations all the way towards the end of the dialogue in most cases and thus captures the final utility.

Conclusion
In this paper, we presented PyOpenDial, a Python-based open-source dialogue system toolkit that inherits the architectural design of OpenDial and extends the domain XML specification for integrating deep learning models.