MACA: A Modular Architecture for Conversational Agents

We propose a software architecture designed to ease the implementation of dialogue systems. The Modular Architecture for Conversational Agents (MACA) uses a plug-n-play style that allows quick prototyping, thereby facilitating the development of new techniques and the reproduction of previous work. The architecture separates the domain of the conversation from the agent’s dialogue strategy, and as such can be easily extended to multiple domains. MACA provides tools to host dialogue agents on Amazon Mechanical Turk (mTurk) for data collection and allows processing of other sources of training data. The current version of the framework already incorporates several domains and existing dialogue strategies from the recent literature.


Introduction
Recent research in building sophisticated AI-based dialogue management systems has led to many new models supporting goal-oriented or chit-chat style dialogue agents. These models have been applied to a variety of consumer domains, such as restaurant booking (Kim and Banchs, 2014) and flight booking (Young, 2006). However, the lack of tools for easy prototyping remains an impediment to developing new models and properly benchmarking them against previous ones. Furthermore, the different types of conversational agents, e.g., generative (Hochreiter and Schmidhuber, 1997; Serban et al., 2015, 2016), retrieval-based (Schatzmann et al.; Lowe et al., 2015a), slot-based (Young, 2006), or POMDP agents (Png and Pineau, 2011), have different working mechanisms, which pose challenges to the development of a unified platform for conversational agents with multi-domain support.
To address this gap, we propose a new, ready-to-use, cross-platform framework for text-based conversational agents, MACA (Modular Architecture for Conversational Agents), that supports plug-n-play use of several existing dialogue agents and facilitates easy prototyping of new ones. The architecture simplifies both the specification of different types of dialogue agents and the plugging in of already-built agents. The framework also maintains a clear separation between domain knowledge and the dialogue agent, which improves the reusability of both. MACA separates task definition from task selection and thereby supports multi-task agents that can extend to multiple turns.
The key characteristics of the MACA framework include:
• strong separation between domain knowledge and a dialogue agent
• a unified architecture to support goal-oriented, POMDP, generative, and retrieval-based dialogue agents
• easy plug-n-play of custom-built agents
• multi-task support for domain specification
• reusability of slots across different tasks
• tool to collect data from mTurk with ease
• template to construct dialogue agents within the framework
• independence from dialogue agents' implementation libraries
• open source code ready for public sharing

Related Work
A few frameworks proposed in recent years support easy prototyping of dialogue agents.
Ravenclaw (Bohus and Rudnicky, 2003), proposed as a successor to Agenda (Allen et al., 2001), is a two-tiered dialogue architecture supporting rapid development of dialogue agents. This flexible architecture provides a clear separation between the domain knowledge and dialogue agent, and maintains a hierarchical task structure. Systems can be built on the architecture with the hierarchical task layout but adding a new task requires the hierarchy to be rebuilt, which impedes application to new domains.
A hierarchical architecture similar to Ravenclaw, called Task Completion Platform (TCP) (Crook et al., 2016), addresses domain knowledge extensibility with minimal changes to a configuration file. In addition, it allows goal-oriented tasks to be defined easily using a TaskForm language to maintain slot information. Although TCP facilitates extension of slot-based agents to multiple domains, it cannot be extended to other dialogue agent types, viz. generative models and retrieval models.
Another notable architecture is ClippyScript (Seide and McDirmid, 2012), but its task definition is tied to a task condition by rule. Rules are therefore constrained to be explicitly defined on a per task basis. This is significantly more restrictive than our proposed architecture.
While much research focuses on proposing different architectures for dialogue models, there has also been some progress in proposing efficient protocols for agent-agent interaction, such as DialPort (Zhao et al., 2016), which provides tools for enabling multi-modal interaction between agents. Our proposed work differs from this line of research: it focuses on a unifying architecture for dialogue agents rather than on inter-agent communication.

Architecture Description
An overview of the Modular Architecture for Conversational Agents (MACA) is presented in Figure 1. The system is set up as a pipeline with six major components: Input, Pre-processing, Dialogue Model, Post-processing, Output, and Listeners. Each component contains independent sub-components that interact across it. All components within the architecture abstract away their underlying implementations, which makes them straightforward to extend. This supports block-wise design of new systems: the original functionality is preserved, while each component can be freely customized.
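The pipeline described above can be illustrated with a minimal Python sketch. It is not MACA's actual code: the names `Component`, `Pipeline`, `LowercasePreprocessor`, `EchoModel`, and `CapitalizePostprocessor` are hypothetical, and the Input and Output components are reduced to a list and a callable for brevity.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class Component(ABC):
    """Hypothetical base class: a pipeline stage maps one value to another."""

    @abstractmethod
    def process(self, data):
        ...


class LowercasePreprocessor(Component):
    def process(self, data: str) -> str:
        return data.lower()


class EchoModel(Component):
    def process(self, data: str) -> str:
        return data  # stand-in for a real dialogue model


class CapitalizePostprocessor(Component):
    def process(self, data: str) -> str:
        return data.capitalize()


class Pipeline:
    """Runs Input -> Pre-processing -> Dialogue Model -> Post-processing -> Output."""

    def __init__(self, source, stages, sink, listeners=()):
        self.source = source            # Input: any iterable of utterances
        self.stages = list(stages)      # ordered processing components
        self.sink = sink                # Output: callable receiving each response
        self.listeners = list(listeners)  # passive observers

    def run(self) -> None:
        for utterance in self.source:
            result = utterance
            for stage in self.stages:
                result = stage.process(result)
            self.sink(result)
            # Listeners only observe; they never alter the data flow.
            for listener in self.listeners:
                listener(utterance, result)


outputs: List[str] = []
log: List[Tuple[str, str]] = []
pipeline = Pipeline(
    source=["HELLO there", "How ARE you?"],
    stages=[LowercasePreprocessor(), EchoModel(), CapitalizePostprocessor()],
    sink=outputs.append,
    listeners=[lambda inp, out: log.append((inp, out))],
)
pipeline.run()
print(outputs)
```

Because every stage exposes the same `process` interface in this sketch, replacing one stage does not require changes to the others, which is the property the architecture aims for.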

Domain Knowledge
Domain knowledge contains static background information about the conversation topic. This can take the form of training data (e.g. transcribed conversations), constants, dictionaries, or restrictions on produced responses (e.g. sentence length, banned phrases). Data stored in domain knowledge must be independent of the model implementation, and can be shared between different models and components.

Input
The Input module provides or generates input utterances (i.e. statements, sentences) to the conversation pipeline. This component represents an abstract input device whose source of context varies depending on the use case. This could include a database of previously collected conversations, a terminal interface (i.e. stdin) to acquire data in real time, or a web interface to a data source (e.g. mTurk).

Preprocessing
The Pre-processing module serves as a bridge between the raw data acquired via the Input component and the input format required by the Dialogue Model module. The system architect may choose to include one or several pre-processing operations within this module. By default, these operations are performed in parallel and their results are fed into the next component as an array, which allows the dialogue model to have multiple input representations. Alternatively, the framework allows the operations to be applied sequentially in a specified order (e.g. spelling correction followed by stemming). Pre-processing operations currently implemented in MACA include POS tagging, stop-word removal, sentence tokenization (Loper and Bird, 2002), and Byte-Pair Encoding (BPE) (Gage, 1994); the module can be extended to accommodate, for example, a trained sentence2vec model (Le and Mikolov, 2014) or a trained word2vec model (Mikolov et al., 2013). These operations can also interact with the Domain Knowledge component to acquire the domain-specific information they require.
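The difference between the two composition schemes can be sketched as follows. The function names (`run_parallel`, `run_sequential`) and the toy operations are assumptions made for illustration and are not taken from MACA; "parallel" here means that each operation sees the same raw input independently, not that the operations run concurrently.

```python
from typing import Callable, List, Sequence

STOP_WORDS = {"the", "on"}  # toy stop-word list, assumed for the example


def lowercase(text: str) -> str:
    return text.lower()


def tokenize(text: str) -> List[str]:
    return text.split()


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [token for token in tokens if token not in STOP_WORDS]


def run_parallel(operations: Sequence[Callable], raw):
    """Apply every operation to the same raw input; return one result per operation."""
    return [operation(raw) for operation in operations]


def run_sequential(operations: Sequence[Callable], raw):
    """Feed the output of each operation into the next, in the given order."""
    result = raw
    for operation in operations:
        result = operation(result)
    return result


# Parallel: the model receives an array of independent representations.
print(run_parallel([lowercase, tokenize], "The Cat"))

# Sequential: each step builds on the previous one.
print(run_sequential([lowercase, tokenize, remove_stop_words], "The cat sat on the mat"))
```

In the sequential case the order matters: here `remove_stop_words` expects tokens, so it has to follow `tokenize`.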

Dialogue Model
This module is the core of the architecture, and contains implementations of agents capable of producing dialogue acts in response to the preprocessed Input information. This module can have up to three sub-components: Model Specific Pre-processing, Model Internals and Model Specific Post-processing, to accommodate dialogue agent models with various interface requirements.
The Model Internals sub-module contains the central dialogue model, which may be an existing model, such as a POMDP (Png and Pineau, 2011), Dual Encoder (Lowe et al., 2015a), or HRED agent (Serban et al., 2015), or a newly designed model. This sub-module receives inputs from the Model Specific Pre-processing sub-module. The space of possible responses, vocabulary, or dialogue acts is stored in the Domain Knowledge module. The Model Internals and Model Specific Pre/Post-processing sub-modules share the model information. Similar to the Pre-processing component, they can access any information required for their operations by querying the Domain Knowledge component. A specific illustration of this interaction is in goal-oriented dialogue agents, where the slot information (askQueries and other attributes of the slot, and the slot objects themselves) is maintained in the domain knowledge, which enables the framework to support multiple agents. In such settings, the Dialogue Model is initialized with a generic agent that tries to gauge the user intent, and then queries the domain knowledge for the appropriate slots.
The Model Specific Pre-processing and Post-processing sub-components allow the architect to design processing steps that are fine-tuned to a particular model. The Model Specific Pre-processing sub-component transforms pre-processed input(s) into representations compatible with the model internals (e.g. an array of word indices into a vector, matrix, or lookup table). Conversely, the Model Specific Post-processing sub-component transforms model outputs into more comprehensible forms for the next independent component in the system (e.g. a matrix/vector representation into an array of words/sentences).
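As a rough illustration of these two sub-components, the sketch below converts tokens into word indices before the model and converts indices back into text afterwards. The `Vocabulary` class, the function names, and the placeholder model (which merely reverses its input) are all hypothetical and are not part of MACA.

```python
from typing import Dict, Iterable, List

UNKNOWN = "<unk>"


class Vocabulary:
    """Hypothetical word/index lookup table; index 0 is reserved for unknown words."""

    def __init__(self, words: Iterable[str]):
        self.index_of: Dict[str, int] = {UNKNOWN: 0}
        for word in words:
            self.index_of.setdefault(word, len(self.index_of))
        self.word_of: Dict[int, str] = {i: w for w, i in self.index_of.items()}

    def encode(self, tokens: List[str]) -> List[int]:
        return [self.index_of.get(token, 0) for token in tokens]

    def decode(self, indices: List[int]) -> List[str]:
        return [self.word_of.get(index, UNKNOWN) for index in indices]


def model_specific_preprocess(tokens: List[str], vocab: Vocabulary) -> List[int]:
    """Tokens -> the index representation the model internals expect."""
    return vocab.encode(tokens)


def model_internals(indices: List[int]) -> List[int]:
    """Placeholder model: reverses the sequence instead of generating a reply."""
    return list(reversed(indices))


def model_specific_postprocess(indices: List[int], vocab: Vocabulary) -> str:
    """Model output -> a plain sentence for the generic Post-processing component."""
    return " ".join(vocab.decode(indices))


vocab = Vocabulary(["hello", "world"])
indices = model_specific_preprocess(["hello", "there", "world"], vocab)
reply = model_specific_postprocess(model_internals(indices), vocab)
print(indices, reply)
```

Keeping the index conversion inside the model-specific sub-components means the generic Pre-processing and Post-processing components never need to know about a model's vocabulary.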
Although these sub-modules invite analogies with the conventional units of a goal-oriented dialogue system, with the Dialogue Manager (DM) as Model Internals, Natural Language Understanding (NLU) as Model Specific Pre-processing, and Natural Language Generation (NLG) as Model Specific Post-processing, MACA does not impose any restriction on how the framework's sub-modules should correspond to these conventional parts. For example, the architect may choose to have the Model Internals sub-module act as an NLU unit, while Model Specific Post-processing acts as both the NLG unit and the DM unit.
In addition, since the model may also be an ensemble of dialogue models, the model-specific pre- and post-processing sub-components can be used to hold processing units specific to each model in the architecture. In a typical implementation of an ensemble, the Model Specific Pre-processing sub-component provides separate inputs, parsed from the Pre-processing component, to the corresponding models, while the Model Specific Post-processing sub-component performs majority voting or another ensemble technique to select the response pool.
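A minimal sketch of this ensemble arrangement is given below, assuming three toy models and simple majority voting. The `ENSEMBLE` table, its `expects`/`respond` keys, and the function names are invented for the example and do not reflect MACA's interfaces.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical ensemble: each member names the input representation it expects.
ENSEMBLE: Dict[str, dict] = {
    "keyword_model": {
        "expects": "tokens",
        "respond": lambda tokens: "hello" if "hi" in tokens else "sorry?",
    },
    "length_model": {
        "expects": "raw",
        "respond": lambda raw: "hello" if len(raw) < 20 else "sorry?",
    },
    "fallback_model": {
        "expects": "raw",
        "respond": lambda raw: "sorry?",
    },
}


def route_inputs(preprocessed: dict, ensemble: Dict[str, dict]) -> dict:
    """Model-specific pre-processing: hand each model the representation it expects."""
    return {name: preprocessed[member["expects"]] for name, member in ensemble.items()}


def majority_vote(responses: List[str]) -> str:
    """Model-specific post-processing: pick the most frequent response."""
    return Counter(responses).most_common(1)[0][0]


def ensemble_respond(preprocessed: dict, ensemble: Dict[str, dict] = ENSEMBLE) -> str:
    inputs = route_inputs(preprocessed, ensemble)
    responses = [member["respond"](inputs[name]) for name, member in ensemble.items()]
    return majority_vote(responses)


preprocessed = {"raw": "hi there", "tokens": ["hi", "there"]}
print(ensemble_respond(preprocessed))
```

Other ensemble techniques (e.g. weighted voting or re-ranking) would replace only `majority_vote` in this sketch.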

Postprocessing
The Post-processing component connects the Dialogue Model and the Output components. It allows the architect to choose the response in the case of multi-response retrieval, to alter responses based on linguistic characteristics, or to modify a response in accordance with the conversation domain. It may also translate text into system calls, which is useful when a dialogue agent is placed as the front-end interface to another software system. Similar to the Pre-processing module, this component includes one or multiple post-processing operations, which process the output in parallel or in sequence, depending on the designer's specification. These post-processing operations can also query the Domain Knowledge component for data required to generate the text response.

Output
Through the Output component, the architecture provides a generic way to deliver the response to the appropriate audience(s), depending on the use case. The currently implemented options are command line, file-based, web-based, and database outputs. Similar to the Input component, the Output component gives the architect the flexibility to change the destination of produced outputs and to separate the output programming logic from that of other components.

Pubsub system/Listeners
In addition to the main pipeline presented above, the proposed system also includes a passive pubsub layer to facilitate monitoring, conversation recording, and independent evaluation of the model. This pubsub system allows the architect to choose or plug in a wide range of peripheral components (called Listeners) to passively monitor the main system for execution behaviors and performance. On top of several default channels (see Operation modes section below) that the system writes to and reads from, users can freely add their own channels to communicate between the main system and the pubsub layer hosting the peripherals.
Listeners, as previously mentioned, are optional modules that can be plugged in to passively monitor the system over different channels. These modules are useful when the architect is interested in observing the system inputs and/or outputs, or visualizing internal parameters or states of the dialogue model at execution time. Passive monitoring logic can be independently introduced into the system without modifying the other components' implementations.
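The publish/subscribe layer can be pictured with the following sketch. The `PubSub` and `RecordingListener` classes are hypothetical; only the channel names `input` and `output` come from the default channels named in this paper, while `debug` stands for a user-defined channel.

```python
from typing import Callable, Dict, List, Tuple


class PubSub:
    """Hypothetical in-process pubsub layer with named channels."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable]] = {}

    def subscribe(self, channel: str, listener: Callable) -> None:
        self._subscribers.setdefault(channel, []).append(listener)

    def publish(self, channel: str, message) -> None:
        # Publishing to a channel without subscribers is a no-op.
        for listener in self._subscribers.get(channel, []):
            listener(channel, message)


class RecordingListener:
    """Passive listener that stores everything it is notified about."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, object]] = []

    def __call__(self, channel: str, message) -> None:
        self.records.append((channel, message))


bus = PubSub()
recorder = RecordingListener()
bus.subscribe("input", recorder)
bus.subscribe("output", recorder)

# The main pipeline would publish at the corresponding points of a turn.
bus.publish("input", "hi")
bus.publish("output", "hello")
bus.publish("debug", "nobody is subscribed to this channel")
print(recorder.records)
```

Since the main pipeline only publishes and never waits on a listener's result, monitoring logic can be added or removed without touching the other components.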

Operation modes
MACA can be operated in three different modes: Data Collection, Training, and Execution. This section describes the data flow in the architecture along with abstract setups of the framework's components in these operation modes for several dialogue models from the recent literature.

Data Collection Mode
The goal of the data collection mode is to collect conversations as training datasets for dialogue models. In this mode, the two agents involved in the conversation, Alice and Bob, are treated as the Input component and the Dialogue Model component respectively. Figure 2 describes a typical setup for the data collection process with this configuration. The conversation is recorded using a database listener that receives both the input (context) and the output (response) for each speaking turn, similar to the scheme presented in section 3.2.3 above.
This setup realizes the infrastructure required for two common dialogue data collection scenarios. The first scenario is the collection of both contexts and responses. In this case, both agents are humans. In the second scenario, the goal is to collect human responses for a given set of contexts. In this case, agent Alice can be an implementation of the Input component fetching contexts from a database, while Bob is a human agent responding to the fetched contexts.

Training Mode
The goal of the training and validation mode is to use the data obtained in the data collection stage to train one or multiple dialogue models, as illustrated in Figure 3. Assuming a dataset is available from the Domain Knowledge component, training data can be fetched as batches by the Input component and fed into the VoidPreprocessing component. This component simply forwards the data as is to the Dialogue Model component, which performs model training and occasionally queries the domain knowledge for validation data to verify its training progress. Since system output is irrelevant within the training scenario, the Post-processing and Output components are implemented with null operations, which simply discard their received contents. Once a certain validation accuracy is achieved, the model can save its internals to disk and terminate the system. In addition to the core training process, the architect may opt to emit training information to a listener through the training channel to monitor the training progress.

Execution Mode
In this mode, all core components in the system are enabled and active. Given that the dialogue model has been successfully trained and fine-tuned, its internal states (e.g. weights, hyperparameters) are loaded into the Dialogue Model component at system initialization time. Input data is retrieved in real time, either through a local user interface (e.g. terminal, GUI) or via an interface with the Internet (e.g. web page, chat client).
This input then enters the pipeline and passes through the Pre-processing, Dialogue Model, Post-processing, and finally Output components. At the end of the pipeline, the Output component is responsible for sending the generated responses to the relevant audiences (e.g. print to stdout, HTTP response).
From the perspective of the peripheral components, conversation logging and system monitoring can be done through two default channels: input and output. Specifically, as shown in Figure 4, the passive listener receives a notification on the input channel for every input received from the Input component, and a notification on the output channel for every output received by the Output component.
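For contrast with execution mode, the training-mode data flow described earlier in this section can be sketched as a loop in which pre-processing forwards batches unchanged, output is discarded, and training stops once a validation threshold is reached. Everything here is a toy assumption: `CountingModel` only pretends to learn, its "accuracy" is simply the fraction of the dataset seen so far, and saving the model to disk is omitted.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def batches(dataset: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Input component: yield the dataset in fixed-size batches."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


def void_preprocess(batch):
    """Forwards the data unchanged, as a void pre-processor would."""
    return batch


def null_output(_response) -> None:
    """Discards whatever it receives; output is irrelevant during training."""
    return None


class CountingModel:
    """Toy stand-in for a dialogue model (assumption, not a real learner)."""

    def __init__(self, dataset_size: int) -> None:
        self.dataset_size = dataset_size
        self.seen = 0

    def train_on(self, batch: Sequence) -> None:
        self.seen += len(batch)

    def validate(self) -> float:
        return self.seen / self.dataset_size


def train(dataset, model, batch_size: int, target_accuracy: float,
          on_progress: Optional[Callable[[int, float], None]] = None) -> Optional[int]:
    """Return the step at which the target was reached, or None if it never was."""
    for step, batch in enumerate(batches(dataset, batch_size), start=1):
        model.train_on(void_preprocess(batch))
        accuracy = model.validate()
        if on_progress is not None:
            on_progress(step, accuracy)  # e.g. a listener on a training channel
        null_output(accuracy)
        if accuracy >= target_accuracy:
            return step
    return None


dataset = list(range(10))
progress: List[tuple] = []
steps = train(dataset, CountingModel(len(dataset)), batch_size=3,
              target_accuracy=0.5, on_progress=lambda s, a: progress.append((s, a)))
print(steps, progress)
```

The `on_progress` callback plays the role of the optional training-channel listener; leaving it out does not change the training loop.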

Feature Highlights
As discussed in the previous sections, MACA can be used to plug in different types of existing dialogue agents. The architecture abstracts the implementation details, similar to popular machine learning libraries such as Theano (Theano Development Team, 2016), Tensorflow (Abadi et al., 2016), or PyTorch. The modular design enables rapid prototyping and should facilitate reproducing previous results. MACA also supports experimentation with, and extension and development of, slot-based dialogue agents for goal-oriented tasks. In addition, the current implementation uses a rule-based approach for slot disambiguation and has provisions for easily extending slot disambiguation to machine learning (ML) based modules. The clear separation of domain knowledge from the agent aids in building multi-agent systems with little dependence on the domain: intent identification is provided at a higher level to identify and trigger the task, which is defined as a set of slots and ask queries. Intent identification also supports hosting of multiple tasks.
The framework provides tools for easy hosting of dialogue tasks as HIT (Human Intelligence Task) on Amazon mTurk to collect human responses; the framework also supports modelling dialogue tasks as an agent-agent interaction that can be used to test a dialogue agent against simulated users (Schatzmann et al., 2005b). A summary of MACA's features is provided in Table 1.

Table 1: Summary of features (Multi Domain Support, Plug-and-Play, Adaptation for FCA, Agent Abstraction, Integration with mTurk) for MACA, TCP, and Ravenclaw.

Implementation Highlights
MACA's current implementation is in Python and uses standard libraries to ensure the framework's portability, as well as to facilitate rapid prototyping of different dialogue model strategies. Each component of the framework (e.g. the Input component) is described by an abstract Python class, whose concrete implementation instances (i.e. Python objects) are manifestations of that component (e.g. command-line input, database input). This corresponds to the abstraction layer of the architecture's modules and fosters independence of the pipeline implementation from that of the underlying dialogue model(s). The assembly of these components is then specified in a central configuration file representing an instantiation of the architecture. With this design, the instantiation can be changed within the central configuration file by modifying the names of the invoked modules. Moreover, this setup allows system specifications to be completely contained within the central configuration file, which reduces maintenance effort and simplifies configuration changes during development. In addition, the open source nature of the framework encourages sharing and reuse of components, which allows researchers to build on existing models and save time by reusing common components written by others.
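One way to picture configuration-driven assembly is a lookup from class names to classes, as in the sketch below. This is an assumption about the mechanism rather than a description of MACA's code: the `REGISTRY`, the `build` function, and the two toy agents are hypothetical, and a real system might resolve names through dynamic imports instead.

```python
from typing import Dict


class EchoAgent:
    def respond(self, utterance: str) -> str:
        return utterance


class UppercaseAgent:
    def respond(self, utterance: str) -> str:
        return utterance.upper()


class PrefixPostprocessor:
    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def process(self, text: str) -> str:
        return self.prefix + text


# Hypothetical registry mapping the names used in a configuration to classes.
REGISTRY: Dict[str, type] = {
    cls.__name__: cls for cls in (EchoAgent, UppercaseAgent, PrefixPostprocessor)
}


def build(config: Dict[str, dict]) -> Dict[str, object]:
    """Instantiate one component per attribute from its 'class' name and arguments."""
    components = {}
    for attribute, spec in config.items():
        arguments = dict(spec)                 # copy so the config is not mutated
        cls = REGISTRY[arguments.pop("class")]
        components[attribute] = cls(**arguments)
    return components


config = {
    "agent": {"class": "EchoAgent"},
    "postprocessing": {"class": "PrefixPostprocessor", "prefix": "bot: "},
}
system = build(config)
print(system["postprocessing"].process(system["agent"].respond("hi")))

# Swapping the agent only requires changing a name in the configuration.
config["agent"]["class"] = "UppercaseAgent"
system = build(config)
print(system["postprocessing"].process(system["agent"].respond("hi")))
```

The two runs differ only in one string of the configuration, which is the kind of change the central configuration file is meant to localize.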

Case Studies
MACA was deployed for several studies within our research group. All conducted studies share the same template for the central configuration file, whose content is then modified according to the purpose of each study. Listing 1 shows the configuration template representing a system with a simple dialogue agent, which repeats its input (echo agent). The configuration file requires several attributes to be specified and provides a general outlook of the experiment being run. The template contains the following attributes: input, output, preprocessing, postprocessing, agent, domain knowledge, and listeners. The class sub-attribute of each attribute refers to the Python class implementation of the component being invoked. (Some of the configuration file samples provided in the listings in this section are slightly modified to fit the page limit constraint.)

Building a simple agent
The Echo agent is designed to simply listen and store the input to a file; this is a good first test case for new users of MACA. In this setup, the input attribute is instantiated with StdinInputDevice, which reads command-line input, and the output attribute is instantiated with FileOutputDevice, which writes the results to a file. Likewise, the other attributes, such as postprocessing, preprocessing, and domain knowledge, point to VoidPostprocessor, VoidPreprocessor, and EmptyDomainKnowledge respectively, since the Echo agent does not require them. The agent attribute is instantiated with the appropriate dialogue agent, which in this case is the Echo agent. Along with these components, LoggingListener, which logs the input and output of the system to an output file, is included as a listener component.
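To make the setup above concrete, the following is an illustrative sketch of what such a configuration could look like when written as a Python dictionary. The component class names are those mentioned in the text; the key layout, the `EchoAgent` class name, the `path` sub-attributes, and the file names are assumptions and may differ from MACA's actual configuration format.

```python
from typing import Dict, List

# Hypothetical layout: attribute names follow the template described above.
# "EchoAgent" and both file paths are assumed purely for illustration.
echo_config: Dict[str, object] = {
    "input": {"class": "StdinInputDevice"},
    "output": {"class": "FileOutputDevice", "path": "echo_output.txt"},
    "preprocessing": {"class": "VoidPreprocessor"},
    "postprocessing": {"class": "VoidPostprocessor"},
    "agent": {"class": "EchoAgent"},
    "domain_knowledge": {"class": "EmptyDomainKnowledge"},
    "listeners": [{"class": "LoggingListener", "path": "echo_log.txt"}],
}

REQUIRED_ATTRIBUTES = (
    "input", "output", "preprocessing", "postprocessing",
    "agent", "domain_knowledge", "listeners",
)


def missing_attributes(config: Dict[str, object]) -> List[str]:
    """Return the template attributes that a configuration does not define."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in config]


print(missing_attributes(echo_config))
```

A check such as `missing_attributes` is not described in the paper; it is included only to show how the seven template attributes could be validated before the pipeline is assembled.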

Building a goal oriented system
Next, we consider using MACA to build goal-oriented agents for the restaurant, flight booking, and other toy domains. These slot-based agents were developed using the tools provided in the framework that aid in hierarchical task decomposition and slot sharing across tasks (as in the example reusing the same Python variables). With regard to hosting a multi-task agent, the goal-oriented policies/sub-agents for each task are invoked through the description of its slots (askQuery, disambiguation strategy, etc.). As part of its multi-agent support, the architecture can handle multiple intents, with intent triggers defined for each of them. For example, "I would like to book a flight" will trigger the flight booking policy, which will fill in slots specific to this task based on the information provided in the domain knowledge, whereas "What's a good restaurant nearby?" will trigger the restaurant booking policy. The configuration file modification in the agent and domain knowledge attributes is provided in Listing 2.
Listing 2: Sample Agent attribute in Goal Oriented Dialogue models' Configuration.
An overview of the architecture components in the goal oriented setting is provided in Table 2.
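The intent-triggering and slot-sharing behaviour described in this section can be sketched as follows. The `Slot` dataclass, the keyword-based triggers, and the `DOMAIN_KNOWLEDGE` layout are assumptions chosen for brevity; the paper does not specify how triggers are matched or how slots are stored.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Slot:
    name: str
    ask_query: str  # question the agent asks to fill this slot


# A slot object shared by both tasks, mirroring the reuse of Python variables.
DATE = Slot("date", "For which date?")

# Hypothetical domain knowledge: tasks, their intent triggers, and their slots.
DOMAIN_KNOWLEDGE: Dict[str, dict] = {
    "flight_booking": {
        "triggers": ("flight",),
        "slots": [Slot("destination", "Where are you flying to?"), DATE],
    },
    "restaurant_booking": {
        "triggers": ("restaurant",),
        "slots": [Slot("cuisine", "What kind of food would you like?"), DATE],
    },
}


def identify_intent(utterance: str) -> Optional[str]:
    """Return the first task whose trigger keyword occurs in the utterance."""
    text = utterance.lower()
    for task, spec in DOMAIN_KNOWLEDGE.items():
        if any(trigger in text for trigger in spec["triggers"]):
            return task
    return None


def next_question(task: str, filled: Dict[str, str]) -> Optional[str]:
    """Return the ask query of the first unfilled slot, or None when all are filled."""
    for slot in DOMAIN_KNOWLEDGE[task]["slots"]:
        if slot.name not in filled:
            return slot.ask_query
    return None


task = identify_intent("I would like to book a flight")
print(task, "->", next_question(task, {}))
print(identify_intent("What's a good restaurant nearby?"))
```

Because the tasks and their slots live in the domain knowledge rather than in the agent, adding a task in this sketch means adding one dictionary entry, with no change to `identify_intent` or `next_question`.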

Building a neural response generation agent
We also used MACA to prototype neural response generation agents based on the Hierarchical Encoder-Decoder framework (Serban et al., 2015).

HRED in training mode
MACA's training mode was tested with the training process of an HRED agent. The modifications to the central configuration file for this setup are presented in Listing 3. HREDTrainingInputDevice simply invokes the training process by sending an initiate message to the model, while the dialogue model HREDAgent, configured to be in training mode, starts its regular training process and writes the trained weights to disk. The training dataset is specified using the prototype sub-attribute (in compliance with the HRED code base) within the train args attribute of agent. All other components of the pipeline are unchanged, as it is unnecessary to post-process or to output data. The HRED agent was trained using both the Twitter Corpus (Ritter et al., 2011) and the Ubuntu Dialogue Corpus (Lowe et al., 2015b).
Listing 3: Modified attributes for HRED training.

HRED in execution mode
We also tested a trained HRED agent in execution and data collection modes. In the execution mode, MACA used the command line as both the input and the output unit to fetch user responses and show model responses from HRED. In the data collection mode, MACA was hosted on a local psiTurk (Gureckis et al., 2016) server emulating mTurk. A layout that lets users chat and score the model responses was provided, and user inputs were logged by a database listener through the pubsub architecture. In this scenario, the pre-trained HRED model can be seen as a case of a custom-built dialogue agent adapted to MACA.
Listing 4: Agent attribute in HRED Configuration.
The central configuration file from Listing 1 is updated for HRED in execution mode, as shown in Listing 4. The model-specific arguments, provided between lines 3 and 14 in Listing 4, demonstrate MACA's support for plugging in customized or pre-trained dialogue agents. Furthermore, an overview of the architecture, with the instantiated components and their roles, is provided in Table 3.

Building a neural response retrieval agent
Finally, we built an architecture that incorporates a neural response retrieval agent operating using the Dual Encoder method (Lowe et al., 2015a).

Dual Encoder in training mode
Listing 5 presents changes to the template configuration to incorporate a Dual Encoder dialogue agent in training mode. Similar to the HRED model training case, we replace the Input and Model modules in the template configuration. In the case of Dual Encoder, the specified data set will be loaded into DomainKnowledge and will become accessible after initialization. During the training process, RetrievalModelTrainingInputDevice retrieves the data from the specified training data set via DomainKnowledge and feeds it to the Dialogue Model, while the RetrievalModelAgent contains the relevant training parameters. Once training finishes, RetrievalModelTrainingInputDevice issues a message to the agent to write out the trained weights to disk.

Dual Encoder in execution mode
We also tested the Dual Encoder agent in execution mode, which is an instance of adapting a retrieval-based model to the proposed framework. The execution mode in this case obtained inputs from a database of previously collected context-response pairs. The configuration file for the Dual Encoder model is mostly similar to the generic template, with modifications to the agent attribute, described in Listing 6. The configuration file's flexibility allows customized agents to be plugged in with ease, while providing the parameters for the model to run in the model params sub-attribute. Further, an overview of MACA with its instantiated components and their roles is provided in Table 4; specification of these attributes within MACA is achieved through the configuration file.

Discussion
MACA offers a unified architecture for dialogue agents that supports the plug-n-play of different types of dialogue agents and different domains. We hope that this will not only facilitate the fast development of new models, but also foster reproducibility in dialogue system research.
Limitations of the current implementation of MACA include the simplicity of the pubsub system, the lack of support for distributed hosting of different components of the architecture, and the lack of support for parallel conversations. As future work, the pubsub system could be improved by capturing a wider range of system information with more monitoring channels. In addition, we plan to incorporate new domains and agents as they become available, along with comprehensive ML-based slot-disambiguation modules.