Frames: a corpus for adding memory to goal-oriented dialogue systems

This paper proposes a new dataset, Frames, composed of 1369 human-human dialogues with an average of 15 turns per dialogue. This corpus contains goal-oriented dialogues between users who are given some constraints to book a trip and assistants who search a database to find appropriate trips. The users exhibit complex decision-making behaviour which involves comparing trips, exploring different options, and selecting among the trips that were discussed during the dialogue. To drive research on dialogue systems towards handling such behaviour, we have annotated and released the dataset and we propose in this paper a task called frame tracking. This task consists of keeping track of different semantic frames throughout each dialogue. We propose a rule-based baseline and analyse the frame tracking task through this baseline.


INTRODUCTION
Goal-oriented, information-retrieving dialogue systems have traditionally been designed to help users find items in a database given a certain set of constraints (El Asri et al., 2014; Laroche et al., 2011; Raux et al., 2003; Singh et al., 2002). For instance, the LET'S GO dialogue system finds a bus schedule given a bus number and a location (Raux et al., 2003).
These systems model dialogue as a sequential process: the system asks for constraints until it can query the database and return a few results to the user. Then, the user can ask for more information about a given result or ask for other possibilities. If the user wants to know about database items corresponding to a different set of constraints (e.g., another bus line), then these constraints simply overwrite the previous ones. As a consequence, users can neither compare results corresponding to different constraints, nor go back-and-forth between results.
We can assume users in the bus domain know exactly what they want. In contrast, user studies in e-commerce have shown that several information-seeking behaviours are encountered: users may come with a very well-defined item in mind, but they may also visit an e-commerce website with the intent to compare items and explore different possibilities (Moe and Fader, 2001). Supporting this kind of decision-making process in conversational systems implies adding memory. Memory is necessary to track different items or preferences set by the user during the dialogue. For instance, consider product comparisons. If a user wants to compare different items using a dialogue system, then this system should be able to separately recall properties pertaining to each item. This paper presents the Frames dataset, which comprises dialogues that require this type of memory. Frames is a corpus of 1369 human-human dialogues collected in a Wizard-of-Oz (WOz) setting, i.e., users were connected to humans, whom we refer to as the wizards, who were assuming the role of the dialogue system. The wizards had access to a database of vacation packages containing round-trip flights and a hotel. The users were asked to find packages based on a few constraints such as a destination and a budget.
In order to test the memory capabilities of conversational agents, we formalize a new task called frame tracking. In frame tracking, the agent must simultaneously track multiple semantic frames throughout the dialogue. For example, two frames would be constructed and recalled while comparing two products, each containing the properties of a specific item. Frame tracking is a generalization of the state tracking task (Henderson, 2015). In state tracking, all the information summarizing a dialogue history is compressed into one semantic frame. In contrast, several frames are kept in memory during frame tracking, with each frame corresponding to a particular context, e.g., one or more vacation packages.
Another important property of human dialogue that we want to study with the Frames dataset is how to provide the user with information on the database. When a set of user constraints leads to no results, users would benefit from knowing that relaxing a given constraint (e.g., increasing the budget by a reasonable amount) would lead to results instead of navigating the database blindly. This recommendation behaviour is in accordance with Grice's cooperative principle (Grice, 1989): "Make your contribution as informative as is required (for the current purposes of the exchange)". We study this by including suggestions when the database returns no results for a given user query.
This paper describes the Frames dataset in detail, formally defines the frame tracking task, and provides a baseline model for frame tracking. The next section discusses motivation for the Frames dataset. Section 3 explains the data collection process and Section 4 describes the dataset in detail. We describe the annotation scheme in Section 5. In Section 6, we identify the main research topics of the corpus and formalize the frame tracking task. The dialogue data format is described in Section 7. Section 8.1 proposes a baseline model for frame tracking. Finally, we conclude in Section 9 and suggest directions for future work.

MOTIVATION
Much work has focused on spoken dialogue (Lemon and Pietquin, 2007; Walker et al., 1998; Williams and Zweig, 2016), since spoken dialogue systems are useful in many settings, including in hands-free environments such as cars (Lemon et al., 2006). A generation of voice assistants, such as SIRI, Cortana, and Google Voice, has popularized spoken dialogue systems. More recently, users have become familiar with chatbots. Many platforms for deploying chatbots are now available, such as Facebook Messenger, Slack, or Kik. Text offers advantages over voice such as privacy and the ability to avoid bad speech recognition in noisy environments. Chatbots provide a welcome alternative to downloading and installing applications, and make a lot of sense for everyday services such as ordering a cab or knowing the weather. Chatbots have been proposed for tasks that one would traditionally perform through Graphical User Interfaces (GUIs). For instance, many chatbots for booking a flight are now entering the market.
In most cases, as with current voice-based assistants, the conversation with a chatbot is very limited: asking for the weather and ordering a cab are accomplished with simple, sequential slot-filling. These tasks have in common the fact that in both cases the user knows exactly what she wants, i.e., the destination for the cab or the city for the weather. Booking a flight is a bit different. Flight booking requires specifying many parameters, and these are usually determined during the search process. Technically, finding a weather forecast is only about reading a database: the task is to form a complete database query and then to verbalise the result to the user. The user might start with very few constraints and then refine her query given the database results. In the case of booking a flight, there is a decision-making process requiring comparison and backtracking.
GUIs are not optimal on many levels when it comes to helping users through this decision-making process. A first point of friction is the limited visual space. Consider the example of e-commerce websites. A user is very likely to compare different options before picking an item to buy. This often results in a large number of open browser tabs among which the user must navigate. In order to avoid this situation, some websites provide a comparator that can be used to display several items on a single page. However, this option is not optimal for hierarchical objects such as vacation packages because these objects have global properties (dates of the trips) but are also composed of different modules (flights and hotel) which have different properties (e.g., seat class and hotel category). Optimally, the user should be able to define the properties for the different modules while being able to compare items corresponding to each set of properties. A text interface could complement a GUI by offering this flexibility while remembering the properties mentioned by the user and displaying comparisons when asked.
We propose the Frames dataset to support work on text-based conversational agents which help a user make a decision. The decision-making process is tightly coupled with the notion of memory. Indeed, if a user intends to compare different options in the course of defining the options, the system should follow the user's path and remember every option. In this paper, we formalize this aspect of conversation in the frame tracking task. This task is the main challenge of the Frames corpus, and Section 6 describes it in detail.

DATA COLLECTION
We collected data over a period of 20 days and with 12 participants. To increase variation in the dialogues, 8 participants were hired for only one week each, and one participant was hired for one day. The three remaining participants participated in the entire data collection. The participants were paired up for each dialogue and interacted through a chat interface.

WIZARD-OF-OZ DATA COLLECTION
Data collection for goal-oriented dialogue is challenging. To control the data such that specific aspects of the problem can be studied, it is common to collect dialogues using an automated system. This requires, e.g., a natural language understanding module that already performs well on the task, which implies possession of in-domain data or the ability to generate it (Henderson et al., 2014b; Raux et al., 2003). Another possibility, which permits even greater control, is to generate dialogues using a rule-based system (Bordes and Weston, 2016). These approaches are useful for studying specific modules and analysing the behaviour of different architectures. However, it is costly to generate new dialogues for each experiment and skills acquired on artificial data are not directly usable in real settings because of natural language understanding noise (Bordes and Weston, 2016).
The Wizard-of-Oz (WOz) approach offers a powerful alternative (Kelley, 1984; Rieser et al., 2005; Wen et al., 2016). In WOz data collection, one participant (the wizard) plays the role of the dialogue system. The wizard has access to a search interface connected to the database. She receives the user's input in text form and decides what to say next. This does not require preexisting dialogue system components, except potentially automatic speech recognition for transcribing the user's inputs. Dialogues collected in WOz settings can be used for studying and developing every part of a dialogue system, from language understanding to language generation. They are also essential for offline training of end-to-end dialogue systems (Bordes and Weston, 2016; Wen et al., 2016) on different domains, which may reduce costs from handcrafting new systems for each domain.
WOz dialogues also have the considerable advantage of exhibiting realistic behaviours that cannot be supported by current (end-to-end or not) architectures. Since there is no dialogue system that incorporates the type of memory that we want to study with this dataset, we need to work directly on human-human dialogues. Our setting is a bit different from the usual WOz setting because, in our case, the users did not think they were interacting with a dialogue system but instead knew that they were talking to a human being. We made the choice not to give templated answers to the wizards because, apart from studying memory, we also want to study information presentation and dialogue management. We have chosen to work on text-based dialogues because this allows a more controlled wizard behaviour, obviates handling time-sensitive turn-taking and speech recognition noise, and allows studying more complex dialogue flows.

TASK TEMPLATES AND INSTRUCTIONS
Dialogues were performed on Slack. We deployed a Slack bot named wozbot to pair up participants and record conversations. The participants in the user role indicated when they were available for a new dialogue through this bot. They were then assigned to an available wizard and received a new task. The tasks were built from templates such as the following: Each template had a probability of success. The tasks were generated by drawing values (e.g., BUDGET) from the database. The generated tasks were then added to a pool. The values were drawn in order to comply with the template's probability of success. For example, if 20 tasks were generated at probability 0.5, about 10 tasks would be generated with successful database queries and the other 10 would be generated so the database returned no results for the constraints. This mechanism allowed us to emulate cases when a user would not find anything meeting her constraints. If a task was unsuccessful, the user either ended the dialogue or got an alternate task such as: "If nothing matches your constraints, try increasing your budget by $200." We wrote 38 templates. 14 templates were generic such as the one presented above and the other 24 were written to encourage more role-playing from users. One example is: . You want to compare the packages between the different cities and book one, the one that will take you to your destiny." These templates were meant to add variety to the dialogues. The generic templates were also important for the users to create their own character and personality. We found that the combination of the two types of templates prevented the task from becoming too repetitive. Notably, we distributed the role-playing templates throughout the data collection process to bring some novelty and surprise. We also asked the participants to write templates (13 of them) to keep them engaged in the task.
To control data collection, we gave a set of instructions to the participants. The users received the following instructions:
• Do not use uncommon slang terms, but feel free to use colloquialisms.
• Make up personalities.
• Feel free to end the conversation at any time.
• Try to spell things correctly.
• You do not necessarily have to choose an option.
• Try to determine what you can get for your money.
These instructions were meant to encourage a variety of behaviours from the users. As for the wizards, they were asked to only talk about the database results and the task at hand. This is necessary if one wants to build a dialogue system that emulates the wizards' behaviour in this corpus. The wizard instructions were as follows:
• Be polite, and do not jump in on the role play of the users.
• Vary the way you answer the user, sometimes you can say something that would be right at another point in a dialogue.
• Your knowledge of the world is limited by your database.
• Try to spell things correctly.
We asked the wizards to sometimes act badly (second point in the list). It is interesting from a dialogue management point of view to have examples of bad behaviour and of how it impacts user satisfaction. At the end of each dialogue, the user was asked to provide a wizard cooperativity rating on a scale of 1 to 5. The wizard, on the other hand, was shown the user's task and was asked whether she thought the user had accomplished it.

DATABASE SEARCH INTERFACE
Wizards received a link to a search interface every time a user was connected with them. The search interface was a simple GUI with all the searchable fields in the database (see Appendix A). For every search in the database, up to 10 results were displayed. These results were sorted by increasing price.

SUGGESTIONS
When a database query returned no results, suggestions were sometimes displayed to the wizards. Suggestions were packages obtained by relaxing one or more constraints. It was up to the wizard to decide whether or not to use suggestions. Our goal with suggestions is not to learn a recommender system, but to learn the timing of recommendation, hence the randomness of the mechanism.
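To make "relaxing one or more constraints" concrete, here is a minimal sketch. It assumes packages are flat dictionaries and that a constraint is an exact match on a field. The function names `search` and `suggest`, the example packages, and the deterministic enumeration order are all hypothetical; the mechanism used during data collection was randomized and is not specified in detail here.

```python
from itertools import combinations

def search(packages, constraints):
    """Return the packages that satisfy every constraint exactly."""
    return [p for p in packages
            if all(p.get(k) == v for k, v in constraints.items())]

def suggest(packages, constraints, max_relaxed=1):
    """Drop up to `max_relaxed` constraints until some package matches.

    Returns (dropped_constraint_names, matching_packages); both are empty
    if no relaxation of that size yields a result.
    """
    for n in range(1, max_relaxed + 1):
        for dropped in combinations(sorted(constraints), n):
            relaxed = {k: v for k, v in constraints.items() if k not in dropped}
            results = search(packages, relaxed)
            if results:
                return dropped, results
    return (), []

# Made-up database and query, for illustration only.
packages = [
    {"dst_city": "Marseille", "seat": "economy", "price": 900},
    {"dst_city": "Naples", "seat": "business", "price": 1500},
]
constraints = {"dst_city": "Marseille", "seat": "business"}

no_results = search(packages, constraints)      # the strict query fails
dropped, results = suggest(packages, constraints)
```

In this sketch the strict query returns nothing, and the first relaxation that succeeds is the one that drops the destination constraint.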

STATISTICS OF THE CORPUS
We used the data collection process described in the previous section to collect 1369 dialogues. Figure 1a shows the distribution of dialogue length in the corpus. The average number of turns is 15, for a total of 19986 turns in the dataset. A turn is defined as a Slack message sent by either a user or a wizard. A user turn is always followed by a wizard turn and vice versa. Figure 1b shows the number of acts per dialogue turn. About 25% of the dialogue turns have more than one dialogue act. The turns with no dialogue acts are turns where the user asked for something that the wizard could not provide, e.g., because it was not part of the database. An example in the dataset is: "Would my room have a view of the city? How much would it cost to upgrade to a room with a view?". We left such user turns unannotated; they are usually followed by the wizard saying she cannot provide the required information. Figure 1c shows the distribution of user ratings. More than 70% of the dialogues have the maximum rating of 5. Figure 1d shows the occurrences of dialogue acts in the corpus. The dialogue acts are described in Table 9 in Appendix B. We present the annotation scheme in the following section.

DIALOGUE ANNOTATION SCHEME
We annotated the Frames dataset with three types of labels:
1. Dialogue acts, slot types, slot values, and references to other frames for each utterance.
2. The ID of the currently active frame.
3. Frame labels which were automatically computed based on the previous two sets of labels.

DIALOGUE ACTS, SLOT TYPES, AND SLOT VALUES
Most of the dialogue acts used for annotation are acts which are usually encountered in the goal-oriented setting such as inform and offer. We also introduced dialogue acts which are specific to our frame tracking setting such as switch_frame and request_compare. The dialogue acts are listed in Table 9.
Our annotation uses three sets of slot types. The first set, listed in Tables 7 and 8, corresponds to the fields of the database. The second set is listed in Table 10 and contains the slot types which we defined in order to describe specific aspects of the dialogue such as intent and action. An intent indicates whether or not the user wants to book a package, whereas an action indicates whether or not the wizard should, or did, book it. We also introduced several count slot types which were used most often by wizards to summarize information in the database, e.g., "I have 2 hotels in Marseille". In this case, the wizard informs that the count for hotels is 2.
The remaining slot types in Table 10 were introduced to describe frames and cross-references between them. Before we discuss these slot types, we define frames more formally in the following section.

DEFINITION
Semantic frames form the core of our dataset. A semantic frame is defined by the following four components:
• User requests: slots whose values the user wants to know for this frame.
• User binary questions: user questions with slot types and slot values.
• Constraints: slots which have been set to a particular value by the user or the wizard.
• User comparison requests: slots whose values the user wants to know for this frame and one or more other frames.
Several of these labels are used in the Dialogue State Tracking Challenge (DSTC). In DSTC, a semantic frame contains the constraints set by the user, the user requests, and the user's search method (e.g., by constraints or alternatives). In our case, constraints can also be set by the wizard when she suggests or offers a package. Any field in the database (see Tables 7 and 8 in Appendix A) can be constrained by the user or the wizard. The comparison requests and the binary questions were added after analysing the dialogues. The comparison requests correspond to the request_compare dialogue act. This dialogue act is used to annotate turns when a user asks to compare different results, for instance: "Could you tell me which of these resorts offers free wifi?". These questions relate to several frames. Binary questions are questions with slot types and slot values, e.g., "Is this hotel in the downtown area of the city?" (request act), or "Is the trip to Marseille cheaper than to Naples?" (request_compare act), as well as all confirm acts. Binary questions concern one or several frames.

CREATION AND ANNOTATION
Each dialogue starts in frame 1. New frames are introduced when the wizard offers or suggests something, or when the user modifies pre-established slots. Thus, all values discussed during the dialogue are recorded and the user can return to a previous set of constraints at any point. An example is given in Table 1. In this example, the frame number changes when the user modifies several slot values: the destination city, the number of adults for the trip, and the budget. Though frames are created for each offer or suggestion made by the wizard, the active frame can only be changed by the user so that the user has control over the dialogue. If the user asks for more information about a specific offer or suggestion, the active frame is changed to the frame introduced with that offer or suggestion. This change of frame is indicated by a switch_frame act (see Appendix A). The rules for creating and switching frames are summarized in Table 2. We introduced specific slot types for recording the creation and modification of frames. These slot types are id, ref, read, and write (see Table 10 in Appendix B). The frame id is defined when the frame is created and is used to switch to this frame when the user decides to do so.
The other slot types, ref, read, and write, are used to annotate cross-references between frames, which are a crucial component of the recorded dialogues. A reference has two parts: the number of the frame it is referring to and the slots and values that are used to refer to that frame (if any). For instance, ref=[1{name=Tropic}] means that frame 1 is being referred to by the hotel name Tropic. If anaphora is used to refer to a frame, we annotated this with the slot ref_anaphora (e.g., "This is too long": inform(duration=too long,ref_anaphora=this)). Inside an offer dialogue act, a ref means that the frame corresponding to the offer is derived from another frame. For example, here is an utterance from the corpus, written by a wizard: "Here are a couple of options. The first option is a 3.0 star hotel (the Tropic), with a guest rating of 4.77/10 and a business class flight. The cost is 1002.27 USD. Or, if you prefer, you could choose the same 3.0 star hotel with a guest rating of 4.77/10 (the Tropic) and an economy flight, for 812.69." This utterance is annotated with the following dialogue acts:
• offer(category=3.0,name=Tropic,gst_rating=4.77/10,id=6);
• offer(ref=[6],seat=business,price=1002.27 USD,id=7);
• and offer(ref=[6],seat=economy,price=812.69,id=8).
Here, the frames corresponding to the last two offers are derived from the first one by inheriting all values.
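The inheritance implied by a ref inside an offer can be sketched as follows. This is an illustrative reading of the annotation, not the code used to build the corpus: frames are modelled as plain dictionaries, and the helper name `derive_frame` is hypothetical. The slot names and values are those of the three offer acts above.

```python
def derive_frame(frames, parent_id, new_id, own_values):
    """Create frame `new_id` from frame `parent_id`.

    The child starts as a copy of the parent (it inherits all values)
    and then records the values given in its own offer act.
    """
    child = dict(frames[parent_id])  # copy, so the parent stays untouched
    child.update(own_values)
    frames[new_id] = child
    return child

frames = {}
# offer(category=3.0,name=Tropic,gst_rating=4.77/10,id=6)
frames[6] = {"category": "3.0", "name": "Tropic", "gst_rating": "4.77/10"}
# offer(ref=[6],seat=business,price=1002.27 USD,id=7)
derive_frame(frames, 6, 7, {"seat": "business", "price": "1002.27 USD"})
# offer(ref=[6],seat=economy,price=812.69,id=8)
derive_frame(frames, 6, 8, {"seat": "economy", "price": "812.69"})
```

After these calls, frames 7 and 8 both carry the hotel name, category, and guest rating of frame 6, while frame 6 itself has no seat or price.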
The slot types read and write only occur inside a wizard's inform act and are used by the wizards to provide relations between offers or suggestions: read is used to indicate which frame the values are coming from (and which slots are used to refer to this frame, if any), while write indicates the frame where the slot values are to be written (and which slot values are used to refer to this frame, if any). If there is a read without a write, the current frame is assumed as the storage for the slot values. A slot type without a value indicates that the value is the same as in the referenced frame, but was not mentioned explicitly, e.g., "for the same price". Table 3 gives an example of how these slot types are used in practice: inform(read=[7{dst_city=Punta Cana, category=2.5}]) means that the values 2.5 and Punta Cana are to be read from frame 7, and to be written in the current frame. At this turn of the dialogue, the wizard repeats information from frame 7. The annotation inform(breakfast=False, write=[7{name=El Mar}]) means that the value False for breakfast is written in frame 7 and that frame 7 was identified in this utterance by the name of the hotel El Mar.
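The read and write semantics described above can be illustrated with a small sketch. The function `apply_inform` and its argument layout are hypothetical, and frames are again plain dictionaries. A slot listed under read with the value `None` stands for a slot type without a value. The price in frame 7 is made up for the example.

```python
def apply_inform(frames, active_id, slots, read=None, write=None):
    """Apply a wizard inform act to a dictionary of frames.

    read  = (source_id, {slot: value_or_None}): values coming from the source
            frame; None means "same value as in the source frame".
    write = (target_id, {slot: value}): frame in which `slots` are stored; the
            slots inside `write` only identify the frame and are not stored.
    Without a write, the active frame is the storage.
    """
    target_id = write[0] if write is not None else active_id
    target = frames.setdefault(target_id, {})
    if read is not None:
        source_id, read_slots = read
        for slot, value in read_slots.items():
            target[slot] = frames[source_id][slot] if value is None else value
    target.update(slots)
    return target

frames = {
    7: {"dst_city": "Punta Cana", "category": "2.5", "name": "El Mar",
        "price": "900 USD"},  # price is an invented value
    9: {},
}
# inform(read=[7{dst_city=Punta Cana, category=2.5}]) while frame 9 is active
apply_inform(frames, 9, {}, read=(7, {"dst_city": "Punta Cana", "category": "2.5"}))
# inform(breakfast=False, write=[7{name=El Mar}])
apply_inform(frames, 9, {"breakfast": False}, write=(7, {"name": "El Mar"}))
# a slot type without a value, as in "for the same price"
apply_inform(frames, 9, {}, read=(7, {"price": None}))
```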
The average number of frames created per dialogue is 6.71 and the average number of frame switches is 3.58. Figure 2 shows boxplots for the number of frame creations and the number of frame changes in the corpus.

RESEARCH TOPICS
Frames can be used to research many aspects of goal-oriented dialogue, from Natural Language Understanding (NLU) to natural language generation. In this section, we propose three topics that we believe are new and representative of this dataset.
FRAME TRACKING

DEFINITION
Frame tracking extends state tracking (Henderson, 2015) to a setting where several semantic frames are tracked simultaneously. In state tracking, the dialogue history is compressed into one semantic frame. Essentially, this implies that every new slot value overwrites the previous one, which prevents the user from comparing options or returning to an item discussed earlier. In frame tracking, a new value creates a new semantic frame. The frame tracking task is significantly harder as it requires, for each user utterance, identifying the active frame as well as all the frames modified by the utterance.
Definition 1 (Frame Tracking). At each user turn $t$, we assume access to the full dialogue history $H = \{f_1, \ldots, f_{n_{t-1}}\}$, where $f_i$ is a frame and $n_{t-1}$ is the number of frames created so far in the dialogue. For a user utterance $u_t$ at time $t$, we provide the following NLU labels: dialogue acts, slot types, and slot values. The goal of frame tracking is to predict if a new frame is created and to predict for each dialogue act the ref labels (possibly none) and the ids of the frames referenced.
Predicting the frame that is referenced by a dialogue act requires detecting if a new frame is created and recognizing a previous frame from the values mentioned by the user (potentially a synonym, e.g., NYC for New York), or by using the user utterance directly. It is necessary in many cases to use the user utterance directly because users do not always use slot values to refer to previous frames. An example in the corpus is a user asking: "Which package has the soonest departure?". In this case, the user refers to several frames (the packages) without ever explicitly describing which ones. This phenomenon is quite common for dialogue acts such as switch_frame (979 occurrences in the corpus) and request_compare (455 occurrences in the corpus). These cases can only be resolved by working on the text directly and solving anaphora.
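The contrast between state tracking and frame tracking can be sketched with two toy trackers. This is a simplified illustration and not the rule-based baseline of this paper: the class names are hypothetical, only user-set values are handled, and the choice to let a new frame inherit the untouched values of the previous one is an assumption made for the example.

```python
class StateTracker:
    """Single semantic frame: a new value overwrites the previous one."""

    def __init__(self):
        self.state = {}

    def update(self, slot, value):
        self.state[slot] = value


class FrameTracker:
    """Several frames: changing an already-set slot opens a new frame."""

    def __init__(self):
        self.frames = {1: {}}  # each dialogue starts in frame 1
        self.active = 1

    def update(self, slot, value):
        current = self.frames[self.active]
        if slot in current and current[slot] != value:
            new_id = max(self.frames) + 1
            # Assumption: the new frame keeps the other values of the old one.
            self.frames[new_id] = {**current, slot: value}
            self.active = new_id
        else:
            current[slot] = value

    def switch(self, frame_id):
        """Return to a previously created frame (user-initiated)."""
        self.active = frame_id


updates = [("dst_city", "Marseille"), ("budget", "2000"), ("dst_city", "Naples")]
state_tracker, frame_tracker = StateTracker(), FrameTracker()
for slot, value in updates:
    state_tracker.update(slot, value)
    frame_tracker.update(slot, value)
```

With these updates, the state tracker has lost the Marseille option, whereas the frame tracker still holds it in frame 1 and the user can switch back to it.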

EVALUATION METRICS
We define two metrics: frame identification and frame creation. For frame identification, for each dialogue act, we compare the ground truth pair (key-value, frame) to that predicted by the frame tracker. We compute performance as the number of correct predictions over the number of pairs. A prediction is correct if and only if the frame, key, and value are the same in the ground truth and prediction. The frame is the id of the referenced frame. The key and value are respectively the type and the value of the slot used to refer to the frame (these can be null).
For frame creation, we compute the number of times the frame tracker correctly predicts that a frame is created or correctly predicts that a frame has not been created over the number of dialogue turns.
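The two metrics can be written down as a short sketch. It assumes that gold and predicted references are aligned one-to-one per dialogue act and that each reference is a (key, value, frame id) triple, with `None` standing for a null key or value. The function names and the toy labels are illustrative only.

```python
def frame_identification_accuracy(gold, predicted):
    """Fraction of references whose key, value, and frame id all match."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def frame_creation_accuracy(gold, predicted):
    """Fraction of turns where 'a frame is created' (True/False) is right."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels: one (key, value, frame_id) triple per dialogue act.
gold_refs = [("name", "Tropic", 1), (None, None, 3), ("dst_city", "Marseille", 2)]
pred_refs = [("name", "Tropic", 1), (None, None, 2), ("dst_city", "Marseille", 2)]
identification = frame_identification_accuracy(gold_refs, pred_refs)

# Toy labels: one boolean per dialogue turn.
gold_created = [False, True, False, False]
pred_created = [False, True, True, False]
creation = frame_creation_accuracy(gold_created, pred_created)
```

In the toy example, the second reference points to the wrong frame, so two of three references are correct, and three of four turns have the right frame creation decision.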

RELATED WORK
Frame tracking is closely related to state tracking in that it extends the task from only tracking the current user goal to tracking all the user goals that occur during the dialogue.
Recent approaches to state tracking have been proposed that go beyond the sequential slot-filling approach. An important contribution is the Task Lineage-based Dialog State Tracking (TL-DST) proposed by Lee and Stent (2016). TL-DST is a framework that allows keeping track of tasks across different domains. Similarly to frame tracking, Lee and Stent (2016) build a structure of the dialogue containing different frames corresponding to different tasks. They defined different sub-tasks, among which task frame parsing, which is closely related to frame tracking except that they impose constraints on how a dialogue act can be assigned to a frame and a dialogue act can only relate to one frame. Because of the lack of data, Lee and Stent (2016) trained their tracking model on datasets released for DSTC (DSTC2 and DSTC3, Henderson et al., 2014a,b). As a result, they could artificially mix different tasks, e.g., looking for a restaurant and looking for a pub, but they could not study how human beings switch between topics. In addition, this framework can switch between different tasks but does not handle comparisons, which is an important aspect of frame tracking.
Another related approach was proposed by Perez and Liu (2016), who framed the state tracking task as a question-answering task. Their state tracker is based on a memory network (Weston et al., 2014) and can answer questions about the user goal at the end of the dialogue. They also propose adding functionalities such as keeping a list of the constraints expressed by the user during the dialogue.
We propose the dataset in order to encourage more research on complex state tracking behaviours. In addition, we propose the frame tracking task as a principled way of modelling such behaviour in the case of decision-making but researchers are free to use this dataset for any task that they define.

DIALOGUE MANAGEMENT
One of the notable aspects of this dataset is that memory is not only a matter of frame tracking. Most of the time, the wizard would speak about the current frame to ask or answer questions. However, sometimes, the wizard would talk about previous frames. We can see it as appealing to memories in a conversation. An example is given in Table 4. In the bold utterance in this dialogue, even though the active frame is frame 4, the wizard mentions a previous frame (frame 3). In order to reproduce this kind of behaviour, a dialogue manager would need to be able to identify potentially relevant frames for the current turn and to output actions for these frames. Table 4 also illustrates another novelty. In the utterance in italic, the wizard actually performs two actions. The first action consists of informing the user about the price of the regal resort and the second action consists of proposing another option, Hotel Globetrotter. Performing more than one action per turn is a challenge when using reinforcement learning (Fatemi et al., 2016; Gašić et al., 2012; Pietquin et al., 2011) and, to our knowledge, this has only been done in a simulated setting (Laroche et al., 2009).

NATURAL LANGUAGE GENERATION
An interesting behaviour observed in our dataset is that wizards often tended to summarize database results. An example is the wizard saying: "The cheapest available flight is 1947.14USD." In this case, the wizard informs the user that the database has no cheaper result than the one she is proposing. To imitate this behaviour, a dialogue system would need to reason over the database and decide how to present the results to the user.

DIALOGUES
We provide the Frames dialogues in JSON format. Each dialogue has five main fields: turns, labels, user_id, wizard_id, and id. The ids are unique for each dialogue (id), each user (user_id), and each wizard (wizard_id).
The labels have two fields:
• userSurveyRating: user rating of wizard cooperativity on a scale of 1 to 5 (see Section 3).
The turns have the following fields:
• author: "user" or "wizard".
• text: the author's utterance.
• labels: the id of the currently active frame (active_frame) as well as a list of dialogue acts (acts), each with a name and args (key-value pairs), and a list of dialogue acts without ref tags (acts_without_refs) for frame tracking.
• timestamp: timestamp for the message.
• frames: list of all frames after integrating the current turn. Each frame has the following labels:
• frame_id: id of the frame.
• frame_parent_id: id of the parent frame.
• requests, binary_questions, compare_requests: user questions (see Section 5.2.1).
• info: properties of the frame (see Section 5.2.1).
Note that each slot can have multiple values, which accumulate as long as the frame does not change. For example, price can be both "1000 USD" and "cheapest". Each value has a boolean property "negated", expressing whether the user negated the value of the corresponding slot, for instance "I don't want to stay 3 days" (negate(duration=3)), or negated an explicit confirmation question. When a user switches to a frame, we assume the user accepts all information provided by the wizard for that frame as "constraints". We drop these additional constraints when a constraint is modified by the user, or the user requests alternatives. Our motivation for this scheme is to make frames more distinguishable and encourage methods which correctly identify frame switches. In addition to slots and their values, we added the following fields to keep track of specific aspects of the dialogue:
• REJECTED: a boolean value expressing whether the user negated or affirmed an offer made by the wizard (corresponds to a negate act that does not follow a question).
• MOREINFO: a boolean value expressing whether the user wants to know more about this frame, which happens if the wizard withholds detailed information (see moreinfo act).
• db (wizard turns only): list of search queries made by the wizard with the associated search results/suggestions.
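As a rough illustration of how the fields described above fit together, the following sketch walks over one dialogue and collects a few per-turn statistics. The dialogue below is a made-up, heavily abbreviated example, not an entry of the corpus. The exact key spellings (for instance `active_frame`, and the `key`/`val` layout of args) and the function name `summarize` are assumptions and should be checked against the released files.

```python
def summarize(dialogue):
    """Collect simple statistics from one dialogue dictionary."""
    turns = dialogue["turns"]
    active = [turn["labels"]["active_frame"] for turn in turns]
    return {
        "turns": len(turns),
        "rating": dialogue["labels"]["userSurveyRating"],
        "acts_per_turn": [len(turn["labels"]["acts"]) for turn in turns],
        "active_frames": active,
        # number of times the active frame differs from the previous turn
        "frame_changes": sum(a != b for a, b in zip(active, active[1:])),
    }


# Invented example dialogue with only the fields used by `summarize`.
dialogue = {
    "id": "example-dialogue",
    "user_id": "U0",
    "wizard_id": "W0",
    "labels": {"userSurveyRating": 5},
    "turns": [
        {"author": "user", "text": "I'd like to go to Marseille.", "timestamp": 0,
         "labels": {"active_frame": 1, "acts": [
             {"name": "inform", "args": [{"key": "dst_city", "val": "Marseille"}]}]}},
        {"author": "wizard", "text": "I have 2 hotels in Marseille.", "timestamp": 1,
         "labels": {"active_frame": 1, "acts": [
             {"name": "inform", "args": [{"key": "count", "val": "2"}]}]}},
        {"author": "user", "text": "What about Naples? Is breakfast included?",
         "timestamp": 2,
         "labels": {"active_frame": 2, "acts": [
             {"name": "inform", "args": [{"key": "dst_city", "val": "Naples"}]},
             {"name": "request", "args": [{"key": "breakfast", "val": None}]}]}},
    ],
}

summary = summarize(dialogue)
```

For a real file, the same function could be applied to each dialogue obtained with `json.load`.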

HOTELS
The vacation packages were generated randomly. A database of packages can be created by using the search results in the JSON files containing the dialogues. The hotels in these search results have all the fields listed in the Hotel Properties section of Table 8 in Appendix A. Note that amenities or points of interest in the vicinity of the hotel are only listed in a hotel's description if they are true. For instance, the field breakfast is only present for hotels offering free breakfast. Figure 3 shows statistics for these boolean values.

We generated these word-level tags by matching the slot values in the manual annotations with the corresponding textual utterances. The act tags were also generated at the word level: for a given dialogue act with slot values, each word between the slot value that occurred first in the text and the one that occurred last in the text was tagged with the corresponding act. The other words were tagged with O.
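The tag-generation procedure described above can be sketched as follows. This is a simplified illustration under several assumptions: whitespace tokenisation, exact lower-cased matching of slot values, and B-/I- prefixes for both slot and act tags. The function names, the example utterance, and the slot names `or_city` and `dst_city` are chosen for illustration only.

```python
def find_span(words, value_words):
    """Return (start, end) word indices of the first exact match, or None."""
    n = len(value_words)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == value_words:
            return start, start + n
    return None


def tag_utterance(text, acts):
    """Tag each word with a slot tag and an act tag.

    acts: list of (act_name, {slot: value}) pairs from the manual annotation.
    """
    words = text.lower().split()
    slot_tags = ["O"] * len(words)
    act_tags = ["O"] * len(words)
    for act_name, args in acts:
        spans = []
        for slot, value in args.items():
            span = find_span(words, str(value).lower().split())
            if span is None:
                continue  # value not found verbatim in the text: no tag
            start, end = span
            slot_tags[start] = "B-" + slot
            for i in range(start + 1, end):
                slot_tags[i] = "I-" + slot
            spans.append(span)
        if spans:
            # The act covers every word from the first matched value to the last.
            first = min(s for s, _ in spans)
            last = max(e for _, e in spans)
            act_tags[first] = "B-" + act_name
            for i in range(first + 1, last):
                act_tags[i] = "I-" + act_name
    return words, slot_tags, act_tags


words, slot_tags, act_tags = tag_utterance(
    "i want to go from montreal to rio",
    [("inform", {"or_city": "montreal", "dst_city": "rio"})],
)
for word, slot_tag, act_tag in zip(words, slot_tags, act_tags):
    print(word, slot_tag, act_tag)
```

In this example the word "to" between the two city names receives an act tag but no slot tag, which is the behaviour the text describes for words lying between the first and last slot values.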
The NLU model is illustrated in Fig. 4. The IOB tagging part operates on character trigrams and is based on the robust named entity recognition model (Arnold et al., 2016). We predict, for each word of the utterance, a pair of tags: one for the act and one for the slot. The model splits into two parts: one part is trained to predict dialogue acts and the other part is trained to predict slot types (at this stage, we predict either a slot type or an O tag). These two parts share an embedding matrix for the input character trigrams. Note that the model only predicts IOB tags for slots whose values can be found in the text. Therefore, the prediction for slots such as intent or vicinities and amenities is not evaluated for this simple baseline.
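As a rough illustration of the shared input representation, the sketch below extracts character trigrams from a word and sums their embeddings before applying a non-linearity, as the Figure 4 caption describes. Several details are assumptions made for the example: the "#" boundary padding, the lazily created random vectors (standing in for a trained embedding matrix), the dimension, and the choice of tanh. The class name `TrigramEmbedder` is hypothetical.

```python
import math
import random


def char_trigrams(word, pad="#"):
    """Split a word into overlapping character trigrams, with boundary padding."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramEmbedder:
    """Word embedding as a non-linearity over the sum of trigram vectors."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}  # trigram -> vector; random here, learned in a real model

    def _vector(self, trigram):
        if trigram not in self.table:
            self.table[trigram] = [
                self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)
            ]
        return self.table[trigram]

    def embed(self, word):
        total = [0.0] * self.dim
        for trigram in char_trigrams(word.lower()):
            for i, x in enumerate(self._vector(trigram)):
                total[i] += x
        # tanh is an assumed choice of non-linearity.
        return [math.tanh(x) for x in total]


embedder = TrigramEmbedder()
print(char_trigrams("rio"))
print(len(embedder.embed("Montreal")))
```

In a two-headed tagger, the same `embed` output would feed both the act predictor and the slot predictor, so that the trigram table is shared between the two parts.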
The two parts of the model are trained simultaneously, using a modified categorical cross-entropy loss for each set of outputs. We modify the loss to ignore O labels that are already predicted correctly by the model. We introduce this modification because O labels are far more frequent than other labels, and not limiting their contribution to the loss causes the model to get stuck in a mode where it predicts O labels for every word. The losses for the two parts of the model are added together, and the combined objective is optimized using the Adam optimizer (Kingma and Ba, 2014).

Figure 4: Illustration of the NLU model for slots and acts prediction, taking input words and outputting labels for slots and acts. The model splits into slots-specific and acts-specific predictors after the word embedding layer, which computes a non-linearity on top of the per-word sum of character trigram embeddings.

We provide F1 scores for acts and slots for this model in Table 5. We report the average and standard deviation over ten leave-one-user-out splits of the Frames dataset. We had a total of 11 participants in the user role during data collection. Two participants performed significantly fewer dialogues than the others. We merged the dialogues generated by these two participants (ids U21E41CQP and U23KPC9QV). For each of the resulting 10 users, we randomly split the combined dialogues of the nine others into training (80%) and validation (20%), and then tested on the dialogues from the held-out user.
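The loss modification described earlier in this section can be sketched in plain Python as follows. This is one possible reading rather than the exact training code: words whose gold label is O and whose most probable predicted label is also O are skipped, and the remaining per-word losses are averaged. The averaging, the clipping constant, and the function names are assumptions.

```python
import math


def masked_cross_entropy(probs, gold, o_index=0):
    """Cross-entropy that skips gold-O words already predicted as O.

    probs: list of per-word probability lists (one probability per label).
    gold:  list of gold label indices, one per word.
    """
    total, kept = 0.0, 0
    for p, y in zip(probs, gold):
        predicted = max(range(len(p)), key=p.__getitem__)
        if y == o_index and predicted == o_index:
            continue  # correctly predicted O: contributes nothing to the loss
        total += -math.log(max(p[y], 1e-12))  # clip to avoid log(0)
        kept += 1
    return total / kept if kept else 0.0


def combined_loss(slot_probs, slot_gold, act_probs, act_gold):
    """Sum of the two heads' losses, as described in the text."""
    return (masked_cross_entropy(slot_probs, slot_gold)
            + masked_cross_entropy(act_probs, act_gold))


# Label 0 is O. The first word is a correctly predicted O and is ignored.
probs = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
gold = [0, 1, 1]
loss = masked_cross_entropy(probs, gold)
print(loss)
```

Note that a gold O word that the model mislabels still contributes to the loss, so the mask only removes the easy, already-correct O predictions.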

FRAME TRACKING
The rule-based frame tracker takes as input the acts_without_refs annotation and the values set in the existing frames. We write f[k] to denote the value of slot k in frame f. According to hand-designed rules, the frame tracker predicts the ref tags (for frame identification, see Section 6.1.2) and frame creations. For an act a(k=v) in frame f, the following rules are used:

We compare this baseline to random performance. For random performance, for each (dialogue act, slot type) combination, we computed priors on the corpus for how often the user referred to the current frame versus a previous one. We sampled whether each slot was referring to the current frame or another one based on that prior, and if it referred to another frame, the frame number for that other frame was sampled uniformly from the list of frames created so far. Table 6 presents results for these baselines. We report results over 10 runs following the same method as for the NLU model. Table 6 shows that the rule-based baseline performs only slightly better than random on frame identification and performs similarly on frame creation. In general, these results suggest that simple rules are far from adequate for frame tracking and that the task requires more in-depth analysis of the user text.
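The random baseline described above can be sketched as follows. The function names, the input format, and the example act and slot names are assumptions made for illustration. Excluding the current frame when sampling "another" frame, and falling back to the current frame when no other frame exists or when a combination was never observed, are also assumptions.

```python
import random
from collections import defaultdict


def estimate_priors(examples):
    """Estimate P(current frame | act, slot) from annotated references.

    examples: iterable of (act, slot, refers_to_current) triples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_current, n_total]
    for act, slot, is_current in examples:
        counts[(act, slot)][0] += int(is_current)
        counts[(act, slot)][1] += 1
    return {key: cur / tot for key, (cur, tot) in counts.items()}


def sample_frame(act, slot, current_frame, frames_so_far, priors, rng,
                 default=1.0):
    """Sample the frame a slot refers to, following the corpus prior."""
    p_current = priors.get((act, slot), default)
    others = [f for f in frames_so_far if f != current_frame]
    if not others or rng.random() < p_current:
        return current_frame
    return rng.choice(others)  # uniform over the other frames created so far


priors = estimate_priors([
    ("inform", "price", True),
    ("inform", "price", True),
    ("inform", "price", False),
    ("request", "price", False),
])
rng = random.Random(0)
print(priors)
print(sample_frame("inform", "price", 3, [1, 2, 3], priors, rng))
```

Because the priors are estimated per (dialogue act, slot type) combination, this baseline captures how often each kind of act points away from the active frame, but nothing about which frame it points to.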

CONCLUSION AND FUTURE WORK
In this paper, we introduced the Frames dataset: a corpus of human-human dialogues for researching the role of memory in goal-oriented dialogue systems. We formalized the frame tracking task, which extends the state tracking task to a setting where several semantic frames are simultaneously tracked throughout the dialogue. We proposed a baseline for this task and we showed that there is a lot of room for improvement. Finally, we showed that Frames can be used to research other interesting aspects of dialogue such as the use of memory for dialogue management and information presentation through natural language generation. We propose adding memory as a first milestone towards goal-oriented dialogue systems that support more complex dialogue flows. Future work will consist of proposing models for frame tracking as well as proposing a methodology to scale up data collection and annotation.
A DATABASE OVERVIEW

• Words used to refer to a frame, e.g., "the second package".
• impl_anaphora: used when a slot type is not specifically mentioned, e.g., "What is the price for Rio?" ... "And for Cleveland?"
• ref: id of the frame that the speaker is referring to.
• read: reads slot values specified in another frame and writes them in the current frame.
• write: writes slot values in a given frame.
• intent: user intent (e.g., book).
• action: wizard action (e.g., book).