Bootstrapping incremental dialogue systems from minimal data: the generalisation power of dialogue grammars

We investigate an end-to-end method for automatically inducing task-based dialogue systems from small amounts of unannotated dialogue data. It combines an incremental semantic grammar - Dynamic Syntax and Type Theory with Records (DS-TTR) - with Reinforcement Learning (RL), where language generation and dialogue management are a joint decision problem. The systems thus produced are incremental: dialogues are processed word-by-word, shown previously to be essential in supporting natural, spontaneous dialogue. We hypothesised that the rich linguistic knowledge within the grammar should enable a combinatorially large number of dialogue variations to be processed, even when trained on very few dialogues. Our experiments show that our model can process 74% of the Facebook AI bAbI dataset even when trained on only 0.13% of the data (5 dialogues). It can in addition process 65% of bAbI+, a corpus we created by systematically adding incremental dialogue phenomena such as restarts and self-corrections to bAbI. We compare our model with a state-of-the-art retrieval model, MEMN2N. We find that, in terms of semantic accuracy, the MEMN2N model shows very poor robustness to the bAbI+ transformations even when trained on the full bAbI dataset.


Introduction
There are currently several key problems for the practical data-driven (rather than hand-crafted) development of task-oriented dialogue systems, among them: (1) large amounts of dialogue data are needed, i.e. thousands of examples in a domain; (2) this data is usually required to be annotated with task-specific semantic/pragmatic information for the domain (e.g. various dialogue act schemes); and (3) the resulting systems are generally turn-based, and so do not support natural spontaneous dialogue, which is processed incrementally, word-by-word, with many characteristic phenomena that arise from this incrementality.
Footnote 1: Dataset available at https://bit.ly/babi_plus
In overcoming issue (2), a recent advance made in research on (non-task) chat dialogues has been the development of so-called "end-to-end" systems, in which all components are trained from textual dialogue examples, e.g. (Sordoni et al., 2015; Vinyals and Le, 2015). However, as Bordes and Weston (2017) argued, these end-to-end methods may not transfer well to task-based settings (where the user is trying to achieve a domain goal, such as booking a flight or finding a restaurant, resulting in an API call). Bordes and Weston (2017) then presented an end-to-end method using Memory Networks (memn2ns), which achieves 100% performance on a test set of 1000 dialogues, after being trained on 1000 dialogues. This method processes dialogues turn-by-turn, and so does not have the advantages of more natural incremental systems (Aist et al., 2007; Skantze and Hjalmarsson, 2010); nor does it really perform language generation: rather, it is based on a retrieval model that selects from a set of candidate system responses seen in the data.
This paper investigates an approach to these challenges, dubbed babble, using an incremental semantic parser and generator for dialogue (Eshghi, 2015), based around the Dynamic Syntax grammar formalism (DS; Kempson et al., 2001; Cann et al., 2005).
Our advance in this paper, for end-to-end systems, is therefore twofold: (a) the babble method overcomes the requirement for large amounts of dialogue data (i.e. 1000s of dialogues in a domain); (b) the resulting systems are word-by-word incremental in parsing, generation and dialogue management. We show that using only 5 example dialogues from the bAbI Task 1 dataset (i.e. 0.13% of the training data used by Bordes et al. (2017)), babble can automatically induce dialogue systems which process 74% of the bAbI test set in an incremental manner. We then introduce an extended incremental version of the bAbI dataset, which we call bAbI+ (see section 4.1), which adds some characteristic incremental phenomena, such as mid-utterance self-corrections, to the bAbI dialogues (this new dataset is freely available). Using this, we demonstrate that the babble system can in addition generalise to, and process 65% of, the bAbI+ dataset, still when trained only on 5 dialogues from bAbI. We compare this method to the memn2n of Bordes et al. (2017), which, in terms of semantic accuracy (reflected in how well api-calls are predicted at the end of bAbI Task 1), shows very poor robustness to the bAbI+ transformations, even when it is trained on the full bAbI dataset.
This overall method is portable to other task-based domains. Furthermore, as we use a semantic parser, the semantic/contextual representations of the dialogue can be used directly for large-scale inference, required in more complex tasks (e.g. interactive QA and search).

Dimensions of Pragmatic Synonymy
There are two important dimensions along which dialogues can vary but nevertheless lead to identical contexts: interactional and lexical. Interactional synonymy is analogous to syntactic synonymy (when two distinct sentences are parsed to identical logical forms), except that it occurs not only at the level of a single sentence, but at the dialogue or discourse level. Fig. 1 shows examples of interactional variants that lead to very similar final contexts, in this case, that the user wants to buy an LG phone. These dialogues can be said to be pragmatically synonymous for this domain. Arguably, a good computational model of dialogue processing and interactional dynamics should be able to capture this synonymy.
Lexical synonymy relations, on the other hand, hold among utterances, or dialogues, when different words (or sequences of words) express meanings that are sufficiently similar in a particular domain. What is striking about lexical synonymy relations is that, unlike syntactic/interactional ones, they can often break down when one moves to another domain: lexical synonymy relations are domain-specific.
Eshghi & Lemon (2014) developed a method similar in spirit to Kwiatkowski et al. (2013) for capturing lexical synonymy relations by creating clusters of semantic representations based on observations that they give rise to similar or identical extra-linguistic actions observed within a domain (e.g. a data-base query, a flight booking, or any API call). Distributional methods could also be used for this purpose (see e.g. Lewis & Steedman (2013)). In general, this kind of clustering is achieved when the domain-general semantics resulting from semantic parsing is grounded in a particular domain.
We note that while interactional synonymy relations in dialogue should be accounted for by semantic grammars or formal models of dialogue structure (such as DS-TTR (Eshghi et al., 2012), or KoS (Ginzburg, 2012)), lexical synonymy relations have to be learned from data.

Why a grammar-based approach?
Recent end-to-end data-driven machine learning approaches treat dialogue as a sequence-to-sequence generation problem, and train their models from large datasets, e.g. (Sordoni et al., 2015; Wen et al., 2016b,a; Vinyals and Le, 2015). The systems resulting from these types of approach are in principle able to handle variations/patterns that they have encountered (sufficiently often) in the training data, but not beyond.
This large-data constraint is problematic for developers, but is also strange when we consider the structural knowledge that we have about language and dialogue that can be encoded in grammars and computational models of interaction. Indeed, it is often stated that for humans to learn how to perform adequately in a domain, one example is enough (e.g. Li et al. (2006)).
Furthermore, as these systems do not parse to logical forms, they do not allow for explicit inference, which further limits their application.
We therefore develop a method combining learning from data with an incremental semantic grammar of dialogue that is able to generalise from a small number of observations in a domain.

Inducing Dialogue Systems
Our overall method involves incrementally parsing dialogues, and encoding the resulting semantics as state vectors in a Markov Decision Process (MDP), which is then used for Reinforcement Learning (RL) of word-level actions for system output (i.e. a combined incremental DM and NLG module for the resulting dialogue system).

Dynamic Syntax and Type Theory with Records (DS-TTR)
Dynamic Syntax (DS) is an action-based, word-by-word incremental and semantic grammar formalism (Kempson et al., 2001; Cann et al., 2005), especially suited to the highly fragmentary and context-dependent nature of dialogue. In DS, words are conditional actions (semantic updates), and dialogue is modelled as the interactive and incremental construction of contextual and semantic representations (see Fig. 2). The contextual representations afforded by DS are of the fine-grained semantic content that is jointly negotiated/agreed upon by the interlocutors, as a result of processing questions and answers, clarification interaction, acceptances, self-/other-corrections, restarts, and other characteristic incremental phenomena in dialogue. See Fig. 3 for a sketch of how self-corrections and restarts are processed via a backtrack and search mechanism over the parse search graph (see Hough (2011) for details of the model, and how this parse search graph is effectively the context of the conversation). Generation/linearisation in DS is defined using trial-and-error parsing (see Section 3.2), with the provision of a generation goal, viz. the semantics of the utterance to be generated. Generation thus proceeds, just as with parsing, on a word-by-word basis (see Hough (2015) for details). The upshot of this is that using DS, we can not only track the semantic content of some current turn as it is being constructed (parsed or generated) word-by-word, but also the context of the conversation as a whole, with the latter also encoding the grounded/agreed content of the conversation (see e.g. Fig. 4, and Purver et al. (2010) for details). Crucially for our model below, the inherent incrementality of DS-TTR, together with the word-level as well as cross-turn parsing constraints it provides, enables the word-by-word exploration of the space of grammatical dialogues, and the semantic and contextual representations that result from them.
Type Theory with Records (TTR) is an extension of standard type theory shown to be useful in semantics and dialogue modelling (Cooper, 2005; Ginzburg, 2012). To accommodate dialogue processing, and allow for richer representations of the dialogue context, recent work has integrated DS and the TTR framework to replace the logical formalism in which meanings are expressed (Purver et al., 2010; Eshghi et al., 2012). In TTR, logical forms are specified as record types (RTs), sequences of fields of the form [ l : T ] containing a label l and a type T. RTs can be witnessed (i.e. judged as true) by records of that type, where a record is a sequence of label-value pairs (see Fig. 2 for example record types). Importantly for us here, the standard subtype relation ⊑ can be defined for record types: R_1 ⊑ R_2 if for all fields [ l : T_2 ] in R_2, R_1 contains [ l : T_1 ] where T_1 ⊑ T_2. A record type can thus be indefinitely extended, and is therefore always underspecified by definition. This allows for incrementally growing meanings to be expressed in a natural way as more words are parsed or generated. In addition, as will become clear below, this subtype checking operation is the key mechanism used in our system for feature checking.
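The subtype check on record types can be illustrated with a minimal Python sketch. It assumes a deliberately simplified encoding that is not the DS-TTR implementation: a record type is a dict from labels to either an atomic type name or a nested record type, and atomic types are subtypes only of themselves. The function name `is_subtype` and the example labels are hypothetical.

```python
from typing import Dict, Union

# Simplified record type: label -> atomic type name, or a nested record type.
# Assumption: manifest fields and dependent types of full TTR are omitted.
RecordType = Dict[str, Union[str, "RecordType"]]


def is_subtype(t1, t2) -> bool:
    """Return True if t1 is a subtype of t2 under the simplified encoding."""
    if isinstance(t2, dict):
        if not isinstance(t1, dict):
            return False
        # Every field required by t2 must be present in t1 with a subtype.
        return all(
            label in t1 and is_subtype(t1[label], field_type)
            for label, field_type in t2.items()
        )
    # Atomic types: only identical names are related (assumption).
    return t1 == t2


# A record type grows as words are parsed; the extended one is the subtype.
after_one_word = {"x": "e"}
after_two_words = {"x": "e", "e1": "es", "p": "t"}

print(is_subtype(after_two_words, after_one_word))  # True
print(is_subtype(after_one_word, after_two_words))  # False
```

Under this encoding, extending a record type with further fields always yields a subtype of the original, which is the property that makes the relation suitable for tracking incrementally growing meanings.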

Overall Method: babble
In this section we describe our method for combining incremental dialogue parsing with Reinforcement Learning for Dialogue Management (DM) and Natural Language Generation (NLG), where these are treated as a joint decision/optimisation problem. We start with two resources: a) a DS-TTR parser DS (either learned from data (Eshghi et al., 2013a), or constructed by hand), for incremental language processing, but also, more generally, for tracking the context of the dialogue using Eshghi et al.'s model of feedback (Eshghi, 2015); b) a set D of transcribed successful dialogues in the target domain.
We perform the following steps overall to induce a fully incremental dialogue system from D:
1. Automatically induce the MDP state space, S, and the dialogue goal, G_D, from D;
2. Automatically define the state encoding function F : C → S, where s ∈ S is a (binary) state vector, designed to extract from the current context of the dialogue the semantic features observed in the example dialogues D; and c ∈ C is a DS context, viz. a pair of TTR Record Types ⟨c_p, c_g⟩, where c_p is the content of the current, PENDING clause as it is being constructed, but not necessarily fully grounded yet, and c_g is the content already jointly built and GROUNDED by the interlocutors (loosely following the DGB model of Ginzburg (2012));
3. Define the MDP action set as the DS lexicon L (i.e. actions are words);
4. Define the reward function R as reaching G_D, while minimising dialogue length.
We then solve the generated MDP using Reinforcement Learning, with a standard Q-learning method, implemented using BURLAP (McGlashan, 2016): we train a policy π : S → L, where L is the DS Lexicon, and S the state space induced using F. The system is trained in interaction with a (semantic) simulated user, also automatically built from the dialogue data and described in the next section.
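For readers unfamiliar with Q-learning, the standard tabular update can be sketched in a few lines of Python. This is an illustration of the textbook rule only: it does not use or mirror the BURLAP API, and the function name `q_update`, the state tuples, the toy lexicon and the reward value are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, state, word, reward, next_state, lexicon, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step; actions are words from the lexicon."""
    # Value of the best word available in the next state (0.0 if none).
    best_next = max((q[(next_state, w)] for w in lexicon), default=0.0)
    target = reward + gamma * best_next
    q[(state, word)] += alpha * (target - q[(state, word)])
    return q[(state, word)]


q = defaultdict(float)  # unseen (state, word) pairs start at 0.0
lexicon = ["what", "cuisine", "price"]

# Hypothetical transition: emitting "cuisine" in state (0, 0) earns reward 1.0.
print(q_update(q, (0, 0), "cuisine", 1.0, (0, 1), lexicon, alpha=0.5))  # 0.5
```

A greedy policy then picks, in each state, the word with the highest Q-value; a reward that is granted on reaching the goal and discounted over steps favours shorter dialogues.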
The state encoding function, F
As shown in Fig. 4, the MDP state is a binary vector of size 2 × |Φ|, i.e. twice the number of the RT features. The first half of the state vector contains the grounded features (i.e. agreed by the participants) ϕ_i, while the second half contains the current semantics being incrementally built in the current dialogue utterance. Each bit is set by the subtype check introduced above: feature i is on when the relevant record type is a subtype of ϕ_i.
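A minimal Python sketch of such an encoding follows. It assumes flat record types represented as dicts and a simplified subtype check (the context contains every field of the feature); the names `has_feature` and `encode_state` and the restaurant features are hypothetical, not taken from the system itself.

```python
from typing import Dict, List

RecordType = Dict[str, str]  # simplified, flat record type: label -> type name


def has_feature(context: RecordType, feature: RecordType) -> bool:
    """Simplified subtype check: context contains every field of the feature."""
    return all(context.get(label) == typ for label, typ in feature.items())


def encode_state(grounded: RecordType, pending: RecordType,
                 features: List[RecordType]) -> List[int]:
    """Binary state of size 2 * len(features): grounded half, then pending half."""
    grounded_bits = [int(has_feature(grounded, f)) for f in features]
    pending_bits = [int(has_feature(pending, f)) for f in features]
    return grounded_bits + pending_bits


# Hypothetical features, as if extracted from example dialogues.
features = [{"cuisine": "french"}, {"price": "cheap"}, {"party": "four"}]
grounded = {"cuisine": "french"}
pending = {"price": "cheap", "party": "four"}

print(encode_state(grounded, pending, features))  # [1, 0, 0, 0, 1, 1]
```

Because the vector only records which observed features are present, two dialogues that differ in wording or turn structure but yield the same grounded and pending content map to the same state.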

Semantic User Simulation
The simulator is in charge of two key tasks during training: (1) generating user turns in the right dialogue contexts; and (2) word-by-word monitoring of the utterance so far generated by the system.
[Figure 4: semantics-to-state encoding, with panels "Grounded Semantics", "Current Turn Semantics" and "Dialogue so far"]
Both (1) and (2) use the full machinery of the DS parser, as well as the state encoding function F, described above. They are thus performed based on the semantic context of the dialogue so far, as tracked by DS (rather than, e.g., being based on string or template matching). Since this includes not just the semantic features of the current turn, but also of the history of the conversation, our simulator respects the turn orderings encountered in the data, i.e. it is sensitive to the order in which information is gathered from the user. The rules required for (1) & (2) are extracted automatically from the raw dialogue data, D, using DS and F. The dialogues in D are parsed and encoded using F incrementally. For (1), all the states that trigger the user into action, s_i = F(c), where c is a DS context, immediately prior to any user turn are recorded, and mapped to what the user ends up saying in those contexts; for more than one training dialogue there may be more than one candidate (in the same context/state). The rules thus extracted will be of the form: s_trig → {u_1, ..., u_n}, where the u_i are user turns. Now note that the s_i prior to the user turns also immediately follow system turns. Thus, to perform (2), i.e. to monitor the system's behaviour during training, we only need to check further that the current state resulting from processing a word generated by the system subsumes (is extendible to) one of the s_i. We perform this through a simple bitmask operation (recall that the states are binary). The simulation can thus semantically identify erroneous/out-of-domain actions (words) by the system. It then terminates the learning episode and penalises the system immediately, which speeds up training significantly.
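The trigger rules and the bitmask check can be sketched as follows, assuming each binary state is packed into a Python int and that "extendible to" means that every bit set in the current state is also set in some trigger state. The rule table, state values and user turns are made up for illustration.

```python
import random
from typing import Dict, List, Optional

# Trigger rules: state (bits packed into an int) -> user turns seen in that state.
rules: Dict[int, List[str]] = {
    0b0001: ["i would like french food"],
    0b0011: ["for four people please", "we will be four"],
}


def is_extendible(current: int, rules: Dict[int, List[str]]) -> bool:
    """True if every bit set in `current` is also set in some trigger state."""
    return any((current & trigger) == current for trigger in rules)


def user_turn(current: int, rules: Dict[int, List[str]],
              rng: random.Random) -> Optional[str]:
    """Return a user turn if `current` is exactly a trigger state, else None."""
    candidates = rules.get(current)
    return rng.choice(candidates) if candidates else None


rng = random.Random(0)
print(is_extendible(0b0010, rules))  # True: can still grow into 0b0011
print(is_extendible(0b0100, rules))  # False: the episode would be terminated
print(user_turn(0b0001, rules, rng))  # i would like french food
```

The monitoring check costs one bitwise AND per trigger state, so it can run after every word the system generates.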

Evaluation
We have so far induced two prototype dialogue systems, one in an 'electronics shopping' domain (see Kalatzis et al. (2016) and Fig. 1) and another in a 'restaurant-search' domain, showing that fully incremental dialogue systems can be automatically induced from small amounts of unannotated dialogue transcripts (Kalatzis et al., 2016); in this case both systems were bootstrapped from a single successful example dialogue. We are in the process of evaluating these systems with real users. In this paper, however, our focus is not on building dialogue systems per se, but on: (1) studying and quantifying the interactional and structural generalisation power of the DS-TTR grammar formalism (see Section 2), and that of symbolic, grammar-based approaches to language processing more generally, focusing on specific dialogue phenomena such as mid-sentence self-corrections, hesitations, and restarts (see below); (2) doing the same for Bordes and Weston's (2017) state-of-the-art, bottom-up response retrieval model, which uses no linguistic knowledge of any form; and (3) comparing (1) and (2).
In order to test and quantify the interactional and structural generalisation power/robustness of the two models, babble and memn2n, we need contrasting dialogue data-sets that control for interactional vs. lexical variations in the input dialogues. Furthermore, to make our results comparable to the existing approach of Bordes and Weston (2017), we need to use the same dataset that they have used. We therefore use Facebook AI Research's bAbI dialogue tasks dataset (Bordes et al., 2017). These are goal-oriented dialogues in the domain of restaurant search. Here we tackle Task 1, where in each dialogue the system asks the user about their preferences for the properties of a restaurant, and each dialogue results in an API call which contains values of each slot obtained. Other than the explicit API call notation, there are no annotations in the data whatsoever.

The bAbI+ dataset
While containing some lexical variation, the original bAbI dialogues significantly lack interactional variation vital for natural real-life dialogue. In order to obtain such variation while holding lexical variation constant, we created the bAbI+ dataset by systematically transforming the bAbI dialogues. bAbI+ is an extension of the original bAbI Task 1 dialogues with everyday incremental dialogue phenomena (hesitations, restarts, and corrections -see below). While the original bAbI tasks 2-7 increase the user's goal complexity, modifications introduced in bAbI+ can be thought of as orthogonal to this: we instead increase the complexity of surface forms of dialogue utterances, while keeping every other aspect of the task fixed.
The variations introduced in bAbI+ are: 1. Hesitations, e.g. as in "we will be uhm eight"; 2. Restarts, e.g. "can you make a restaurant uhm yeah can you make a restaurant reservation for four people with french cuisine in a moderate price range"; and 3. Corrections affecting task-specific information, both short-distance ones correcting one token, e.g. "with french oh no spanish food", and long-distance NP/PP-level corrections, e.g. "with french food uhm sorry with spanish food".
The phenomena above are mixed in probabilistically from the fixed sets of templates to the original data. The modifications affect a total of 11336 utterances in the 3998 dialogues. Around 21% of user turns contain corrections, 40% hesitations, and 5% restarts (they are not mutually exclusive, so that an utterance can contain up to 3 modifications). Our modifications, with respect to corrections in particular, are more conservative than those observed in real-world data: Hough (2015) reports that self-corrections appear in 20% of all turns of natural conversations from the British National Corpus, and in 40% of turns in the Map Task, a corpus of human-human goal-oriented dialogues. Here is part of an example dialogue in the bAbI+ corpus:
sys: hello what can I help you with today?
usr: I'd like to book a uhm yeah I'd like to book a table in a expensive price range
sys: I'm on it. Any preference on a type of cuisine?
usr: with indian food no sorry with spanish food please
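The probabilistic mixing-in of such phenomena can be sketched in Python as below. This is an illustration only, not the script used to build bAbI+: the three templates are modelled on the examples quoted above, the slot vocabulary is a toy one, and the default probabilities simply reuse the reported proportions of affected turns as an assumption.

```python
import random

SLOT_VALUES = {"french", "spanish", "indian", "italian"}  # toy cuisine slot


def add_hesitation(tokens, rng):
    """Insert "uhm" at a random position inside the utterance."""
    pos = rng.randrange(1, len(tokens)) if len(tokens) > 1 else 0
    return tokens[:pos] + ["uhm"] + tokens[pos:]


def add_restart(tokens, rng):
    """Repeat an initial fragment: "<prefix> uhm yeah <whole utterance>"."""
    cut = rng.randrange(1, len(tokens)) if len(tokens) > 1 else 1
    return tokens[:cut] + ["uhm", "yeah"] + tokens


def add_correction(tokens, rng):
    """Short-distance correction: "<wrong value> oh no <right value>"."""
    for i, tok in enumerate(tokens):
        if tok in SLOT_VALUES:
            wrong = rng.choice(sorted(SLOT_VALUES - {tok}))
            return tokens[:i] + [wrong, "oh", "no"] + tokens[i:]
    return tokens  # no slot value to correct


def transform(utterance, rng, p_hes=0.4, p_cor=0.21, p_res=0.05):
    """Apply each modification independently with its own probability."""
    tokens = utterance.split()
    if rng.random() < p_cor:
        tokens = add_correction(tokens, rng)
    if rng.random() < p_hes:
        tokens = add_hesitation(tokens, rng)
    if rng.random() < p_res:
        tokens = add_restart(tokens, rng)
    return " ".join(tokens)


rng = random.Random(0)
print(transform("with spanish food please", rng))
```

Because the three modifications are sampled independently, a single utterance can carry up to three of them, and the task-relevant content after any correction is unchanged.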

Memory Network setup
In all the experiments we describe below, we follow Bordes and Weston's setup by using a memn2n (we took an open source TensorFlow implementation for bAbI QA tasks and modified it according to their setup; see details below). In order to adapt the data for the memn2n, we transform the dialogues into <story, question, answer> triplets. The number of triplets for a single dialogue is equal to the number of the system's turns, and in each triplet, the answer is the current system's turn, the question is the user's turn preceding it, and the story is a list of all the previous turns from both sides. The memn2n hyperparameters are set as follows: 1 hop, and 128 as the size of embeddings; we train it for 100 epochs with a learning rate of 0.01 and a batch size of 8. In this we follow the best bAbI Task 1 setup reported by Bordes et al. (2017).
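The conversion into triplets can be sketched as follows. The representation of a dialogue as a list of (speaker, utterance) pairs and the function name `to_triplets` are assumptions for the example; a system turn with no preceding user turn is given an empty question here, which is a choice of this sketch.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance); speaker is "usr" or "sys"
Triplet = Tuple[List[str], str, str]  # (story, question, answer)


def to_triplets(dialogue: List[Turn]) -> List[Triplet]:
    """One triplet per system turn: earlier turns, preceding user turn, system turn."""
    triplets = []
    for i, (speaker, utterance) in enumerate(dialogue):
        if speaker != "sys":
            continue
        # Question: the user turn right before this system turn, if any.
        has_question = i > 0 and dialogue[i - 1][0] == "usr"
        question = dialogue[i - 1][1] if has_question else ""
        # Story: every turn before the question, from both sides.
        story_end = i - 1 if has_question else i
        story = [utt for _, utt in dialogue[:story_end]]
        triplets.append((story, question, utterance))
    return triplets


dialogue = [
    ("usr", "hi"),
    ("sys", "hello what can i help you with today"),
    ("usr", "i'd like to book a table in a expensive price range"),
    ("sys", "i'm on it"),
]
for story, question, answer in to_triplets(dialogue):
    print(len(story), "|", question, "|", answer)
```

Each system turn thus becomes one supervised example, so a dialogue with n system turns contributes n triplets whose stories grow with the dialogue history.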

Testing the DS-TTR parser
Dynamic Syntax (DS) lexicons are learnable from data (Eshghi et al., 2013a,b). But since the lexicon was induced from a corpus of child-directed utterances in this prior work, there were some constructions as well as individual words that it did not include. One of the authors therefore extended this induced grammar manually to cover the bAbI dataset, which, despite not being very diverse, contains a wide range of complex grammatical constructions, such as long sequences of prepositional phrases, adjuncts, short answers to yes/no and wh-questions, appositions of NPs, causative verbs etc.
We parsed all dialogues in the bAbI train and test sets, as well as the bAbI+ corpus, word-by-word, including both user and system utterances, in context. The grammar parses 100% of the dialogues, i.e. it does not fail on any word in any of the dialogues. We assess the semantic accuracy of the parser on bAbI & bAbI+ using the dialogue-final api-calls in section 4.5 below.

Experiment 1: Generalisation from small data
We have now set out all we need to perform the first experiment. Our aim here is to assess the generalisation power that results from the grammar and our state encoding method (section 3.1), which together we dub babble, and compare this to the state-of-the-art results of Bordes et al. (2017). The method in Bordes et al. (2017) is not generative; rather, it is based on retrieval of system responses, based on the history of the dialogue up to that point. Therefore, for direct comparison, and for simplicity of exposition, we do the same here: we apply the method described for creating a user simulation (section 3.2.1), this time for the system side, resulting in a 'system simulation'. We then use this to predict a system response, by parsing and encoding the containing test dialogue up to the point immediately prior to the system turn. This results in a triggering state, s_trig, which is then used as the key to look up the system's response from the rules constructed as per section 3.2.1. The returned response is then parsed word-by-word as normal, and this same process continues for the rest of the dialogue. This method uses the full machinery of DS-TTR and our state-encoding method (the babble model) and will thus reflect the generalisation properties that we are interested in.
Cross-Validation: Since we are here interested in data efficiency and generalisation, we use all the bAbI and bAbI+ data (the train, dev, and test sets) as follows: we train Bordes & Weston's memn2n and babble from 1-5 examples selected at random from the longest dialogues in bAbI; note that bAbI+ data is never used for training in these experiments. This process is repeated across 10 folds.
The models are then tested on sets of 1000 examples selected at random, in each fold. Both the training and test sets constructed in this way are kept constant in each fold across the babble & memn2n models. The test sets are selected either exclusively from bAbI or exclusively from bAbI+. Table 1 shows per-utterance accuracies for the babble & memn2n models. Per-utterance accuracy is the percentage of all system turns in the test dialogues that were correctly predicted. The table shows that babble can generalise to a remarkable 74% of bAbI and 65% of bAbI+ with only 5 input dialogues from bAbI. It also shows that memn2ns can generalise remarkably well, although, as discussed below, this result is misleading on its own: the memn2ns are very poor at generating the final api-calls correctly on both the bAbI & bAbI+ data, and are thus making too many semantic mistakes.
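The per-utterance accuracy metric can be computed as in the following Python sketch. It assumes that a prediction counts as correct only on an exact string match with the reference system turn; the function name and the toy dialogues are hypothetical.

```python
from typing import List


def per_utterance_accuracy(predicted: List[List[str]],
                           gold: List[List[str]]) -> float:
    """Percentage of system turns, over all dialogues, predicted exactly."""
    total = sum(len(turns) for turns in gold)
    correct = sum(
        p == g
        for pred_turns, gold_turns in zip(predicted, gold)
        for p, g in zip(pred_turns, gold_turns)
    )
    return 100.0 * correct / total if total else 0.0


gold = [["hello", "any preference on cuisine?"], ["hello", "how many people?"]]
pred = [["hello", "any preference on cuisine?"], ["hello", "which price range?"]]
print(per_utterance_accuracy(pred, gold))  # 75.0
```

Note that this metric weights every system turn equally, so a model can score highly on it while still getting the single dialogue-final api-call wrong.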

Experiment 2: Semantic Accuracy
The results from Experiment 1 on their own can be misleading, as correct prediction of system responses does not in general tell us enough about how well the models are interpreting the dialogues, or whether they are doing this with a sufficient level of granularity. To assess this, in this second experiment, we measure the semantic accuracy of each model by looking exclusively at how accurately they predict the final api-calls in the bAbI & bAbI+ datasets. For the memn2n model, we follow the same overall procedure as in the previous experiment: train on bAbI data, and test on bAbI+.

Results: Prediction of api-calls
BABBLE: Mere successful parsing of all the dialogues in the bAbI and bAbI+ datasets, as shown above, doesn't mean that the semantic representations compiled for the dialogues were in fact correct. To measure the semantic accuracy of the DS-TTR parser we programmatically checked that the correct slot values (those in the api-call annotations) were in fact present in the semantic representations produced by the parser for each dialogue (see Fig. 2 for example semantic representations). We further checked that there is no other incorrect slot value present in these representations. The results showed that the parser has 100% semantic accuracy on both bAbI and bAbI+ (see footnote 5). This result is not surprising, given that DS-TTR is a general model of incremental language processing, including phenomena such as self-corrections and restarts (see Hough (2015) for details of the model).
MEMN2N: Given just 1 to 5 training instances from bAbI as in the previous experiment, the mean api-call prediction accuracy of the memn2n model is nearly 0 on both bAbI and bAbI+. This is not at all unexpected, since prediction of the api-calls is a generative process, unlike the prediction of system turns, which can be done on a retrieval/look-up basis alone. For this, the model needs to observe the different word sequences that might determine each parameter (slot) value, and observe them with sufficient frequency and variation. This is unlike a semantic parser like DS-TTR, which produces semantic representations for the dialogues as a result of the structural, linguistic knowledge that it embodies. Nevertheless, we were also interested in the general semantic robustness of the memn2n model to the transformations in bAbI+, i.e. how well does the memn2n model interpret bAbI+ dialogues when trained on the full bAbI dataset? Does it then learn to generalise to (process) the bAbI+ dialogues with sufficient semantic accuracy? Table 2 shows that we can fully replicate the results reported in Bordes et al. (2017): the memn2n model can predict the api-calls with 100% accuracy when trained on the bAbI train set and tested on the bAbI test set. But when this same model is tested on bAbI+, the accuracy drops to a very poor 28%, making any dialogue system built using this model unusable in the face of natural, spontaneous dialogue data. This is further discussed below.

testing configuration    accuracy
memn2n on bAbI           100
memn2n on bAbI+          28
Table 2: api-call prediction accuracies (%) for the memn2n model trained on the bAbI train set

Footnote 5: A helpful reviewer points out that the DS-TTR setup is a carefully tuned rule-based system, thus perhaps rendering these results trivial. But we note that the results here are not due to ad-hoc constructions of rules/lexicons, but due to the generality of the grammar model, and its attendant incremental, left-to-right properties; and that the same parser can be used in other domains. Furthermore, the ability to process self-corrections, restarts, etc. "comes for free", without the need to add or posit new machinery.

babble
The method described above has the following advantages over previous approaches to dialogue system development:
- incremental (and thus more natural) language understanding, dialogue management, and generation;
- an "end-to-end" method for task-based systems: no Dialogue Act annotations are required (i.e. reduced development time and effort);
- a complete dialogue system for a new task can be automatically induced, using only 'raw' data, i.e. successful dialogue transcripts;
- the MDP state and action spaces are automatically induced, rather than having to be designed by hand (as in prior work);
- wide-coverage, task-based dialogue systems can be built from much smaller amounts of data, as shown in section 4.
This final point bears further examination. As an empirically adequate model of incremental language processing in dialogue, the DS-TTR grammar is required to capture interactional variants such as question-answer pairs, over- and under-answering, self- and other-corrections, clarification, split-utterances, and ellipsis more generally. As we showed in section 4, even if most of these structures are not present in the training example(s), the resulting trained system is able to handle them, thus resulting in a very significant generalisation around the original data.
We also note that since we were in this instance interested in a direct comparison with memn2ns over the bAbI & bAbI+ datasets, we didn't exploit the power of Reinforcement Learning and exploration as we described above -as we have done before with other systems (Kalatzis et al., 2016). Therefore the generalisation results we report above for babble follow entirely from the knowledge present within the grammar as a computational model of dialogue processing and contextual update, rather than this having been learned from data. Applying the full RL method described above would have meant that the system would actually discover many interactional and syntactic variations that are not present in bAbI, nor in bAbI+.

memn2n
Even when trained on very few training instances, the memn2n model was able to predict system responses remarkably well. But results from Experiment 2 above showed that this was misleading: the memn2ns were making a drastic number of semantic mistakes when interpreting the dialogues, both in the bAbI and bAbI+ datasets. Even when trained on the full bAbI data-set, the model performed badly on bAbI+ in terms of semantic accuracy. We diagnose these results as follows: Problem complexity: The first thing to notice is that in bAbI dialogue Task 1, the responses are highly predictable and stay constant regardless of the actual task details (slot values) up to the point of the final api-calls; and further, that the prediction of api-calls is a generative process, unlike the prediction of the system turns, which is retrieval-based. This, in our view, explains the very large difference in memn2n performance across the two prediction tasks.
Model robustness to the bAbI+ transformations: The variations introduced in bAbI+ are repetitions of both content and non-content words, as well as additional incorrect slot values. The model was working in the same setup as babble, therefore none of those variations could be treated as unknown tokens for either system. In the case of memn2n, however, some of the mixed-in words never appeared in the training data, and bAbI+ utterances were augmented significantly with those words, so it was interesting to see how such untrained embeddings would affect the latent memory representations inside memn2n. The resulting performance suggests that there was no significant impact on memn2n from these variations as far as predicting system responses was concerned. But the incorrect slot values introduced in self-corrections affect the system's task completion performance significantly, with the effect only appearing at the point of api-call prediction.
We note also that none of our experiments in this paper involved training memn2n on bAbI+ data. There is a very interesting question here: is the memn2n model in principle able to learn to process the bAbI+ structures if it is in fact trained on them? And how much bAbI+ data would it require to do so? These issues are addressed in detail elsewhere.

Conclusions
Our main advances are in a) training end-to-end dialogue systems from small amounts of data, and b) incremental processing for wider coverage of more natural everyday dialogues (e.g. containing self-repairs).
We compared our grammar-based approach to dialogue processing (DS-TTR) with a state-of-the-art, end-to-end response retrieval model (memn2ns) (Bordes et al., 2017), when training on small amounts of dialogue data.
Our experiments show that our model can process 74% of the Facebook AI bAbI dataset even when trained on only 0.13% of the data (5 dialogues). It can in addition process 65% of bAbI+, a corpus we created by systematically adding incremental dialogue phenomena such as restarts and self-corrections to bAbI. We find on the other hand that the memn2n model is not robust to the structures we introduced in bAbI+, even when trained on the full bAbI dataset.