Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations

We introduce Span-ConveRT, a light-weight model for dialog slot-filling which frames the task as a turn-based span extraction task. This formulation allows for a simple integration of conversational knowledge coded in large pretrained conversational models such as ConveRT (Henderson et al., 2019). We show that leveraging such knowledge in Span-ConveRT is especially useful for few-shot learning scenarios: we report consistent gains over 1) a span extractor that trains representations from scratch in the target domain, and 2) a BERT-based span extractor. In order to inspire more work on span extraction for the slot-filling task, we also release RESTAURANTS-8K, a new challenging data set of 8,198 utterances, compiled from actual conversations in the restaurant booking domain.


Introduction
Conversational agents are finding success in a wide range of well-defined tasks such as customer support and restaurant, train, or flight bookings (Hemphill et al., 1990; Williams, 2012; El Asri et al., 2017; Budzianowski et al., 2018), language learning (Raux et al., 2003; Chen et al., 2017), and also in domains such as healthcare (Laranjo et al., 2018) or entertainment (Fraser et al., 2018). Scaling conversational agents to support new domains, tasks, and particular system behaviors is a highly challenging and resource-intensive endeavor: it critically relies on expert knowledge and domain-specific labeled data (Williams, 2014; Wen et al., 2017a,b; Liu et al., 2018). Slot-filling is a crucial component of any task-oriented dialog system (Young, 2002, 2010; Bellegarda, 2014). For instance, a conversational agent for restaurant bookings must fill all the slots date, time and number of guests with the correct values given by the user (e.g., tomorrow, 8pm, 3 people) in order to proceed with a booking. A particular challenge is to deploy slot-filling systems in low-data regimes (i.e., few-shot learning setups), which is needed to enable quick and wide portability of conversational agents. Scarcity of in-domain data has typically been addressed using domain adaptation from resource-rich domains, e.g., through multi-task learning (Jaech et al., 2016; Goyal et al., 2018) or ensembling (Jha et al., 2018; Kim et al., 2019).
In this work, we approach slot-filling as a turn-based span extraction problem, similar to Rastogi et al. (2019): in our Span-ConveRT model we do not restrict values to fixed categories, and we simultaneously allow the model to be entirely independent of other components in the dialog system. In order to facilitate slot-filling in resource-lean settings, our main proposal is the effective use of knowledge coded in representations transferred from large general-purpose conversational pretraining models, e.g., the ConveRT model trained on a large Reddit data set (Henderson et al., 2019a).
To help guide other work on span extraction-based slot-filling, we also present a new data set of 8,198 user utterances from a commercial restaurant booking system: RESTAURANTS-8K. The data set spans 5 slots (date, time, people, first name, last name) and consists of actual user utterances collected "in the wild". It thus covers a broad range of natural and colloquial expressions, as illustrated in Figure 1, which makes it both a natural and a challenging benchmark. Each training example is a dialog turn annotated with the slots requested by the system and character-based span indexing for all occurring values.
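For concreteness, the character-based span annotation just described can be illustrated with a hypothetical record; the field names below are ours for illustration, not necessarily the schema of the released data set.

```python
# Illustrative example of a RESTAURANTS-8K-style training record.
# Field names here are hypothetical; consult the released data for the
# actual schema. Spans are character-based indices into the user text.
example = {
    "userInput": "a table for 4 at 8pm please",
    "requestedSlots": ["people", "time"],
    "labels": [
        {"slot": "people", "valueSpan": {"start": 12, "end": 13}},
        {"slot": "time",   "valueSpan": {"start": 17, "end": 20}},
    ],
}

# Recover the annotated surface forms from the character spans.
text = example["userInput"]
values = {
    lab["slot"]: text[lab["valueSpan"]["start"]:lab["valueSpan"]["end"]]
    for lab in example["labels"]
}
```

Indexing into the utterance with the stored offsets recovers the surface forms "4" and "8pm" for the people and time slots, respectively.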
As our key findings show, conversational pretraining is instrumental to span extraction performance in few-shot setups. By using subword representations transferred from ConveRT (Henderson et al., 2019a), we demonstrate that: 1) our ConveRT-backed span extraction model outperforms the model based on transferred BERT representations, and 2) it also yields consistent gains over a span extraction model trained from scratch in the target domains, with large gains reported in few-shot scenarios. We verify both findings on the new RESTAURANTS-8K data set, as well as on four DSTC8-based data sets (Rastogi et al., 2019). All of the data sets used in this work are available online at: https://github.com/PolyAI-LDN/task-specific-datasets.

Methodology: Span-ConveRT
Before we delve into describing the core methodology, we note that in this work we are not concerned with the task of normalizing extracted spans to their actual values: this can be solved effectively with rule-based systems after the span extraction step for cases such as times, dates, and party sizes. There exist hierarchical rule-based parsing engines (e.g., Duckling) that allow for parsing times and dates such as "the day after next Tuesday". Further, phrases such as "Me and my wife and 2 kids" can be parsed using singular noun and number counts in the span with high precision.
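As a rough illustration of this post-hoc normalization step, the party-size case can be sketched with a few rules; this is a minimal sketch with an illustrative vocabulary, not the production rules, and real systems would use a richer grammar such as Duckling.

```python
import re

# Minimal sketch of rule-based normalization applied *after* span
# extraction, assuming the span for the "people" slot has already been
# found. The word lists are illustrative, not exhaustive.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
PEOPLE_NOUNS = {"me", "i", "wife", "husband", "person", "people",
                "adult", "adults", "kid", "kids", "child", "children",
                "guest", "guests"}
COUPLE_NOUNS = {"couple", "couples"}

def party_size(span: str) -> int:
    """Turn an extracted 'people' span into an integer party size."""
    total, pending = 0, 1
    for tok in re.findall(r"[a-z]+|\d+", span.lower()):
        if tok.isdigit():
            pending = int(tok)           # number applies to the next noun
        elif tok in NUMBER_WORDS:
            pending = NUMBER_WORDS[tok]
        elif tok in COUPLE_NOUNS:
            total, pending = total + 2 * pending, 1
        elif tok in PEOPLE_NOUNS:
            total, pending = total + pending, 1
    # A bare number like "8" never hits a noun, so fall back to it.
    return total if total else pending
```

On the examples from the text, "Me and my wife and 2 kids" yields 4 and "2 couples" yields 4, showing how singular nouns and number counts combine with high precision.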
Span Extraction for Dialog. We have recently witnessed increasing interest in intent-restricted approaches to slot-filling (Coucke et al., 2018; Goo et al., 2018). In this line of work, slot-filling is treated as a span extraction problem where slots are defined to occur only with certain intents. This solves the issue of complex categorical modeling, but makes slot-filling dependent on an intent detector. Therefore, we propose a framework that treats slot-filling as a fully intent-agnostic span extraction problem. Instead of using rules to constrain the co-occurrence of slots and intents, we identify a slot as either a single span of text or entirely absent. This makes our approach more flexible than prior work: it is fully independent of other system components. Nevertheless, we can still explicitly capture turn-by-turn context by adding an input feature denoting whether a slot was requested in the current dialog turn (see Figure 1).
Pretrained Representations. Large-scale pretrained models have shown compelling benefits in a plethora of NLP applications (Devlin et al., 2019; Liu et al., 2019): such models drastically lessen the amount of required task- and domain-specific training data through in-domain fine-tuning. This is typically achieved by adding a task-specific output layer to a large pretrained encoder and then fine-tuning the entire model. However, this process requires a fine-tuned model for each slot or domain, rather than a single model shared across all slots and domains. This adds a large memory and computational overhead and makes the approach impractical in real-life applications. Therefore, we propose to keep the pretrained encoder models fixed in order to emulate a production system where a single encoder model is used.

Underlying Representation Model: ConveRT. ConveRT (Henderson et al., 2019a) is a lightweight sentence encoder implemented as a dual-encoder network that models the interaction between inputs/contexts and relevant (follow-up) responses. In other words, it performs conversational pretraining based on response selection on the Reddit corpus (Henderson et al., 2019a,b). It utilizes subword-level tokenization and is very compact and resource-efficient (i.e., it is 59MB in size and can be trained in less than 1 day on 12 GPUs), while achieving state-of-the-art performance on conversational tasks (Casanueva et al., 2020; Bunk et al., 2020). Through pretrained ConveRT representations, we can leverage conversational cues from over 700M conversational turns for the few-shot span extraction task.3

Span-ConveRT: Final Model. We now describe our model architecture, illustrated in Figure 2. Our approach builds on established sequence tagging models using Conditional Random Fields (CRFs) (Ma and Hovy, 2016; Lample et al., 2016). We propose to replace the LSTM part of the model with fixed ConveRT embeddings.4 We take contextualized subword embeddings from ConveRT, giving a sequence of the same length as the subword-tokenized sentence. For sequence tagging, we train a CNN and a CRF on top of these fixed subword representations. We concatenate three binary features to the subword representations to emphasize important textual characteristics: whether the token is (1) alphanumeric, (2) numeric, or (3) the start of a new word. In addition, we concatenate the character length of the token as another integer feature. To incorporate the requested-slots feature, we concatenate to each embedding in the sequence a binary feature indicating whether the slot is requested. To contextualize the modified embeddings, we apply a dropout layer followed by a series of 1D convolutions of increasing filter width. Spans are represented using a sequence of tags, indicating which members of the subword token sequence are in the span.
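The input featurization just described can be sketched as follows; the function name, argument layout, and the exact ordering of the extra features are our assumptions for illustration.

```python
import numpy as np

# Sketch of the featurization described above: fixed subword embeddings
# concatenated with three binary features, the character-length feature,
# and the requested-slot flag. Dimensions and names are illustrative.
def featurize(embeddings, tokens, word_starts, slot_requested):
    """embeddings: (T, D) fixed subword vectors; returns (T, D + 5)."""
    feats = []
    for emb, tok, is_start in zip(embeddings, tokens, word_starts):
        extra = np.array([
            float(tok.isalnum()),   # (1) alphanumeric
            float(tok.isdigit()),   # (2) numeric
            float(is_start),        # (3) starts a new word
            float(len(tok)),        # integer character-length feature
            float(slot_requested),  # requested-slot flag for this turn
        ])
        feats.append(np.concatenate([emb, extra]))
    return np.stack(feats)
```

The resulting matrix is what the dropout layer and the stack of 1D convolutions would then consume.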
We use a tag representation similar to the IOB format, annotating the span with a sequence of before, begin, inside, and after tags; see Figure 2 for an example.
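Converting a character-level span annotation into this four-tag sequence can be sketched as below, assuming each subword token carries its character offsets; the function name and the tag symbols are ours, following the paper's description.

```python
# Sketch of mapping a character-level span to the four-tag scheme
# (before, begin, inside, after) over a subword-tokenized utterance.
def span_to_tags(token_offsets, span_start, span_end):
    """token_offsets: list of (start, end) character offsets per subword."""
    tags, seen_begin = [], False
    for start, end in token_offsets:
        if end <= span_start:
            tags.append("before")       # token ends before the span
        elif start >= span_end:
            tags.append("after")        # token starts after the span
        elif not seen_begin:
            tags.append("begin")        # first token inside the span
            seen_begin = True
        else:
            tags.append("inside")       # subsequent tokens in the span
    return tags
```

For instance, with subword offsets for "at 8pm tonight" and the span covering "8pm", the function emits one tag per subword: before, begin, inside, after.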
The distribution of the tag sequence is modeled with a CRF, whose parameters are predicted by a CNN that runs over the contextualized subword embeddings v. At each step t, the CNN outputs a 4 × 4 matrix of transition scores W_t and a 4-dimensional vector of unary potentials u_t. The probability of a predicted tag sequence y is then modeled as:

p(y) = (1/Z) exp( Σ_t u_t[y_t] + Σ_t W_t[y_{t-1}, y_t] ),

where Z is the normalization term summing the exponentiated scores over all possible tag sequences. The loss is the negative log-likelihood, equal to minus the sum of the transition scores and unary potentials that correspond to the true tag labels, up to the normalization term. The top-scoring tag sequences can be computed efficiently using the Viterbi algorithm.

3 As we show later in §4, we can also leverage BERT-based representations in the same span extraction framework, but our ConveRT-based span extractors result in higher performance.

4 LSTMs are known to be computationally expensive and to require large amounts of resources to obtain any notable success (Pascanu et al., 2013). By utilizing ConveRT instead, we arrive at a much more lightweight and efficient model.
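Under the same notation, the loss and the decoding step can be sketched in a generic linear-chain form; this is our illustration, not the authors' implementation, with u of shape (T, 4) and W of shape (T, 4, 4), where W[t, i, j] scores moving from tag i at step t-1 to tag j at step t.

```python
import numpy as np

def nll(u, W, y):
    """Negative log-likelihood of gold tag sequence y under the CRF."""
    T = len(u)
    score = u[0, y[0]] + sum(u[t, y[t]] + W[t, y[t - 1], y[t]]
                             for t in range(1, T))
    # Forward algorithm for the log-normalizer log Z.
    alpha = u[0].astype(float)
    for t in range(1, T):
        s = alpha[:, None] + W[t]              # (prev_tag, curr_tag)
        m = s.max(axis=0)
        alpha = m + np.log(np.exp(s - m).sum(axis=0)) + u[t]
    log_z = alpha.max() + np.log(np.exp(alpha - alpha.max()).sum())
    return log_z - score

def viterbi(u, W):
    """Highest-scoring tag sequence under the same potentials."""
    T, K = u.shape
    delta, back = u[0].astype(float), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        s = delta[:, None] + W[t]
        back[t] = s.argmax(axis=0)             # best predecessor per tag
        delta = s.max(axis=0) + u[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With zero transition scores the Viterbi path reduces to the per-step argmax of the unary potentials, which is a convenient sanity check for both functions.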

Experimental Setup
New Evaluation Data Set: RESTAURANTS-8K. Data sets for task-oriented dialog systems typically annotate slots with exclusively categorical labels (Budzianowski et al., 2018). While some data sets such as SNIPS (Coucke et al., 2018) or ATIS (Tür et al., 2010) do contain span annotations, they are built with single-utterance voice commands in mind rather than natural multi-turn dialog. To fill this gap and enable more work on span extraction for dialog, we introduce a new data set: RESTAURANTS-8K.5

DSTC8 Data Sets. In addition, we extract span-annotated data sets from four domains of the Schema-Guided Dialog Dataset (Rastogi et al., 2019): (1) bus and coach booking (Buses_1), (2) buying tickets for events (Events_1), (3) property viewing (Homes_1), and (4) renting cars (RentalCars_1). A detailed description of the data extraction protocol and the statistics of the data sets, also released with this paper, are available in Appendix A.
Baseline Models. We compare our proposed model with two strong baselines: V-CNN-CRF is a vanilla approach that uses no pretrained model and instead learns subword representations from scratch; Span-BERT uses fixed BERT subword representations. All three models use the same CNN+CRF architecture on top of the subword representations. For each baseline, we conduct hyper-parameter optimization similar to Span-ConveRT: this is done via grid search and evaluation on the development set of RESTAURANTS-8K. The final sets of hyper-parameters are provided in Table 2. Span-BERT relies on BERT-base, with 12 Transformer layers and 768-dimensional embeddings; ConveRT uses 6 Transformer layers with 512-dimensional embeddings, so it is roughly 3 times smaller.

5 The data set contains some challenging examples where multiple values are mentioned, or where values are mentioned that do not pertain to a slot. For example, in the utterance "I said 5pm not 6pm" multiple times are mentioned; in "I called earlier today" a date is mentioned that is not the day of the booking. Further, there are noticeable differences compared to previous data sets such as DSTC8 (Rastogi et al., 2019): e.g., while all slots in other data sets which pertain to integers (e.g., the number of travelers for a coach journey, or the number of tickets for an event booking) are modeled categorically (i.e., all numbers from 1 to 10 are separate classes), we model the number of people coming for a booking using spans, because people often mention this value indirectly, e.g., "me and my husband", "3 adults, 4 kids", "2 couples".
Following prior work (Coucke et al., 2018; Rastogi et al., 2019), we report the F1 scores for extracting the correct span per user utterance. If a model extracts only part of the span or a longer span, this is treated as an incorrect span prediction.
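The exact-match evaluation just described can be sketched as follows; the function name and the (start, end)/None span encoding are our assumptions, not the authors' evaluation script.

```python
# Sketch of exact-match span F1: a prediction counts as correct only if
# it matches the gold span exactly; partially overlapping or longer
# spans count as errors. Spans are (start, end) offsets, or None when
# the slot is absent from the utterance.
def span_f1(gold_spans, pred_spans):
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        if pred is not None and pred == gold:
            tp += 1
            continue
        if pred is not None:
            fp += 1          # predicted a span that is not exactly right
        if gold is not None:
            fn += 1          # missed (or mismatched) a gold span
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Note that a mismatched span is penalized twice, as both a false positive and a false negative, which is what makes partial extractions count fully as errors under this metric.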
Few-Shot Scenarios. For both data sets, we measure performance on smaller sets sampled from the full data. We gradually decrease the training sets in size whilst maintaining the same test set: this provides insight into performance in low-data regimes.
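A protocol of this kind can be sketched as below; the specific denominators and the nested-prefix sampling strategy are illustrative placeholders, not necessarily the exact splits used in the experiments.

```python
import random

# Illustrative sketch of the few-shot protocol: sample progressively
# smaller fractions of the training set while the test set stays fixed.
def make_fewshot_splits(train_examples, denominators=(1, 2, 4, 8, 16), seed=0):
    """Return {d: subsample} where denominator d keeps 1/d of the data."""
    rng = random.Random(seed)
    shuffled = list(train_examples)
    rng.shuffle(shuffled)
    # Prefixes of a single shuffle give nested subsets: every smaller
    # split is contained in each larger one, so curves are comparable.
    return {d: shuffled[: max(1, len(shuffled) // d)] for d in denominators}
```

Taking prefixes of one fixed shuffle keeps the splits nested, so differences along the learning curve reflect data quantity rather than sampling noise.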

Results and Discussion
The results across all slots are summarized in Table 3 for RESTAURANTS-8K, and in Table 4 for DSTC8. First, we note the usefulness of conversational pretraining and transferred representations: Span-ConveRT outperforms the two baselines in almost all evaluation runs, and the gain over V-CNN-CRF directly suggests the importance of transferred pretrained conversational representations. Second, we note prominent gains with Span-ConveRT especially in few-shot scenarios with reduced training data: e.g., the gap over V-CNN-CRF, only 0.02 on the full RESTAURANTS-8K training set, widens steadily as the training sets shrink. Similar trends are observed on all four DSTC8 subsets. Again, this indicates that general-purpose conversational knowledge coded in ConveRT can indeed boost dialog modeling in low-data regimes. If sufficient domain-specific data is available (e.g., see the results of V-CNN-CRF with full data), learning domain-specialized representations from scratch can lead to strong performance, but using transferred conversational representations seems to be widely useful and robust. We also observe consistent gains over Span-BERT, and even weaker performance of Span-BERT in comparison to V-CNN-CRF in some runs (see Table 3). These results indicate that for conversational end-applications such as slot-filling, pretraining on a conversational task (such as response selection) is more beneficial than standard language-modeling-based pretraining. Our hypothesis is that both the vanilla baseline and ConveRT leverage some "domain adaptation": ConveRT is trained on rich conversational data, while the baseline representations are learned directly on the training data. BERT, on the other hand, is not trained on conversational data directly and usually relies on much longer passages of text. This may make BERT representations less suitable for conversational tasks such as span extraction.
Similar findings, where ConveRT-based conversational representations outperform BERT-based baselines (even with full fine-tuning), have recently been established in other dialog tasks such as intent detection (Henderson et al., 2019a; Casanueva et al., 2020; Bunk et al., 2020). In general, our findings also call for investing more effort into pretraining strategies that are better aligned with target tasks (Mehri et al., 2019; Henderson et al., 2019a; Humeau et al., 2020).
Error Analysis. To better understand the performance of Span-ConveRT on the RESTAURANTS-8K data set, we also conducted a manual error analysis, comparing it with the best performing baseline model, V-CNN-CRF. In Appendix C we lay out the types of errors that occur in a generic span extraction task and investigate the distribution of these types of errors across slots and models. We show that when trained in the high-data setting the distribution is similar between the two models, suggesting that gains from Span-ConveRT are across all types of error. We also show that the distribution varies more in the low-data setting and discuss how that might impact their comparative performance in practice. Additionally, in Appendix D we provide a qualitative analysis on the errors the two models make for the slot first name. We show that the baseline model has a far greater tendency to wrongly identify generic out-of-vocabulary words as names.

Conclusion and Future Work
We have introduced Span-ConveRT, a light-weight model for dialog slot-filling that approaches the problem as a turn-based span extraction task. This formulation allows the model to effectively leverage representations available from large-scale conversational pretraining. We have shown that, due to these pretrained representations, Span-ConveRT is especially useful in few-shot learning setups on small data sets. We have also introduced RESTAURANTS-8K, a new challenging data set that will hopefully encourage further work on span extraction for dialog. In future work, we plan to experiment with multi-domain span extraction architectures.

A DSTC8 Datasets: Data Extraction and Statistics
As discussed in §3, we extract span-annotated data sets from the Schema-Guided Dialog Dataset (SGDD) in four different domains. SGDD is a multi-domain data set, with each domain consisting of several sub-domains. As the data set has been built for transfer learning from one domain to another, many sub-domains only exist in either the training or the development data sets. We are interested in single-domain dialog, and therefore chose data sets from four different domains of the original data set: (1) bus and coach booking, (2) buying tickets for events, (3) property viewing, and (4) renting cars. We selected these domains due to their high number of conversations and their large variety of slots (e.g., area of city to view an apartment, type of event to attend, time/date of coach to book). For each of these domains, we chose their first sub-domain, and took all turns from conversations that stay within this sub-domain. For the requested-slots feature, we check whether the system action of the prior turn contains a REQUEST action. The training and development split is kept the same for all extracted turns. Table 5 shows the resulting data set sizes for each sub-domain. We are releasing these filtered single-domain data sets, along with the code to create them from the original SGDD data.

C Breakdown of Error Types

When training on the full training set (Figure 3), there is little difference in the error breakdown between Span-ConveRT and V-CNN-CRF. This suggests that the behavior of these models is similar when trained in a high-data setting, and that the improvements made by Span-ConveRT are on all fronts.
When trained on a 16th of the dataset (Figure 4), the difference between the models becomes more pronounced. Most notably, the Span-ConveRT model produces a greater proportion of type 4 errors compared to the V-CNN-CRF model on every slot. This suggests that the errors Span-ConveRT makes, although not precisely correct with its span prediction, are more likely to yield a span that could parse to a correct value. For example, consider the sentence "a table for 8pm this evening". The correct span for the slot time is "8pm", but if a model erroneously predicts "8pm this evening" (a span which overlaps the label span) it will still parse to the same time as the label span.

D Qualitative Error Analysis of Span-ConveRT and V-CNN-CRF on RESTAURANTS-8K
As an accompaniment to the quantitative results, we provide a brief qualitative analysis of errors made by the best performing models. Considering only the first name slot, we collect the errors made on the test set that are exclusive to each model. This left 10 errors for Span-ConveRT and 50 for V-CNN-CRF. Along with our analysis based on the full set of 60 errors, we provide a random sample of 5 errors from each model in Tables 8 and 9. A large portion of the errors exclusively made by V-CNN-CRF were predictions of spans where no name was mentioned. Many words that are not standard to the domain of restaurant booking were wrongly, and often confidently, predicted as names. For example, in Table 9 we show that the words "bloody", "web", "animal" and "spread" were all predicted as first names by the baseline model. Employing transferred conversational representations evidently lessens the likelihood of these kinds of errors. Also included in the table is an example where the baseline model fails to recognize a name; corroborated with similar occurrences in the wider set of errors, this suggests that it is less likely than Span-ConveRT to predict spans for out-of-vocabulary names.
As well as backing up the conclusions drawn from our numerical results, we were also interested in the ways in which using pretrained representations might hinder performance. With only 10 errors exclusively made by Span-ConveRT, it was not possible to form any sweeping conclusions, but a handful of errors suggest that the model might employ its background knowledge to reject unfamiliar first names, or to accept familiar ones in spite of the sentence structure suggesting otherwise. For example, in the first row of Table 8 we find that the model rejects the name "Wen" despite it being part of a fairly common exchange for this domain and in a natural position for a first name. The other examples demonstrate that the model can sometimes predict last names as first names and, in spite of contextual cues suggesting otherwise, can do so over-confidently.

Table 9: Random sample of errors exclusively made by V-CNN-CRF for the slot first name. Red text denotes incorrectly predicted spans and orange denotes true spans that were not predicted.