ConVEx: Data-Efficient and Few-Shot Slot Labeling

We propose ConVEx (Conversational Value Extractor), an efficient pretraining and fine-tuning neural approach for slot-labeling dialog tasks. Instead of relying on more general pretraining objectives from prior work (e.g., language modeling, response selection), ConVEx’s pretraining objective, a novel pairwise cloze task using Reddit data, is well aligned with its intended usage on sequence labeling tasks. This enables learning domain-specific slot labelers by simply fine-tuning decoding layers of the pretrained general-purpose sequence labeling model, while the majority of the pretrained model’s parameters are kept frozen. We report state-of-the-art performance of ConVEx across a range of diverse domains and data sets for dialog slot-labeling, with the largest gains in the most challenging, few-shot setups. We believe that ConVEx’s reduced pretraining times (i.e., only 18 hours on 12 GPUs) and cost, along with its efficient fine-tuning and strong performance, promise wider portability and scalability for data-efficient sequence-labeling tasks in general.


Introduction
Slot labeling or slot filling is a critical natural language understanding (NLU) component of any task-oriented dialog system (Young, 2002, 2010; Tür and De Mori, 2011, inter alia). Its goal is to fill the correct values associated with predefined slots: e.g., a dialog system for restaurant bookings is expected to fill slots such as date, time, and the number of guests with the values extracted from a user utterance (e.g., next Thursday, 7pm, 4 people).
Setting up task-oriented dialog systems, and slot labeling methods in particular, to support new tasks and domains is highly challenging due to the inherent scarcity of expensive expert-annotated data for a plethora of intended use scenarios (Williams, 2014; Henderson et al., 2014; Budzianowski et al., 2018; Zhao et al., 2019). One plausible and promising solution is the creation of data-efficient models that learn from only a handful of annotated examples in few-shot scenarios. This approach has shown promise for learning intent detectors (Krone et al., 2020; Bunk et al., 2020) as well as for slot-filling methods (Hou et al., 2020; Coope et al., 2020).
The dominant paradigm followed by the existing models of few-shot slot labeling is transfer learning (Ruder et al., 2019): 1) they rely on representations from models pretrained on large data collections in a self-supervised manner on some general NLP tasks such as (masked) language modeling (Devlin et al., 2019; Brown et al., 2020) or response selection (Henderson et al., 2019b; Cer et al., 2018); and then 2) add additional task-specific layers for modeling the input sequences. However, we detect several gaps with the existing setup, and set out to address them in this work. First, recent work in NLP has validated that a stronger alignment between a pretraining task and an end task can yield performance gains for tasks such as extractive question answering (Glass et al., 2020) and paraphrase and translation (Lewis et al., 2020). We ask whether it is possible to design a pretraining task which is more suitable for slot labeling in conversational applications. Second, is it possible to bypass learning sequence-level layers from scratch, and simply fine-tune them after pretraining instead? Third, is it possible to build a generally applicable model which fine-tunes pretrained "general" sequence-level layers instead of requiring specialized slot labeling algorithms from prior work (Krone et al., 2020; Hou et al., 2020)?
Inspired by these challenges, we propose ConVEx (Conversational Value Extractor), a novel Transformer-based neural model which can be pretrained on large quantities of natural language data (e.g., Reddit) and then directly fine-tuned to a variety of slot-labeling tasks. Similar to prior work (Rastogi et al., 2019; Coope et al., 2020), ConVEx casts slot labeling as a span-based extraction task. For ConVEx, we introduce a new pretraining objective, termed pairwise cloze. This objective aligns well with the target downstream task, slot labeling for dialog, and emulates slot labeling relying on unlabeled sentence pairs from natural language data which share a keyphrase (i.e., a "value" for a specific "slot"). Instead of learning them from scratch as in prior work (Coope et al., 2020), ConVEx's pretrained Conditional Random Fields (CRF) layers for sequence modeling are fine-tuned using a small number of labeled in-domain examples.
We evaluate ConVEx on a range of diverse dialog slot labeling data sets spanning different domains: DSTC8 data sets (Rastogi et al., 2019), RESTAURANTS-8K (Coope et al., 2020), and SNIPS (Coucke et al., 2018). ConVEx yields state-of-the-art performance across all evaluation data sets, but its true usefulness and robustness come to the fore in the few-shot scenarios. For instance, it increases average F1 scores on RESTAURANTS-8K over the previous state-of-the-art model (Coope et al., 2020) from 40.5 to 71.7 with only 64 labeled examples. Similar findings are observed with DSTC8, and we also report state-of-the-art performance in the 5-shot slot labeling task on SNIPS.
In summary, our results validate the benefits of task-aligned pretraining from raw natural language data, with particular gains for data-efficient slot labeling given a limited number of annotated examples, which is a scenario typically met in production. They also clearly demonstrate that competitive performance can be achieved via quick fine-tuning, without heavily engineered specialized methods from prior work (Hou et al., 2020). Further, we validate that learning sequence-level layers from scratch is inferior to fine-tuning from pretrained layers. From a broader perspective, we hope that this research will inspire further work on task-aligned pretraining objectives for other NLP tasks beyond slot labeling. From a more focused perspective, we hope that it will guide new approaches to data-efficient slot labeling for dialog.

Methodology
Before we delve deeper into the description of ConVEx in §2.3, in §2.1 we first describe a novel sentence-pair value extraction pretraining task used by ConVEx, called pairwise cloze, and then in §2.2 a procedure that converts "raw" unlabeled natural language data into training examples.

Pretraining Task: Pairwise Cloze
Why Pairwise Cloze? Top performing natural language understanding models typically make use of neural nets pretrained on large scale data sets with unsupervised objectives such as language modeling (Devlin et al., 2019; Liu et al., 2019) or response selection (Humeau et al., 2020). For sequential tasks such as slot labeling, this involves adding new layers and training them from scratch, as the pretraining procedure does not involve any sequential decoding; therefore, current unsupervised pretraining objectives are suboptimal for sequence-labeling tasks. With ConVEx, we introduce a new pretraining task with the following properties: 1) it is more closely related to the target slot-labeling task, and 2) it facilitates training all the necessary layers for slot-labeling, so these can be fine-tuned rather than learned from scratch.
What is Pairwise Cloze? In a nutshell, given a pair of sentences that have a keyphrase in common, the task treats one sentence as a template sentence and the other as its corresponding input sentence. For the template sentence, the keyphrase is masked out and replaced with a special BLANK token. The model must then read the tokens of both sentences, and predict which tokens in the input sentence constitute the masked phrase. Some examples of such pairs extracted from Reddit are provided in Table 1. The main idea is to teach the model an implicit space of slots and values, where during self-supervised pretraining, slots are represented as the contexts in which a value might occur. The model then gets fine-tuned later to fit domain-specific slot labeling data. 1
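To make the task concrete, the following is a minimal sketch of how one pairwise cloze example could be built from two sentences that share a keyphrase. It is an illustration under stated assumptions, not the pipeline used in this work: the function name `make_pairwise_cloze` is hypothetical, matching is done by plain case-sensitive substring search, and the span is returned as character offsets. The sentence pair is the Star Wars example from the keyphrase-expansion footnote.

```python
from typing import Optional, Tuple

BLANK = "BLANK"  # the special token named in the text


def make_pairwise_cloze(
    template_source: str, input_sentence: str, keyphrase: str
) -> Optional[Tuple[str, str, Tuple[int, int]]]:
    """Return (template, input, (start, end)) or None if the pair is unusable.

    (start, end) are character offsets of the keyphrase in the input sentence.
    Assumption: plain, case-sensitive substring matching stands in for
    whatever token-level matching a real pipeline would use.
    """
    # Keyphrases occurring more than once in a sentence are ignored.
    if template_source.count(keyphrase) != 1 or input_sentence.count(keyphrase) != 1:
        return None
    template = template_source.replace(keyphrase, BLANK)
    start = input_sentence.index(keyphrase)
    return template, input_sentence, (start, start + len(keyphrase))


example = make_pairwise_cloze(
    "I really enjoyed the latest Star Wars movie.",
    "We could not stand any Star Wars movie.",
    "Star Wars movie",
)
template, input_sentence, (start, end) = example
print(template)                    # masked template sentence
print(input_sentence[start:end])   # target span in the input sentence
```

The model only ever sees the masked template and the unmasked input; the offsets serve as the supervision signal.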

Pairwise Cloze Data Preparation
Input Data. We assume working with the English language throughout the paper. Reddit has been shown to provide natural conversational English data for learning semantic representations that work well in downstream tasks related to dialog and conversation (Al-Rfou et al., 2016; Cer et al., 2018; Henderson et al., 2019b; Coope et al., 2020). Therefore, following recent work, we start with the 3.7B comments in the large Reddit corpus from 2015-2018 (inclusive) (Henderson et al., 2019a), filtering it to comments between 9 and 127 characters in length. This yields a total of almost 2B filtered comments.
1 The pairwise cloze task has been inspired by the recent span selection objective applied to extractive QA by Glass et al. (2020): they create examples emulating extractive QA pairs with long passages and short question sentences. Another similar approach to extractive QA has been proposed by Ram et al. (2021). In contrast, our work seeks to emulate slot labeling in a dialog system by creating examples from short conversational utterances.
Input: Toy Story 3 ended perfectly, but Disney just wants to keep milking it.
Template: It really sucks, as the V30 only has BLANK. Maybe the Oreo update will add this.
Input: Thanks for the input, but 64GB is plenty for me :)
Template: I took BLANK, cut it to about 2 feet long and duct taped Vive controllers on each end. Works perfect
Input: Yeah, I just duct taped mine to a broom stick. You can only play no arrows mode but it's really fun.
Template: I had BLANK and won the last game and ended up with 23/20 and still didn't get it.
Input: I know how you feel my friend and I got 19/20 on the tournament today
Table 1: Sample data from Reddit converted to sentence pairs for the ConVEx pretraining via the pairwise cloze task. Target spans in the input sentence are denoted with bold, and are "BLANKed" in the template sentence.
Keyphrase Identification. Training sentence pairs are extracted from unlabeled text based on their shared keyphrases. Therefore, we must first identify plausible candidate keyphrases. To this end, the filtered Reddit sentences are tokenized with a simple word tokenizer, and word frequencies are counted. The score of a candidate keyphrase $kp = (w_1, w_2, \ldots, w_n)$ is computed as a function of the individual word counts:
$$\text{score}(kp) = \frac{1}{n^{\alpha}} \sum_{i=1}^{n} \log \frac{|D|}{\text{count}(w_i)} \quad (1)$$
where $|D|$ is the number of sentences used to calculate the word frequencies. This simple scoring function selects phrases that have informative low-frequency words. The factor $\alpha$ controls the length of the identified keyphrases: e.g., setting it to $\alpha = 0.8$, which is the default in our experiments later, encourages selecting longer phrases. Given a sentence, the keyphrases are selected as those unigrams, bigrams and trigrams whose score exceeds a predefined threshold. The keyphrase identification procedure is run for all of the filtered Reddit sentences. At most two keyphrases are extracted per sentence, and keyphrases spanning more than 50% of the sentence text are ignored. Keyphrases that occur more than once in the sentence are also ignored.
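The scoring step can be sketched in a few lines of Python. This is a hedged illustration, assuming the score is the sum of log inverse word frequencies, log(|D| / count(w)), divided by n^α; the function names, the toy counts, the threshold value, and the rule of keeping the two highest-scoring candidates are all assumptions made for the example. The 50%-length and repeated-keyphrase filters are omitted for brevity.

```python
import math
from typing import Dict, List, Sequence, Tuple


def keyphrase_score(words: Sequence[str], counts: Dict[str, int],
                    num_sentences: int, alpha: float = 0.8) -> float:
    """Assumed form: (1 / n^alpha) * sum_i log(|D| / count(w_i))."""
    n = len(words)
    informativeness = sum(math.log(num_sentences / counts[w]) for w in words)
    return informativeness / (n ** alpha)


def select_keyphrases(tokens: List[str], counts: Dict[str, int],
                      num_sentences: int, threshold: float,
                      alpha: float = 0.8,
                      max_keyphrases: int = 2) -> List[Tuple[float, Tuple[str, ...]]]:
    """Score all unigrams, bigrams and trigrams; keep those above the threshold.

    Assumption: when more than `max_keyphrases` candidates pass the
    threshold, the highest-scoring ones are kept.
    """
    scored = []
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            phrase = tuple(tokens[i:i + n])
            score = keyphrase_score(phrase, counts, num_sentences, alpha)
            if score > threshold:
                scored.append((score, phrase))
    scored.sort(reverse=True)
    return scored[:max_keyphrases]


# Toy word counts over a hypothetical corpus of 1,000 sentences.
counts = {"the": 1000, "latest": 50, "star": 10, "wars": 5}
tokens = ["the", "latest", "star", "wars"]
top = select_keyphrases(tokens, counts, num_sentences=1000, threshold=4.0)
for score, phrase in top:
    print(" ".join(phrase), round(score, 3))
```

With α < 1 the length penalty is weaker than plain averaging, so a bigram of two rare words can outscore either word alone, which is the "longer phrases" effect described above.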
Sentence-Pair Data Extraction. In the next step, sentences from the same subreddit are paired by keyphrase to create paired data, 1.2 billion examples in total, 2 where one sentence acts as the input sentence and another as the template sentence (see Table 1 again). Table 2 summarizes statistics from the entire pretraining data preparation procedure.
2 We also expand keyphrases inside paired sentences if there is additional text on either side of the keyphrase that is the same in both sentences. For instance, the original keyphrase "Star Wars" will be expanded to the keyphrase "Star Wars movie" within this pair: "I really enjoyed the latest Star Wars movie." - "We could not stand any Star Wars movie."

The ConVEx Framework
We now present ConVEx, a pretraining and fine-tuning framework that can be applied to a wide spectrum of slot-labeling tasks. ConVEx is pretrained on the pairwise cloze task (§2.1), relying on sentence-pair data extracted from Reddit (§2.2). Similar to prior work (Coope et al., 2020), we frame slot labeling as a span extraction task: spans are represented using a sequence of tags. These tags indicate which members of the sequence are in the span. We use the same tag representation as Coope et al. (2020), which is similar to the standard IOB format: the span is annotated with a sequence of BEFORE, BEGIN, INSIDE and AFTER tags. The ConVEx pretraining and fine-tuning architectures are illustrated in Figures 1a and 1b respectively, and we describe them in what follows.
ConVEx: Pretraining. The ConVEx model encodes the template and input sentences using exactly the same Transformer layer architecture (Vaswani et al., 2017) as the lightweight and highly optimized ConveRT sentence encoder; we refer the reader to the original work for all architectural and technical details. This model structure is very compact and resource-efficient (i.e., it is 59MB in size and can be trained in 18 hours on 12 GPUs) while achieving state-of-the-art performance on a range of conversational tasks (Coope et al., 2020; Bunk et al., 2020). The weights in the ConveRT Transformer layers are shared for both sentences. 3 The 512-dimensional output representations from the ConveRT layers are projected down to 128-dimensional representations using two separate feed-forward networks (FFNs), one for the template and one for the input sentence. The projected contextual subword representations of the input sentence are then enriched using two blocks of self-attention, attention over the projected template sentence representations, and FFN layers. This provides features for every token in the input sentence that take into account the context of both the input sentence and the template sentence. A final linear layer computes Conditional Random Field (CRF) parameters for tagging the value span using the 4 BEFORE, BEGIN, INSIDE, and AFTER labels.
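The four-tag span representation can be illustrated with a short sketch. The helper names `span_to_tags` and `tags_to_span` are hypothetical, token indices use a half-open (start, end) convention chosen for this example, and representing "no value" as an all-BEFORE sequence is an assumption rather than something stated in the text.

```python
from typing import List, Optional, Tuple

TAGS = ("BEFORE", "BEGIN", "INSIDE", "AFTER")


def span_to_tags(num_tokens: int, span: Optional[Tuple[int, int]]) -> List[str]:
    """Encode a token span (start inclusive, end exclusive) as a tag sequence."""
    if span is None:
        # Assumption: a sentence with no value is tagged BEFORE throughout.
        return ["BEFORE"] * num_tokens
    start, end = span
    if not 0 <= start < end <= num_tokens:
        raise ValueError("span must satisfy 0 <= start < end <= num_tokens")
    tags = []
    for i in range(num_tokens):
        if i < start:
            tags.append("BEFORE")
        elif i == start:
            tags.append("BEGIN")
        elif i < end:
            tags.append("INSIDE")
        else:
            tags.append("AFTER")
    return tags


def tags_to_span(tags: List[str]) -> Optional[Tuple[int, int]]:
    """Decode a tag sequence back into a token span, or None if no value."""
    if "BEGIN" not in tags:
        return None
    start = tags.index("BEGIN")
    end = start + 1
    while end < len(tags) and tags[end] == "INSIDE":
        end += 1
    return start, end


tokens = "a table for four people at 7pm".split()
tags = span_to_tags(len(tokens), (3, 5))  # value span: "four people"
print(list(zip(tokens, tags)))
```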
3 The ConVEx pretraining also closely follows ConveRT's tokenization process: the final subword vocabulary contains 31,476 subword tokens plus 1,000 buckets reserved for out-of-vocabulary tokens. Input text is split into subwords following a simple left-to-right greedy prefix matching (Vaswani et al., 2018), and we tokenize both input sentences and template sentences the same way.
More formally, for each step $t$, corresponding to a subword token in the input sentence, the network outputs a $4 \times 4$ matrix of transition scores $W_t$ and a 4-dimensional vector of unary potentials $u_t$. Under the CRF model, the probability of a predicted tag sequence $y$ is then computed as:
$$p(y) \propto \prod_{t=1}^{T-1} \exp\left(W_t[y_{t+1}, y_t]\right) \prod_{t=1}^{T} \exp\left(u_t[y_t]\right) \quad (2)$$
where $T$ is the number of subword tokens in the input sentence. The loss is the negative log-likelihood, which is equal to the negative sum of the transition scores and unary potentials that correspond to the true tag labels, up to a normalization term. The top scoring tag sequences are computed efficiently using the Viterbi algorithm (Sutton and McCallum, 2012).
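The CRF scoring, loss and decoding described above can be sketched in pure Python. This is an illustrative sketch, not the ConVEx implementation: it assumes the unnormalized log-score of a tag sequence is the sum of the selected unary potentials and per-step transition scores, indexes transitions as `trans[t][next_tag][prev_tag]` (an arbitrary convention chosen here), and uses random toy scores in place of network outputs.

```python
import math
import random
from itertools import product
from typing import List, Sequence, Tuple

Unary = List[List[float]]        # unary[t][tag]
Trans = List[List[List[float]]]  # trans[t][next_tag][prev_tag], t = 0 .. T-2


def logsumexp(xs: Sequence[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(tags: Sequence[int], unary: Unary, trans: Trans) -> float:
    """Unnormalized log-score: selected unary potentials plus transitions."""
    score = sum(unary[t][tag] for t, tag in enumerate(tags))
    score += sum(trans[t][tags[t + 1]][tags[t]] for t in range(len(tags) - 1))
    return score


def log_partition(unary: Unary, trans: Trans) -> float:
    """Normalization term, computed with the forward algorithm."""
    k = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [unary[t][cur] + logsumexp([alpha[prev] + trans[t - 1][cur][prev]
                                            for prev in range(k)])
                 for cur in range(k)]
    return logsumexp(alpha)


def negative_log_likelihood(tags: Sequence[int], unary: Unary, trans: Trans) -> float:
    return log_partition(unary, trans) - sequence_score(tags, unary, trans)


def viterbi(unary: Unary, trans: Trans) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its unnormalized score."""
    k = len(unary[0])
    best = list(unary[0])
    backpointers = []
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for cur in range(k):
            cands = [best[prev] + trans[t - 1][cur][prev] for prev in range(k)]
            arg = max(range(k), key=cands.__getitem__)
            pointers.append(arg)
            new_best.append(cands[arg] + unary[t][cur])
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores standing in for the network outputs (3 tokens, 4 tags).
rng = random.Random(0)
T, K = 3, 4
unary = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(T)]
trans = [[[rng.uniform(-1, 1) for _ in range(K)] for _ in range(K)] for _ in range(T - 1)]
path, score = viterbi(unary, trans)
# Brute-force check over all K**T sequences (only feasible for toy sizes).
brute_force = max(product(range(K), repeat=T),
                  key=lambda y: sequence_score(y, unary, trans))
print(path, list(brute_force))
```

Viterbi and the forward recursion both run in time linear in the sequence length, whereas the brute-force check enumerates every tag sequence and is included only to validate the sketch.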
In addition to the CRF loss, an auxiliary dot-product loss can be added. This loss encourages the model to pair template sentences with the corresponding (semantically similar) input sentences. Let $f^T_i$ be the $d$-dimensional encoding of the beginning-of-sentence (BOS) token for the $i$-th template sentence, and $f^I_i$ be the encoding of the BOS token for the $i$-th (corresponding) input sentence. As the encodings are contextual, the BOS representations can encapsulate the entire sequence. The auxiliary dot-product loss is then computed as:
$$-\sum_{i=1}^{N} C \langle f^T_i, f^I_i \rangle + \sum_{i=1}^{N} \log \sum_{j=1}^{N} e^{C \langle f^T_i, f^I_j \rangle} \quad (3)$$
where $N$ is the number of examples in the batch, $\langle \cdot, \cdot \rangle$ is cosine similarity and $C$ is an annealing factor that linearly increases from 0 to $\sqrt{d}$ over the first 10K training batches as in previous work. The auxiliary loss is inspired by the dot-product loss typically used in retrieval tasks such as response selection (Henderson et al., 2017). Note that this loss does not necessitate any additional model parameters, and does not significantly increase the computational complexity of the pretraining procedure. Later in §4 we evaluate the efficacy of pretraining with and without the auxiliary loss.
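A small sketch may help clarify the auxiliary loss. It assumes the usual in-batch softmax form used for response selection: each template encoding is scored against every input encoding in the batch with scaled cosine similarity, and the matching input is treated as the correct class. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def annealing_factor(step: int, dim: int, warmup_steps: int = 10_000) -> float:
    """C grows linearly from 0 to sqrt(dim) over the first `warmup_steps` batches."""
    return math.sqrt(dim) * min(step / warmup_steps, 1.0)


def auxiliary_loss(template_encs: List[Sequence[float]],
                   input_encs: List[Sequence[float]], c: float) -> float:
    """Assumed form: sum_i [ log sum_j exp(C * sim_ij) - C * sim_ii ]."""
    loss = 0.0
    for i, f_template in enumerate(template_encs):
        sims = [c * cosine(f_template, f_input) for f_input in input_encs]
        m = max(sims)
        log_norm = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_norm - sims[i]
    return loss


# Toy BOS encodings: aligned pairs should give a lower loss than swapped ones.
templates = [[1.0, 0.0], [0.0, 1.0]]
aligned_inputs = [[2.0, 0.0], [0.0, 3.0]]
swapped_inputs = [[0.0, 3.0], [2.0, 0.0]]
c = annealing_factor(step=10_000, dim=2)
print(auxiliary_loss(templates, aligned_inputs, c))
print(auxiliary_loss(templates, swapped_inputs, c))
```

At C = 0 every pairing is equally likely and the loss reduces to N log N, so the annealing schedule lets the retrieval signal strengthen gradually.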
ConVEx: Fine-tuning. The majority of the computation and parameters of ConVEx are in the shared ConveRT Transformer encoder layers: they comprise 30M parameters, while the decoder layers comprise only 800K parameters. During ConVEx fine-tuning, the shared ConveRT Transformer layers are frozen: these expensive operations are shared across slots, while the fine-tuned slot-specific models are small in memory and fast to run.
To apply the ConVEx model to slot-labeling for a specific slot, the user utterance is treated both as the input sentence and the template sentence (see Figure 1b); note that at fine-tuning and inference the user input does not contain any BLANK token. This effectively makes the attention layers in the decoder act like additional self-attention layers. For some domains, additional context features such as the binary is_requested feature need to be incorporated (Coope et al., 2020): this is modeled through a residual layer that computes a term to add to the ConveRT output encoding, given the encoding itself and the additional features (see Figure 1b).
We again note that, except for the residual layer, no new layers are added between pretraining and fine-tuning; this implies that the model bypasses learning from scratch any potential complicated dynamics related to the application task, and is directly applicable to various slot-labeling scenarios.

Experimental Setup
Pretraining: Technical Details. The ConVEx parameters at pretraining are randomly initialized, including the ConveRT layers, and the model is pretrained on the pairwise cloze Reddit data. Pretraining proceeds in batches of 256 examples, 64 of which are randomly paired sentences where no value should be extracted, with the remainder being pairs from the training data. This teaches the model that sometimes no value should be predicted, a scenario frequently encountered with slot labeling. Table 3 provides a concise summary of these and other pretraining hyper-parameters.
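The batch composition can be sketched as follows. This is a hedged illustration: the function name and the toy data are hypothetical, and the way negatives are drawn (the template of one example combined with the input of another, with no target span) is an assumption about what "randomly paired" means in practice. A real pipeline would also have to guard against a random pair accidentally sharing a keyphrase.

```python
import random
from typing import List, Optional, Tuple

# (template sentence, input sentence, target span or None for "no value")
Example = Tuple[str, str, Optional[Tuple[int, int]]]


def build_pretraining_batch(pairs: List[Example], rng: random.Random,
                            batch_size: int = 256,
                            num_negatives: int = 64) -> List[Example]:
    """Mix keyphrase-sharing pairs with randomly paired negatives."""
    batch = rng.sample(pairs, batch_size - num_negatives)
    for _ in range(num_negatives):
        # Assumption: a negative combines the template of one example with
        # the input sentence of a different example and has no target span.
        a, b = rng.sample(pairs, 2)
        batch.append((a[0], b[1], None))
    rng.shuffle(batch)
    return batch


# Toy stand-in for the paired pretraining data.
pairs = [(f"template {i} BLANK", f"value{i} input", (0, len(f"value{i}")))
         for i in range(1000)]
batch = build_pretraining_batch(pairs, random.Random(0))
negatives = sum(1 for _, _, span in batch if span is None)
print(len(batch), negatives, negatives / len(batch))
```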
Computational Efficiency and Tractability. ConVEx is pretrained for 18 hours on 12 Tesla K80 GPUs; this is typically sufficient to reach convergence. The total pretraining cost is roughly $85 on Google Cloud Platform. This pretraining regime is orders of magnitude cheaper and more efficient than the prevalent pretrained NLP models such as BERT (Devlin et al., 2019), GPT models (Brown et al., 2020), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), etc. The reduced pretraining cost allows for wider experimentation, and aligns with recent ongoing initiatives on improving fairness and inclusion in NLP/ML research and practice (Strubell et al., 2019;Schwartz et al., 2019).
Fine-tuning: Technical Details. We use the same fine-tuning procedure for all fine-tuning experiments on all evaluation data sets. It proceeds for 4,000 steps of batches of size 64, stopping early if the loss drops below 0.001. 4 The ConveRT layers are frozen, while the other layers are initialized to their pretrained values and optimized with Adam (Kingma and Ba, 2015), with a learning rate of 0.001 that decays to $10^{-6}$ over the first 3,500 steps using cosine decay (Loshchilov and Hutter, 2017). Dropout is applied to the output of the ConveRT layers with a rate of 0.5: it decays to 0 over 4,000 steps also using cosine decay. The residual layer for additional features (e.g., is_requested, token_is_numeric) consists of a single 1024-dim hidden layer. As we demonstrate later in §4, this procedure works well across a variety of data settings. The early stopping and dropout are intended to prevent overfitting on very small data sets.
Fine-tuning and Evaluation: Data and Setup. We rely on several diverse slot-labeling data sets, used as established benchmarks in previous work. First, we evaluate on a recent data set from Coope et al. (2020): RESTAURANTS-8K, which comprises conversations from a commercial restaurant booking system. It covers 5 slots required for the booking task: date, time, people, first name, and last name. Second, we use the Schema-Guided Dialog Dataset (SGDD) (Rastogi et al., 2019), originally released for DSTC8, in the same way as prior work (Coope et al., 2020), extracting span annotated data sets from SGDD in four different domains. The particulars of the RESTAURANTS-8K and DSTC8 evaluation data are provided in the appendix.
Similar to Coope et al. (2020), we simulate few-shot scenarios and measure performance on smaller sets sampled from the full data. We randomly subsample training sets of various sizes while maintaining the same test set.
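The learning-rate and dropout schedules described above can be written down compactly. The sketch below assumes a single half-cosine decay without restarts and holds the final value once the decay window ends; the function names are hypothetical, and the constants are those stated in the text.

```python
import math


def cosine_decay(step: int, start: float, end: float, decay_steps: int) -> float:
    """Half-cosine interpolation from `start` to `end`, constant afterwards."""
    progress = min(step, decay_steps) / decay_steps
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def learning_rate(step: int) -> float:
    # 0.001 decaying to 1e-6 over the first 3,500 steps.
    return cosine_decay(step, start=1e-3, end=1e-6, decay_steps=3500)


def dropout_rate(step: int) -> float:
    # 0.5 decaying to 0 over 4,000 steps.
    return cosine_decay(step, start=0.5, end=0.0, decay_steps=4000)


def should_stop(step: int, loss: float, max_steps: int = 4000,
                loss_threshold: float = 1e-3) -> bool:
    """Stop after 4,000 steps, or early once the loss drops below 0.001."""
    return step >= max_steps or loss < loss_threshold


for step in (0, 1000, 2000, 3500, 4000):
    print(step, learning_rate(step), dropout_rate(step))
```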
Furthermore, we also evaluate ConVEx in the 5-shot evaluation task on the SNIPS data (Coucke et al., 2018), following the exact setup of Hou et al. (2020), which covers 7 diverse domains, ranging from Weather to Creative Work (see Table 4 later for the list of domains). The statistics of the SNIPS evaluation are also provided in the appendix.
The SNIPS evaluation task slightly differs from RESTAURANTS-8K and DSTC8: we thus provide additional details related to the fine-tuning and evaluation procedure on SNIPS, replicating the setup of Hou et al. (2020). Each of the 7 domains in turn acts as a held-out test domain, and the other 6 can be used for training. From the held-out test domain, episodes are generated that contain around 5 examples, covering all the slots in the domain. For each domain, we first further pretrain the ConVEx decoder layers (the ones that get fine-tuned) on the other 6 domains: we append the slot name to the template sentence, which allows training on all the slots. This gives a single updated fine-tuned ConVEx decoder model, trained on all slots of all other domains. For each episode, for each slot in the target domain we fine-tune 3 ConVEx decoders. The predictions are ensembled by averaging probabilities to give final predictions. This helps reduce variability and improves prediction quality.
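As a rough illustration of the ensembling step, the sketch below averages the outputs of three decoders and picks the most probable tag per token. The text does not specify which probabilities are averaged; per-token tag distributions are assumed here, and the function names and toy numbers are hypothetical.

```python
from typing import List

TAGS = ["BEFORE", "BEGIN", "INSIDE", "AFTER"]

# probs[t][k]: probability of tag k at token t, as produced by one decoder.
TagProbs = List[List[float]]


def ensemble_average(model_probs: List[TagProbs]) -> TagProbs:
    """Average per-token tag probabilities across decoders."""
    num_models = len(model_probs)
    num_tokens = len(model_probs[0])
    num_tags = len(model_probs[0][0])
    return [[sum(p[t][k] for p in model_probs) / num_models
             for k in range(num_tags)]
            for t in range(num_tokens)]


def decode(avg_probs: TagProbs) -> List[str]:
    """Pick the most probable tag for each token."""
    return [TAGS[max(range(len(row)), key=row.__getitem__)] for row in avg_probs]


# Three toy decoders over a two-token input; the third one disagrees.
decoder_1 = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
decoder_2 = [[0.2, 0.6, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]]
decoder_3 = [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
averaged = ensemble_average([decoder_1, decoder_2, decoder_3])
decoded = decode(averaged)
print(decoded)
```

Averaging lets two agreeing decoders outvote a third that was pushed off course by an unlucky fine-tuning run, which is the variance-reduction effect mentioned above.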
Baseline Models. For RESTAURANTS-8K and DSTC8, we compare ConVEx to the current best-performing approaches from Coope et al. (2020): Span-BERT and Span-ConveRT. Both models rely on the same CNN+CRF architecture 5 applied on top of the subword representations transferred from a pretrained BERT(-Base/Large) model (Devlin et al., 2019) (Span-BERT), or from a pretrained ConveRT model (Span-ConveRT). 6 Similar to Coope et al. (2020), for each baseline we run hyper-parameter optimization via grid search, evaluating on the dev set. For SNIPS, we compare ConVEx to a wide spectrum of different few-shot learning models proposed and compared by Hou et al. (2020). 7 One crucial difference between our approach and the methods evaluated by Hou et al. (2020) is as follows: we treat each slot independently, using separate ConVEx decoders for each, while their methods train a single CRF decoder that models all slots jointly. One model per slot is simpler, easier for practical use (e.g., it is possible to keep and manage data sets for each slot independently), and makes pretraining conceptually easier. 8
Evaluation Measure. Following previous work (Coucke et al., 2018; Rastogi et al., 2019; Coope et al., 2020), we report the average F1 scores for extracting the correct span per user utterance. If the models extract part of the span or a longer span, this is treated as an incorrect span prediction.
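The exact-match evaluation measure can be sketched as follows. This is an illustration under assumptions: spans are (start, end) token offsets with at most one span per utterance, None denotes "no value", and F1 is computed from counts pooled over utterances for a single slot; the averaging across slots used for the reported numbers is not reproduced here.

```python
from typing import List, Optional, Tuple

Span = Optional[Tuple[int, int]]  # None means no value extracted / labeled


def span_f1(predicted: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over one slot.

    A prediction counts as correct only if it matches the gold span exactly;
    partial or longer spans are treated as errors.
    """
    true_pos = sum(1 for p, g in zip(predicted, gold) if p is not None and p == g)
    num_pred = sum(1 for p in predicted if p is not None)
    num_gold = sum(1 for g in gold if g is not None)
    precision = true_pos / num_pred if num_pred else 0.0
    recall = true_pos / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [(3, 5), (0, 1), None, (2, 4)]
pred = [(3, 5), (0, 2), (1, 2), None]  # exact, too long, spurious, missed
precision, recall, f1 = span_f1(pred, gold)
print(precision, recall, f1)
```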

Results and Discussion
Intrinsic (Reddit) Evaluation. ConVEx reaches a precision of 84.8% and a recall of 85.3% on the held-out Reddit test set (see Table 2 again), using 25% random negatives as during pretraining. The ConVEx variant without the auxiliary loss (termed no-aux henceforth) reaches a precision of 82.7% and a recall of 83.9%, already indicating the usefulness of the auxiliary loss. 9 These preliminary results serve mostly as a sanity check, suggesting ConVEx's ability to generalize over unseen Reddit data; we now evaluate its downstream task efficacy.
6 Coope et al. (2020) also evaluated an approach based on the same CNN+CRF architecture as Span-{BERT, ConveRT} which does not rely on any pretrained sentence encoder, and learns task-specific subword representations from scratch. However, that approach is consistently outperformed by Span-ConveRT, and we therefore do not report it for brevity.
7 A full description of each baseline model is beyond the scope of this work, and we refer to Hou et al. (2020) for further details. For completeness, short summaries of each baseline model on SNIPS are provided in the appendix.
8 Moreover, the methods of Hou et al. (2020) are arguably more computationally complex: at inference, their strongest models (i.e., TapNet and WPZ, see the appendix) run BERT for every sentence in the fine-tuning set (TapNet), or run classification for every pair of test words and words from the fine-tuning set (WPZ). The computational complexity of the ConVEx approach does not scale with the fine-tuning set, only with the number of words in the query sequence.
9 While we evaluate the two ConVEx variants also in the slot-labeling tasks later, unless noted otherwise, in all experiments we assume the use of the variant with the aux loss.
Table 4: F1 scores on SNIPS 5-shot evaluation, following the exact setup of Hou et al. (2020). For an overview of the baseline models from Hou et al. (2020), see the original work and short summaries available in the appendix.
Evaluation on RESTAURANTS-8K and DSTC8. The main respective results are summarized in Figure 2a and Figure 2b, with additional results available in the appendix. In full-data scenarios all models in our comparison, including the baselines from Coope et al. (2020), yield strong performance reaching ≥ 90% or even ≥ 95% average F1 across the board. 10 However, it is encouraging that ConVEx is able to surpass the baseline models on average even in the full-data regimes. Figure 2a and Figure 2b also suggest the true benefit of the proposed ConVEx approach: its ability to handle few-shot scenarios well. The gap between ConVEx and the baseline models becomes more and more pronounced as we continue to reduce the number of annotated examples for the labeling task. On RESTAURANTS-8K the gain is still small when dealing with 1,024 annotated examples (+2.1 F1 points over the strongest baseline), but it increases to +18.4 F1 points when 128 annotated examples are available, and further to +31.2 F1 points when only 64 annotated examples are available. We can trace a similar behavior on DSTC8, with gains reported for all the DSTC8 single-domain subsets in few-shot setups.
10 The only exception is Span-BERT's lower performance on the DSTC8 Homes_1 evaluation, see the appendix. In general, as shown previously by Coope et al. (2020) and revalidated here, conversational pretraining based on response selection (ConveRT) seems more useful for conversational applications than regular LM-based pretraining (BERT).
These results point to the following key conclusion. While pretrained representations are clearly useful for slot-labeling dialog tasks, and pretraining becomes increasingly important when we deal with few-shot scenarios, the chosen pretraining paradigm has a profound impact on the final performance. The pairwise cloze pretraining task, tailored for slot-labeling tasks in particular, is more robust and better adapted to few-shot slot-labeling tasks. This also verifies our hypothesis that it is possible to learn effective domain-specific slot-labeling systems by simply fine-tuning a pretrained general-purpose slot labeler relying only on a handful of domain-specific examples.
SNIPS Evaluation (5-Shot). The versatility of ConVEx is further verified in the 5-shot labeling task on SNIPS following Hou et al. (2020)'s setup. The results are provided in Table 4. We report the highest average F1 scores with ConVEx; ConVEx also surpasses all the baselines in 4/7 domains, while the highest scores in the remaining three domains are achieved by three different models from Hou et al. (2020). This again hints at the robustness of ConVEx, especially in few-shot setups, and shows that a single pretrained model can be adapted to a spectrum of slot-labeling tasks and domains.
These results also stand in contrast with the previous findings of Hou et al. (2020) where they claimed "...that fine-tuning on extremely limited examples leads to poor generalization ability". On the contrary, our results validate that it is possible to fine-tune a pretrained slot-labeling model directly with a limited number of annotated examples for various domains, without hurting the generalization ability of ConVEx. In other words, we demonstrate that the mainstream "pretrain then fine-tune" paradigm is a viable solution to sequence-labeling tasks in few-shot scenarios, but with the condition that the pretraining task must be structurally well-aligned with the intended downstream tasks.
Next, we analyze the benefits of model ensembling, as done in the 5-shot SNIPS task, also on RESTAURANTS-8K. The results across different training data sizes are shown in Table 5. While there is no performance difference when a sufficient number of annotated examples is available, the scores suggest that the model ensembling strategy does yield small but consistent improvements in few-shot scenarios, as it mitigates the increased variance that is typically met in these setups.
Pretraining on CC100. We also test the robustness of ConVEx by pretraining it on another large Web-scale dataset: CC100 is a large CommonCrawl corpus available for English and more than 100 other languages. We use the English CC100 portion to pretrain ConVEx relying on exactly the same procedure described in §2, and then fine-tune it as before. First, its intrinsic evaluation on the held-out test set already hints that the CC100-based ConVEx is also a powerful slot labeler: we reach a precision of 85.9% and a recall of 86.3%. More importantly, the results on RESTAURANTS-8K, provided in Figure 3, confirm that another general-purpose corpus can be successfully used to pretrain the ConVEx model. We even observe slight gains on average over the Reddit-based model.
Inductive Bias of ConVEx. In sum, ConVEx outperforms current state-of-the-art slot-labeling models such as Span-ConveRT, especially in low-data settings, where the performance difference is particularly large. The model architectures of Span-{BERT, ConveRT} and ConVEx are very similar: the difference in performance thus arises mainly from the pretraining task, and the fact that ConVEx's sequence-decoding layers are pretrained, rather than learned from scratch. We now analyze the inductive biases of ConVEx, that is, how the pretraining regime and the main assumptions affect its behavior before and after fine-tuning.
First, we analyze per-slot performance on RESTAURANTS-8K, comparing ConVEx (with aux) with Span-BERT and Span-ConveRT. The scores in a few-shot scenario with 64 examples are provided in Figure 4, and we observe similar patterns in other few-shot scenarios. The results indicate the largest performance gap for the slots first name and last name. This is expected, given that by the ConVEx design the keyphrases extracted from Reddit consist of rare words, and are thus likely to cover plenty of names without sufficient coverage in small domain-specific data sets. Nonetheless, we also observe prominent gains over the baselines for the other slots with narrower semantic fields, where less lexical variability is expected (date and people).
We can also expose ConVEx's built-in biases by applying it with no fine-tuning. Figure 5 shows the results with no slot-specific fine-tuning on RESTAURANTS-8K, feeding the user input as both the template and input sentence. We extract at most one value from each sentence: the model predicted a value for 96% of all the test examples, 16% of which corresponded to an actual labeled slot, and 86% did not. The highest recalls were for the name slots, and the time slot, which correlates with the slot-level breakdown results from Figure 4. 11
11 The most frequent predictions from non-finetuned ConVEx that do not correspond to a labeled slot on RESTAURANTS-8K give further insight into its inductive biases. The top 10 extracted non-labeled values are in descending order: booking, book, reservation, a reservation, a table, indoors, restaurant, cuisine, outside table, and outdoors. Some of these could be modeled as slot values with an extended ontology, such as indoors or outdoors/outside table.

Conclusion
We have introduced ConVEx (Conversational Value Extractor), a light-weight pretraining and fine-tuning neural approach to slot-labeling dialog tasks. We have demonstrated that it is possible to learn domain-specific slot labelers even in low-data regimes by simply fine-tuning the decoder layers of the pretrained general-purpose ConVEx model. By aligning the pretraining phase with the downstream fine-tuning phase for slot-labeling tasks, the ConVEx framework has achieved a new leap in performance on standard dialog slot-labeling tasks, most notably in few-shot setups.
In future work, we plan to investigate the limits of data-efficient slot labeling, focusing on one-shot and zero-shot setups. We will also apply ConVEx to related tasks such as named entity recognition and conversational question answering.

Ethical Considerations
To the best of our knowledge, the conducted work does not imply any undesirable ethical ramifications. By design and owing to its uncontrollable nature, the Reddit data does encode a variety of societal, gender, and other biases; however, the models pretrained on the Reddit data are always fine-tuned for specific tasks using controlled data, and the Reddit-pretrained models are not used directly for any text generation or full-fledged dialogue applications. The evaluation data used in this work have been collected in previous work following standard crowdsourcing and data annotation practices.

A Evaluation Data Statistics
For completeness, we provide summary statistics of the evaluation data used in our work. Table 6 shows the statistics of the RESTAURANTS-8K data set, and Table 7 shows the statistics of the DSTC8 data set. Both data sets are available at: github.com/PolyAI-LDN/task-specific-datasets. Table 8 provides the statistics of the original SNIPS data set (Coucke et al., 2018). For further details on how the data set has been used in the 5-shot evaluation setup, we refer the reader to the work of Hou et al. (2020). The data sets are available at: github.com/AtmaHou/FewShotTagging.
Recently, RESTAURANTS-8K and DSTC8 training and evaluation data have been made available via the integrated DialoGLUE benchmark (Mehri et al., 2020). For further details regarding the two evaluation sets, we also refer the reader to the original work (Rastogi et al., 2019;Coope et al., 2020).

B Baseline Models in the SNIPS Evaluation
This appendix provides a brief summary of the models from Hou et al. (2020) included in the SNIPS evaluation (Table 4) alongside ConVEx.
TransferBERT is a direct application of BERT (Devlin et al., 2019) to sequence labeling. It is first trained on the source domains. As the sequence labeling layers are domain-specific, these are then removed, and new layers are fine-tuned on the in-domain training set. Hou et al. (2020) refer to this set as the support set; it is exactly what we use for fine-tuning ConVEx.
SimBERT predicts sequence labels according to the cosine similarity between the BERT representations of the input tokens and those of the tokens in the support set, assigning each input token the label of its most similar labeled support token.
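The nearest-neighbor labeling rule described for SimBERT can be illustrated with a minimal sketch. This is not the implementation of Hou et al. (2020): the function names are hypothetical, and the toy 2-dimensional vectors stand in for BERT token representations, which are assumed to be computed elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def nearest_label_tagging(query_vecs, support_vecs, support_labels):
    """Give each query token the label of its most similar support token."""
    predictions = []
    for q in query_vecs:
        best = max(range(len(support_vecs)),
                   key=lambda i: cosine(q, support_vecs[i]))
        predictions.append(support_labels[best])
    return predictions


# Toy vectors (assumption): placeholders for BERT token representations.
support_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
support_labels = ["O", "B-time", "B-people"]
query_vecs = [[0.9, 0.1], [0.1, 2.0], [1.0, 1.1]]

predicted = nearest_label_tagging(query_vecs, support_vecs, support_labels)
print(predicted)
```

Because every token is labeled independently, a rule of this kind does not model dependencies between neighboring labels.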
WarmProtoZero (WPZ) (Fritzler et al., 2019) applies Prototypical Networks (Snell et al., 2017) to sequence labeling tasks. It treats sequence-labeling as word-level classification, and can either use randomly initialized word embeddings, or pretrained representations in the case of WPZ+BERT.
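As a rough illustration of word-level classification with Prototypical Networks, the sketch below averages the support embeddings of each label into a prototype and assigns each query token to the closest prototype under squared Euclidean distance. The helper names and the toy embeddings are assumptions made for illustration; the embedding network itself, which is what is actually trained, is omitted.

```python
from collections import defaultdict


def build_prototypes(support_vecs, support_labels):
    """Average the support embeddings of each label into one prototype."""
    grouped = defaultdict(list)
    for vec, label in zip(support_vecs, support_labels):
        grouped[label].append(vec)
    prototypes = {}
    for label, vecs in grouped.items():
        dim = len(vecs[0])
        prototypes[label] = [sum(v[d] for v in vecs) / len(vecs)
                             for d in range(dim)]
    return prototypes


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify_tokens(query_vecs, prototypes):
    """Label each query token with the label of its nearest prototype."""
    return [min(prototypes, key=lambda lab: squared_distance(q, prototypes[lab]))
            for q in query_vecs]


# Toy embeddings (assumption): placeholders for learned word representations.
support_vecs = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = ["O", "O", "B-date", "B-date"]

prototypes = build_prototypes(support_vecs, support_labels)
predicted = classify_tokens([[0.1, 0.1], [0.9, 1.0]], prototypes)
print(predicted)
```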
TapNet is a few-shot learning paradigm originally applied to image classification (Yoon et al., 2019). This works similarly to Prototypical Networks, but includes a task-adaptive network that projects examples into a space where words of differing labels are well separated.
Collapsed Dependency Transfer (CDT) is a technique for simplifying the transition dynamics of a CRF, applied to both TapNet and WPZ. It represents the full transition matrix using shared abstract transitions, e.g., modeling the transition from any Begin tag to the Begin tag of any different slot using a single shared probability.
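The idea of sharing abstract transitions can be sketched as follows. The grouping into O/B/I tag kinds with a same-slot or different-slot relation, the function names, and the scores in the collapsed table are all assumptions chosen for illustration; they are not the categories or learned values of Hou et al. (2020).

```python
def abstract_key(src, dst):
    """Map a concrete tag pair (e.g. B-time -> B-people) to a shared key."""
    src_kind, _, src_slot = src.partition("-")
    dst_kind, _, dst_slot = dst.partition("-")
    if src_kind == "O" or dst_kind == "O":
        relation = "none"  # the O tag carries no slot name
    else:
        relation = "same" if src_slot == dst_slot else "diff"
    return (src_kind, dst_kind, relation)


def expand_transitions(tags, collapsed):
    """Build the full tag-to-tag table from the small collapsed table."""
    return {(s, d): collapsed[abstract_key(s, d)] for s in tags for d in tags}


# Illustrative scores (assumption): hand-picked, not learned from data.
collapsed = {
    ("O", "O", "none"): 0.6, ("O", "B", "none"): 0.4, ("O", "I", "none"): 0.0,
    ("B", "O", "none"): 0.5, ("I", "O", "none"): 0.6,
    ("B", "B", "same"): 0.05, ("B", "B", "diff"): 0.15,
    ("B", "I", "same"): 0.3, ("B", "I", "diff"): 0.0,
    ("I", "B", "same"): 0.05, ("I", "B", "diff"): 0.15,
    ("I", "I", "same"): 0.2, ("I", "I", "diff"): 0.0,
}

tags = ["O", "B-time", "I-time", "B-people", "I-people"]
full = expand_transitions(tags, collapsed)
print(len(collapsed), len(full))
```

In this toy setup, 13 shared entries determine all 25 tag-to-tag transitions. The collapsed table does not grow when new slots are added, which is what allows transition knowledge to carry over to domains with unseen label sets.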
Label Enhanced models, denoted L-WPZ and L-TapNet, use the semantics of the label names themselves to enrich the word-label similarity modeling.

C Additional Results
The exact F1 scores corresponding to the results plotted in Figure 2a and Figure 2b are provided in Table 9 and