Leveraging User Paraphrasing Behavior In Dialog Systems To Automatically Collect Annotations For Long-Tail Utterances

In large-scale commercial dialog systems, users express the same request in a wide variety of alternative ways with a long tail of less frequent alternatives. Handling the full range of this distribution is challenging, in particular when relying on manual annotations. However, the same users also provide useful implicit feedback as they often paraphrase an utterance if the dialog system failed to understand it. We propose MARUPA, a method to leverage this type of feedback by creating annotated training examples from it. MARUPA creates new data in a fully automatic way, without manual intervention or effort from annotators, and specifically for currently failing utterances. By re-training the dialog system on this new data, accuracy and coverage for long-tail utterances can be improved. In experiments, we study the effectiveness of this approach in a commercial dialog system across various domains and three languages.


Introduction
A core component of voice- and text-based dialog systems is a language understanding component, responsible for producing a formal meaning representation of an utterance that the system can act upon to fulfill a user's request (Tur and De Mori, 2011). This component is often modeled as the combination of two tasks: intent classification (IC), determining which intent from a set of known intents is expressed, and slot labeling (SL), finding sequences of tokens that express slots relevant for the intent. As an example, consider the utterance Play Blinding Lights: it expresses the PlayMusic intent, the tokens Blinding Lights refer to the Song slot, and Play expresses no slot.
As in many other language processing tasks, an important challenge arises from the fact that natural language allows one to express the same meaning in many different ways. In commercial systems at the scale of Apple's Siri, Amazon's Alexa or Google's Assistant, the variety of alternative utterances used to express the same request is immense, stemming from the scale of the system, the heterogeneous language use among users and noise such as speech recognition errors (Muralidharan et al., 2019). In addition, the frequency distribution of alternative utterances follows a power law, such that a system can handle a substantial part of the distribution by understanding just a few utterances, but needs to understand orders of magnitude more to also cover the long tail of the distribution. Although utterances from the tail occur rarely, humans have little difficulty understanding them, imposing the same expectation on a dialog system with true language understanding. However, the traditional way of building a language understanding component, supervised learning with manually annotated examples, does not scale to this challenge, as the necessary annotation effort becomes prohibitively expensive.
We propose to leverage implicit user feedback to reduce the need for manual annotation and thereby scale language understanding in dialog systems to more long-tail utterances. In this paper, we focus on cases where a user's utterance has not been correctly interpreted by the system, causing friction for the user, and the user gives implicit feedback by paraphrasing the utterance (see Figure 1). Such behavior is common in commercial systems, where many users make repeated attempts to receive a response. Predicted intents and slots of the eventually successful utterance can in these cases be used to infer labels for the friction-causing utterances. Our proposed method, MARUPA (Mining Annotations from User Paraphrasing), combines paraphrase detection (PD), friction detection (FD) and label projection (LP) models to automatically detect relevant paraphrasing behavior in interactions with a dialog system and to turn it into new, annotated utterances. The proposed data collection approach has several advantages: First, it is fully automatic, thus reducing effort and cost for annotation and making data collection scalable. Second, the friction-causing utterances that are followed by paraphrases tend to be the less frequent ones from the long tail, such that the collected data helps in particular to increase coverage and improve accuracy on that part of the distribution. And third, using user feedback to annotate data integrates seamlessly with supervised learning and is agnostic to the type of model being used for IC and SL. It requires only user feedback and predictions of the underlying model. It can also be applied repeatedly with improving underlying models and is essentially an extension of self-learning, as we discuss in Section 2.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Figure 1: During the interaction (left), the first and third user utterances, exemplary for noisy and long-tail utterances, are mis-interpreted by the system. The final, more common paraphrase succeeds. MARUPA (right) uses PD and FD to detect two paraphrase pairs with a friction-causing and a successful utterance (1+4 and 3+4). For both, the successful interpretation is projected onto the failed utterance (LP). Utterance 2 is not used. After aggregation across interactions, the new data is used to re-train the model.
We present experiments with MARUPA on anonymized data from a commercial dialog system across many domains and three different languages. In addition, we also report component-wise evaluations and experiments on public datasets where possible. We observe improvements of IC and SL performance when using the collected data for training, demonstrating that the approach can effectively collect useful annotated utterances automatically from user paraphrasing behavior.

Related Work
Intent classification (IC) and slot labeling (SL) have been studied for several decades as fundamental building blocks of task-oriented dialog systems, dating back at least to the creation of the ATIS corpus (Price, 1990). In recent years, progress has been made by applying deep learning, such as using recurrent networks (Mesnil et al., 2013), jointly learning both tasks (Zhang and Wang, 2016; Hakkani-Tür et al., 2016) and improving knowledge-sharing between them (Goo et al., 2018; E et al., 2019).
Methods to leverage user feedback for IC and SL have received less attention in the past, presumably because such work is difficult to perform without access to a deployed dialog system. Most similar to ours is the work of Muralidharan et al. (2019), which proposes to use positive and negative user feedback to automatically turn cheap-to-obtain coarse-grained annotations for SL into more fine-grained ones. Coarse- and fine-grained annotations are used together in a multi-task learning setup. They apply that idea to playback requests for music, and rely on whether a user kept a song playing for more than 30 seconds as the feedback signal. The main differences are that MARUPA is not music-specific but domain-agnostic, that it does not require coarse-grained annotations to start with, and that it integrates more seamlessly with existing IC/SL models, as it works in the same label space and requires no multi-task learning. We also report experiments across many domains and different languages.
Other related work is presented by Ponnusamy et al. (2020), who describe an online method to make use of paraphrasing behavior: They use absorbing Markov chains to distill common successful paraphrase pairs from paraphrasing behavior across many users and use them to re-write utterances at runtime, similar to query rewriting in search engines. While focusing on the same user behavior, our method is mostly orthogonal to that work, as it focuses on simplifying the utterance given a fixed IC/SL model while we aim to improve the IC/SL model. In less similar work, Qiu et al. (2019) propose a method to leverage paraphrase relationships between utterances to improve IC. Different from ours and the aforementioned methods, they try to find pairs of paraphrases among large, unstructured collections of unlabeled utterances rather than relying on the dialog flow and feedback given by users. Moreover, Yaghoub-Zadeh-Fard et al. (2019) study utterance paraphrases collected via crowdsourcing and develop methods to address common paraphrasing errors that occur in such crowdsourced data. Since our method relies on naturally occurring paraphrases, it does not face these challenges.
Beyond the IC and SL tasks of task-oriented dialog, incorporating user feedback is much more common for chit-chat dialog systems and for learning dialog management policies. For the latter, reinforcement learning based on simulated or real user feedback has been studied for a long time (Levin et al., 2000; Schatzmann et al., 2006). Similarly, chit-chat dialog systems are usually trained end-to-end with reinforcement learning, and estimated user satisfaction has been part of proposed reward functions (Serban et al., 2018). Hancock et al. (2019) propose a dialog agent that incorporates feedback via supervised multi-task learning and gathers the feedback by learning to explicitly ask a user for it. Through online learning, the feedback is incorporated while the system is actively used. Another line of work focuses on the detection of user feedback itself, rather than also trying to use it. Several models have been proposed to find positive and negative feedback utterances or to predict a user's emotion (Wang et al., 2017; Hashimoto and Sassano, 2018; Eskenazi et al., 2019; Ghosal et al., 2019).
Our work is also related to the semi-supervised learning approach known as self-learning or pseudo-labeling, in which models are trained on predictions that a previous version of the model made on unlabeled data. This idea has been successfully applied to a wide range of language tasks, e.g. named entity recognition (Collins and Singer, 1999), word sense disambiguation (Mihalcea, 2004) and parsing (McClosky et al., 2006). Successful applications are also known from computer vision. For task-oriented dialog systems, substantial error reductions have been reported from using self-learning to bootstrap new features. While the use of MARUPA-labeled data for training is at its core self-learning, our approach goes further by leveraging user feedback in the form of paraphrasing and friction as additional signals to guide the self-learning process.

MARUPA
Before describing our approach, we start by defining important terminology. An utterance u is a sequence of tokens, directed from a user to a dialog system. Annotated utterances are triples (u, i, s) with an intent label i and slots s. Each slot has a slot name and a slot value, the latter being a span of tokens from the utterance. As an example, consider u = "play blinding lights by the weekend" with intent i = PlayMusic. Slots can be written as a mapping s = {Artist: "the weekend", Song: "blinding lights"} or as token-level labels "O B-Song I-Song O B-Artist I-Artist" using BIO encoding.
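For concreteness, the triple representation and its BIO rendering can be sketched as follows (an illustrative data structure; the class and method names are ours, not part of MARUPA):

```python
from dataclasses import dataclass

@dataclass
class AnnotatedUtterance:
    tokens: list   # utterance tokens
    intent: str    # intent label i
    slots: dict    # slot name -> (start, end) token span, end exclusive

    def bio_labels(self):
        """Render the slot annotation as token-level BIO labels."""
        labels = ["O"] * len(self.tokens)
        for name, (start, end) in self.slots.items():
            labels[start] = "B-" + name
            for i in range(start + 1, end):
                labels[i] = "I-" + name
        return labels

u = AnnotatedUtterance(
    tokens="play blinding lights by the weekend".split(),
    intent="PlayMusic",
    slots={"Song": (1, 3), "Artist": (4, 6)},
)
print(u.bio_labels())  # ['O', 'B-Song', 'I-Song', 'O', 'B-Artist', 'I-Artist']
```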
MARUPA relies on paraphrase detection (PD), friction detection (FD) and label projection (LP) components as illustrated in Figure 1. Given an interaction between a user and the dialog system, consisting of a sequence of user utterances u along with intents and slots (î, ŝ) predicted by the dialog system, we find all pairs (u_j, u_k), where u_j occurs before u_k, that satisfy the following condition:

t(u_k) − t(u_j) ≤ δ  ∧  PD(u_j, u_k) = 1  ∧  FD(u_j) = 1  ∧  FD(u_k) = 0  ∧  LP(u_j, u_k) ≥ θ    (1)

where t(·) denotes an utterance's timestamp. It captures pairs that occur within a maximum time distance δ, that are paraphrases according to a classifier PD(u_j, u_k) ∈ {0, 1} and that have the desired friction classifications FD(·) ∈ {0, 1}, i.e. u_j causing friction but not u_k. The threshold θ is imposed on the label projection score LP(u_j, u_k) ∈ [0, 1], defined later in Equation 5, to ensure that unsuccessful projections are discarded. Each pair that satisfies Equation 1 yields a new training example for IC and SL; these examples are finally aggregated. In the following, we describe each of the involved components in detail, starting with PD (Section 3.1), then FD (Section 3.2), LP (Section 3.3) and aggregation (Section 3.4).
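The pair selection satisfying Equation 1 can be sketched as a simple predicate over an interaction, with the trained PD, FD and LP components passed in as plain functions (a minimal sketch; all names and the toy stand-in classifiers below are ours):

```python
def mine_pairs(turns, pd_model, fd_model, lp_score, delta=30.0, theta=0.4):
    """Collect all utterance pairs of one interaction that satisfy the
    selection condition: close in time, classified as paraphrases, friction
    on the earlier utterance but not the later one, and a label projection
    score above the threshold."""
    pairs = []
    for j, (u_j, t_j) in enumerate(turns):
        for u_k, t_k in turns[j + 1:]:
            if (t_k - t_j <= delta
                    and pd_model(u_j, u_k) == 1
                    and fd_model(u_j) == 1
                    and fd_model(u_k) == 0
                    and lp_score(u_j, u_k) >= theta):
                pairs.append((u_j, u_k))
    return pairs

# Toy stand-ins for the trained components:
pd = lambda a, b: int(set(a.split()) <= set(b.split()))         # crude paraphrase test
fd = lambda u: int(u != "play blinding lights by the weekend")  # only the full request succeeds
lp = lambda a, b: 1.0

turns = [("play blinding lights", 0.0),
         ("turn up the volume", 5.0),
         ("play blinding lights by the weekend", 12.0)]
print(mine_pairs(turns, pd, fd, lp))
```

The defaults δ = 30s and θ = 0.4 mirror the values used later in the experiments; on the toy interaction, only the pair of the first and third utterance passes all checks.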

Paraphrase Detection
Sentential paraphrases are commonly defined as different sentences with the same, or very similar, meaning (Dolan and Brockett, 2005;Ganitkevitch et al., 2013). Paraphrase detection is the corresponding binary classification task on sentence pairs. In this work, we apply it to pairs of dialog system utterances.
Compared to paraphrase detection in general, our context has one advantage: The notion of meaning, on which the definition of paraphrase relies, can be formalized precisely with regard to the downstream task, i.e. by relying on intents and slots. If two utterances are annotated with the same intent and the same set of slots, we consider them to be paraphrases. On the other hand, the context also brings several challenges. While paraphrases can differ strongly in their surface form, non-paraphrases can be extremely similar, such as set volume to 3 and set volume to 6. In addition, speech recognition and endpointing errors make the utterances noisy and the intended meaning often hard to recover. Crucially, in-domain paraphrase detection data is needed to address those challenges, but to the best of our knowledge, no such dataset exists. To overcome this issue, we propose an automatic corpus creation method.
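This annotation-based notion of paraphrase is easy to state over the (u, i, s) triples (a sketch; the function name is ours):

```python
def are_paraphrases(a, b):
    """Two annotated utterances (u, i, s) count as paraphrases iff they
    share the intent and the full slot mapping (names and values)."""
    (_, intent_a, slots_a), (_, intent_b, slots_b) = a, b
    return intent_a == intent_b and slots_a == slots_b

pos_a = ("play blinding lights by the weekend", "PlayMusic",
         {"Song": "blinding lights", "Artist": "the weekend"})
pos_b = ("the weekend blinding lights please", "PlayMusic",
         {"Song": "blinding lights", "Artist": "the weekend"})
neg_a = ("set volume to 3", "SetVolume", {"Level": "3"})
neg_b = ("set volume to 6", "SetVolume", {"Level": "6"})
print(are_paraphrases(pos_a, pos_b), are_paraphrases(neg_a, neg_b))  # True False
```

Note that comparing full slot mappings is what makes the near-identical volume utterances non-paraphrases: they differ only in the slot value.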
Corpus Creation To overcome the lack of in-domain paraphrase datasets, we propose a corpus creation approach that automatically derives paraphrase pairs from annotated utterances as used for IC and SL. Such data is necessarily available in our context and using it ensures that the paraphrase pairs resemble the domain of interest. The core idea is that two utterances with the same signature and same slot values but different carrier phrases must be paraphrases. We define the signature of an annotated utterance to be the set of its intent and slot names, e.g. {PlayMusic, Artist, Song}. The carrier phrase of an utterance is the utterance without slot values, e.g. "play $Song by $Artist".
Algorithm 1 shows the corpus creation procedure. Given annotated utterances and a target size, it repeatedly samples signatures occurring in the utterances. To create a positive paraphrase example, we sample two different carrier phrases for that signature and inject the same sampled slot values into them. As an example, consider the signature {PlayMusic, Artist, Song}. We obtain a positive pair by sampling slot values {Artist: "the weekend", Song: "blinding lights"} and the following two carrier phrases:

play $Song by $Artist → play blinding lights by the weekend    (2a)
$Artist $Song please → the weekend blinding lights please    (2b)

The more interesting part is how negative pairs are sampled, which are crucial to make the task non-trivial for paraphrase models. We sample a second, very similar signature from the k nearest signatures in lines 10 and 11, using the Jaccard distance between signatures σ_a and σ_b. To continue with the previous example, consider the second signature {PlayMusic, Song, Room} with a Jaccard distance of 0.33 that leads, when sampling carrier phrases, to a negative utterance pair.

Algorithm 1 Paraphrase Corpus Creation
Input: Set D of triples (u, i, s), size n, nearest k
Output: Set P of triples (u, u′, l) with l ∈ {0, 1}
1: P = ∅
2: for n/2 iterations do
3:    Sample a signature σ from all in D.
4:    Sample a carrier phrase c for σ.
5:    Sample another carrier phrase c_pos ≠ c for σ.
6:    Sample values for all slots in σ.
7:    Inject the values into c and c_pos to obtain u and u_pos.
8:    Add (u, u_pos, 1) to P.
9:    Compute the Jaccard distance from σ to all other signatures in D.
10:   Keep the k nearest signatures.
11:   Sample a signature σ_neg from them.
12:   Sample a carrier phrase c_neg for σ_neg.
13:   Inject values from 7, plus samples for new slots in c_neg, into c and c_neg to obtain u and u_neg.
14:   Add (u, u_neg, 0) to P.
15: end for
16: Return P.
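Algorithm 1 can be sketched in Python roughly as follows (an illustrative reimplementation, not the production code; representing signatures as frozensets and carrier phrases as $Slot templates is our choice):

```python
import random

def jaccard_distance(sig_a, sig_b):
    """Jaccard distance between two signatures (sets of intent and slot names)."""
    return 1.0 - len(sig_a & sig_b) / len(sig_a | sig_b)

def create_corpus(carrier_phrases, slot_values, n, k, seed=0):
    """Sketch of Algorithm 1. `carrier_phrases` maps a signature (frozenset)
    to its carrier phrases, `slot_values` maps a slot name to example values.
    Returns (utterance, utterance, label) triples; label 1 means paraphrase."""
    rng = random.Random(seed)
    signatures = list(carrier_phrases)

    def inject(carrier, values):
        for slot, value in values.items():
            carrier = carrier.replace("$" + slot, value)
        return carrier

    corpus = []
    for _ in range(n // 2):
        # Positive pair: two carrier phrases of the same signature, same values.
        sig = rng.choice(signatures)
        c, c_pos = rng.sample(carrier_phrases[sig], 2)
        values = {s: rng.choice(slot_values[s]) for s in sig if s in slot_values}
        corpus.append((inject(c, values), inject(c_pos, values), 1))
        # Negative pair: a very similar signature from the k nearest ones.
        nearest = sorted((s for s in signatures if s != sig),
                         key=lambda s: jaccard_distance(sig, s))[:k]
        sig_neg = rng.choice(nearest)
        c_neg = rng.choice(carrier_phrases[sig_neg])
        extra = {s: rng.choice(slot_values[s]) for s in sig_neg - sig if s in slot_values}
        corpus.append((inject(c, values), inject(c_neg, {**values, **extra}), 0))
    return corpus
```

Each iteration yields one positive and one negative pair, so n/2 iterations produce a corpus of size n, as in the algorithm.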
Model For paraphrase detection, we rely on BERT (Devlin et al., 2019), a pre-trained neural model that reached state-of-the-art performance on the common paraphrase detection benchmarks QQP and MRPC. We combine paraphrase pairs into single sequences as in the BERT paper, e.g. "[CLS] play blinding lights [SEP] blinding lights please [SEP]". The model encodes the sequence and feeds the output through a 256-d ReLU layer followed by a binary classification layer. The whole model is fine-tuned on our paraphrase corpus to ensure it addresses the unique challenges of dialog. To support different languages, we use a multilingually pre-trained BERT model (Devlin et al., 2019) and fine-tune it on data in the target language.

Friction Detection
Friction occurs for a user of a dialog system if their utterance is not correctly interpreted and the system does not show the desired reaction. It is thus closely related to work on user satisfaction and user feedback detection discussed in Section 2. While friction can in general be caused by many parts of a dialog system -speech recognition, entity resolution, request fulfillment -we are particularly interested in friction caused by IC and SL for our data collection. Therefore, given a small amount of hand-labeled friction examples, we train a model that can automatically detect such friction cases.
Given an utterance u and predictions î and ŝ, we model friction detection as a binary classification task. We use a linear SVM for classification. Our set of binary features (see Table 1) captures the utterance itself, the intent and slots predicted for it, utterance-level confidence scores as well as status codes received from downstream components that try to act upon the provided interpretation. Since these features capture only the utterance and how the dialog system reacts to it, but no additional feedback a user might give afterwards, this modeling approach requires the friction detector to learn where the current limitations of the underlying model lie, rather than relying on explicit feedback. In fact, we found that confidences and status codes from fulfillment components are strong predictors for that. Although other work found it beneficial to include the user's next utterance in similar tasks (Eskenazi et al., 2019), we observed no improvements in classification performance when doing so.
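A binary feature map of this kind could be built as follows (an illustrative subset of the feature groups in Table 1; the feature names, the confidence threshold and the status code string are our hypothetical choices):

```python
def friction_features(utterance, intent, slots, confidence, status_codes,
                      low_confidence=0.5):
    """Binary feature map for a friction classifier: tokens, predicted
    intent and slots, a thresholded confidence score, and status codes
    from downstream fulfillment components."""
    features = {"intent=" + intent: 1,
                "low_confidence": int(confidence < low_confidence)}
    for token in utterance.split():
        features["token=" + token] = 1
    for slot in slots:
        features["slot=" + slot] = 1
    for code in status_codes:
        features["status=" + code] = 1
    return features

f = friction_features("play blinding lights", "PlayMusic",
                      {"Song": "blinding lights"}, confidence=0.31,
                      status_codes=["FULFILLMENT_ERROR"])
print(f["low_confidence"], "status=FULFILLMENT_ERROR" in f)  # 1 True
```

In practice, such sparse maps would be vectorized, e.g. via feature hashing, before training the linear SVM.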

Label Projection
Once we have found a paraphrase pair (u, u′) with predictions î, ŝ and î′, ŝ′, of which the first caused friction but not the second, we want to use the successful interpretation î′, ŝ′ with the utterance u as a new training example. While this is trivial for intents, it is more intricate for slot labels, as they are assigned per token and the tokens differ between u and its paraphrase u′. See Figure 2 for an example. Nevertheless, many tokens typically overlap, which motivates our token alignment-based approach inspired by Fürstenau and Lapata (2009).

Figure 2: Alignment-based slot label projection.

For each source-target token pair (u_j, u′_k), we first compute a similarity sim(u_j, u′_k) ∈ [0, 1]. Based on these pairwise similarities, we then consider all alignments of the n source tokens in u to the m targets in u′ and score them by the average similarity of the chosen alignments:

LP(u, u′) = max_x (1/n) Σ_j Σ_k x_jk · sim(u_j, u′_k)    (5)

where x_jk is a binary decision variable representing the alignment of source token u_j to target token u′_k. We enforce that all source tokens must have an alignment and that only one source can align to one target, but allow alignment to a virtual empty target as a fallback. We experimented with different similarities and optimization methods. For MARUPA, we settled on using as similarity the inverse of the character-level Levenshtein edit distance normalized by length, which works well to identify identical tokens or slight morphological variations. The optimization is done greedily, choosing the most similar target for each source from left to right. As we show in experiments in Section 5, this choice offers a good trade-off between alignment accuracy and runtime. Finally, given the alignment, we project the slot labels ŝ′ from u′ onto u accordingly (see Figure 2).
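A minimal sketch of greedy, edit distance-based label projection (our illustrative reimplementation, not the production code; here the labeled tokens of the successful utterance act as sources and are aligned to the tokens of the failed one):

```python
def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Length-normalized edit similarity in [0, 1]."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def project_labels(source_tokens, source_labels, target_tokens):
    """Greedy left-to-right alignment of labeled source tokens to targets;
    a source whose best similarity is 0 falls back to the virtual empty
    target, and unaligned target tokens keep the 'O' label."""
    projected = ["O"] * len(target_tokens)
    free = set(range(len(target_tokens)))
    for token, label in zip(source_tokens, source_labels):
        if not free:
            break
        best = max(free, key=lambda k: similarity(token, target_tokens[k]))
        if similarity(token, target_tokens[best]) > 0.0:
            projected[best] = label
            free.remove(best)
    return projected
```

On the running example, the BIO labels of "play blinding lights by the weekend" transfer cleanly onto the paraphrase "the weekend blinding lights please" because identical tokens receive similarity 1.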

Aggregation
Once PD and FD have identified relevant utterance pairs and LP has projected slot labels, we obtain new annotated training examples. However, due to noise in the source data and noise introduced by the components of our method, they can be of varying quality. We therefore introduce a final aggregation step that takes examples collected from many user interactions and filters them by consistency. We keep examples only if the same annotation is created for a specific utterance repeatedly and consistently. In addition, we also test the effect of adding the training examples to the IC/SL model in terms of accuracy changes on a development set and filter out examples for intent/slot-combinations with accuracy degradations. Note also that the conditions in Equation 1 are conjunctive, allowing us to enforce them in an arbitrary order on potential utterance pairs. In practice, this enables enforcing the least resource-intensive checks first, reducing the number of pairs that have to be processed by more time-consuming methods such as paraphrase classification with the fine-tuned BERT model.
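The consistency filter can be sketched as follows (thresholds of at least 10 occurrences and 80% agreement as used in the experiments; the function and variable names are ours):

```python
from collections import Counter

def aggregate(examples, min_count=10, min_consistency=0.8):
    """Keep one annotation per utterance, but only if the utterance was
    collected often enough and its annotations agree often enough.
    `examples` is a list of (utterance, annotation) pairs with hashable
    annotations."""
    by_utterance = {}
    for utterance, annotation in examples:
        by_utterance.setdefault(utterance, Counter())[annotation] += 1
    kept = {}
    for utterance, counts in by_utterance.items():
        annotation, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_count and top / total >= min_consistency:
            kept[utterance] = annotation
    return kept
```

Rare utterances and utterances with conflicting annotations are dropped, which filters out much of the noise introduced by imperfect PD, FD and LP decisions.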

Experimental Results
In this section, we present experiments with MARUPA in the context of a commercial dialog system across multiple languages. They demonstrate that the newly collected training examples can help to improve the IC and SL performance, in particular on long-tail utterances.

Setup
We experiment with German, Italian and Hindi utterances. In each case, we use the production IC/SL model of the dialog system as the baseline and compare it against a model re-trained after adding our additional training examples. We report relative changes of a combined IC/SL error rate. In each language, the train and test data contains several hundred different intents and slots.
To collect new training examples, we apply MARUPA to a sample of anonymized user interactions with the dialog system. We set δ = 30s and θ = 0.4 in Equation 1. For paraphrase detection, we use Algorithm 1 to derive a corpus of 300k paraphrase pairs with k = 30 from the training data of the baseline model, and split it into train, validation and test sets (70/10/20). As discussed in Section 3.1, we fine-tune BERT-Base Multilingual Cased. We use a batch size of 64 and dropout of 0.2. The model is trained using Adam with a learning rate of 5·10⁻⁵ and with early stopping. For friction detection, we use the model described in Section 3.2 and tune the SVM's regularization constant via grid search. Labels are projected as described in Section 3.3. Finally, we aggregate the collected examples by keeping only utterances that occur at least 10 times and for which the same annotation was created in at least 80% of the cases. In addition, we filter out collected examples belonging to intent/slot-combinations for which the error rate on the IC/SL development set increases after retraining the model.

Table 2 shows the results of our experiment for German, Italian and Hindi. We evaluate the model's error rate on its main test set, obtained via manual annotation, and on a set of held-out MARUPA-collected examples (20%). For the former, we break down results by utterance frequency. While the overall change on the main test set (Total) is negligible, the breakdown reveals that for low-frequency utterances the new data leads to error rate reductions of up to 3%. As desired, this comes with no negative effects for high-frequency utterances. Even stronger error reductions can be seen for the held-out MARUPA examples, on which the baseline's error rate is by design high. We observed that many collected utterances are so rare that they are not captured even by the lowest-frequency bin of the main test set, making it difficult to assess the full impact of MARUPA.
Across languages, we observe that for newer languages like Hindi, fewer useful examples can be collected and their quality is lower, because fewer user interactions are available and the underlying IC/SL model is less mature. Overall, the results demonstrate that MARUPA can improve accuracy on long-tail utterances.

Analysis
In addition to the previous section, which presented our main experimental results, the following sections show further experiments to provide additional insights into our proposed method.

Application to Public Datasets
Since the core of our approach is to leverage user paraphrasing behavior, it is difficult to conduct experiments on academic datasets for IC and SL, which to the best of our knowledge contain neither friction nor paraphrasing examples. However, we try to replicate our application scenario as closely as possible.
Setup We use SNIPS (Coucke et al., 2018), a common benchmark with 14,484 English utterances annotated for IC and SL. We sample a subset of 10% of the training set as labeled data and treat the remaining part as unlabeled. On the unlabeled part, we create paraphrase pairs re-using ideas from Section 3.1: We group all utterances by signature and compute the frequency of unique carrier phrases. Per signature, we use all utterances from the 70%-head of the carrier phrase distribution and, for each, create a paraphrase by sampling another carrier phrase from the remaining 30%-tail and injecting the same slot values. This process turns the unlabeled data into pairs of paraphrases, each pair consisting of a less common (tail) and a more common (head) utterance, similar to our real-world friction scenario. For SNIPS, the process yields 1,309 labeled utterances and 7,944 unlabeled paraphrase pairs. Note that we can only apply FD and LP of MARUPA, but not PD, as paraphrase pairs are already given in this setup. For IC and SL, we train a joint neural model that encodes utterances with ELMo embeddings (Peters et al., 2018) and then feeds them through a BiLSTM. Slot labels are predicted from each hidden state and intent labels from the mean of them. We use 1024-d ELMo embeddings, a single-layer 200-d BiLSTM and dropout of 0.2 after both. The model is trained with Adam, batch size 16 and early stopping. For evaluation (and stopping) we use the original test (and validation) set of SNIPS.

Table 3: Experiments with unlabeled head/tail-paraphrase pairs created for SNIPS (English), comparing label projection to self-labeling and selection by friction to selection by confidence. Metrics are accuracy (IC), slot-based F1 (SL) and full frame accuracy (IC+SL). All results are averages of 10 re-runs.
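The head/tail pair construction can be sketched as follows (an illustrative reimplementation of the setup described above; the data layout with $Slot carrier phrases is our choice):

```python
import random
from collections import Counter, defaultdict

def head_tail_pairs(data, head_share=0.7, seed=0):
    """For each signature, split carrier phrases into a frequency head and
    tail, then pair every head utterance with a sampled tail paraphrase
    carrying the same slot values. `data` holds triples of
    (signature, carrier_phrase, slot_values)."""
    rng = random.Random(seed)

    def inject(carrier, values):
        for slot, value in values.items():
            carrier = carrier.replace("$" + slot, value)
        return carrier

    by_sig = defaultdict(list)
    for sig, carrier, values in data:
        by_sig[sig].append((carrier, values))

    pairs = []
    for sig, items in by_sig.items():
        counts = Counter(c for c, _ in items).most_common()
        total = sum(n for _, n in counts)
        head, cumulative = set(), 0
        for carrier, n in counts:   # carriers covering the top 70% of mass
            if cumulative / total < head_share:
                head.add(carrier)
            cumulative += n
        tail = [c for c, _ in counts if c not in head]
        if not tail:
            continue
        for carrier, values in items:
            if carrier in head:
                tail_carrier = rng.choice(tail)
                pairs.append((inject(carrier, values),
                              inject(tail_carrier, values)))
    return pairs
```

Each resulting pair consists of a frequent (head) utterance and a rare (tail) paraphrase, mimicking the successful and friction-causing utterances of the real-world scenario.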
Results Table 3 shows results for various ways to leverage the unlabeled paraphrase pairs, including using FD and LP of MARUPA (see row 5). The baseline (1) uses only labeled data, and all other methods make use of this model's predictions on the unlabeled pairs. The simplest approach of ingesting all self-labeled data back into training, for just head utterances (2.1) or head and tail (2.2), already boosts performance on both tasks. However, when using label projection from self-labeled head utterances to tail utterances (2.3), the performance is even better, demonstrating the effectiveness of label projection.
To apply FD in this setup, we train the IC/SL model with cross-validation on the labeled data and then train the detector on the out-of-fold predictions. If we apply label projection only to paraphrase pairs selected by FD (3.2), i.e. pairs with friction for the tail but not for the head, we see that projection outperforms self-labeling (3.1) on these cases even more clearly, as the tail's self-label is expected to cause friction. Finally, we compare selecting paraphrase pairs by FD to selecting based on the baseline model's confidence, a common approach in self-learning (see Section 2). We tune a confidence threshold on the validation set. Selecting with FD is superior both when using only head utterances (4.1 vs. 4.2) and when including tail utterances with label projection (4.3 vs. 5). Thus, we conclude that our self-learning approach based on friction detection is effective also beyond our exact application scenario.

Quality of Derived Paraphrase Corpora
Next, we give more insight into the effectiveness of the proposed paraphrase corpus creation. We include public datasets, as paraphrase detection was not part of the previous experiment.

Setup We use our internal IC/SL training data for German, Italian and Hindi as well as the commonly used SNIPS and ATIS (Price, 1990) corpora (both English) to derive paraphrase pairs, and fine-tune mBERT models on them as outlined in Section 4.1. For each, we report paraphrase detection accuracy evaluated on the 20% test split of the paraphrase pairs. For ATIS and SNIPS, due to their small size, we derive corpora with 10k pairs and k = 10. As baselines, we include simple threshold-based classifiers using character-level edit distance or token-level Jaccard similarity, and a linear SVM using binary uni- and bigram features, AND- and XOR-combined between the two utterances.
Results Table 4 shows paraphrase detection results. We observe that the simple baselines relying on surface-level similarities perform much worse than BERT, which demonstrates that our approach of selecting negative examples with high similarity (see Section 3.1) is effective for creating a challenging corpus. We saw in preliminary experiments that sampling negative pairs fully randomly makes the baselines perform much better. Furthermore, we observe that the trained SVM performs much better than the simple baselines, but using a fine-tuned BERT model substantially improves upon it. Finally, this experiment also shows that our approach can be applied across different languages and domains.

Analysis of Friction Detection
For friction detection, we report classification performance for the task in Table 5, evaluated on a held-out set of the same data the models are trained on for MARUPA. Performance, and in particular precision, is generally high across languages, showing that the model can reliably find many friction cases. This also confirms that the approach is largely language-independent and works for all of the tested languages. An interesting trend is that friction detection, using the set of features proposed in this work, tends to become more difficult the more mature the underlying IC/SL model for the language is, indicating that the causes of friction that are easiest to detect are also resolved the earliest.
In Table 6, we show ablation results for the friction detector trained on German utterances. Removing any of the features decreases F1-scores by at least 2 points, showing that they all contribute to the task. Among them, the status codes received from the fulfillment component are by far the most useful features. This underlines the point made earlier that the dialog system is already aware of many friction cases and does not necessarily require explicit or implicit feedback from the user to detect friction; but we do require feedback in terms of paraphrasing to go further and avoid the friction in the future.

Table 7: Label projection accuracy and runtime across similarity measures and optimization strategies.

Quality vs. Runtime Trade-Offs in Label Projection
Finally, Table 7 shows different instantiations of our label projection approach, tested on a dataset of 11k German utterance pairs for which reference projections have been derived from gold IC/SL annotations. We report whether the projection matches the reference completely (Exact Match) and F1-scores over token-level matches (Token F1). We observed that using word embeddings to compute similarities provided no improvements, as cases where they would be needed, i.e. aligning synonyms, are rare in the data. Exact optimization using linear programming, on the other hand, does improve the projection regardless of the similarity used, but comes with a large increase in runtime. Trading off quality and runtime, we therefore rely on greedy optimization with character-level edit distance in MARUPA.

Conclusion
We proposed MARUPA, an approach to collect annotations for friction-causing long-tail utterances automatically from user feedback by using paraphrase detection, friction detection and label projection. We demonstrated that this form of user feedback-driven self-learning can effectively improve intent classification and slot labeling in dialog systems across several languages and on SNIPS. In the future, we plan to integrate more advanced aggregation methods to reduce noise and to more closely study the effect of the feedback loops that emerge when our data collection and re-training are applied repeatedly.