Improving Spoken Language Understanding by Wisdom of Crowds

Spoken language understanding (SLU), which converts user requests in natural language into machine-interpretable expressions, is becoming an essential task. Because existing SLU systems are based on statistical approaches, the lack of training data is an important problem, especially for new system tasks. In this paper, we propose to use two sources of the “wisdom of crowds,” crowdsourcing and a knowledge community website, to improve the SLU system. We first collect paraphrasing variations for new system tasks through crowdsourcing as seed data, and then augment them using similar questions from a knowledge community website. We investigate the effects of the proposed data augmentation method on the SLU task, even with small seed data. In particular, the proposed architecture added more than 120,000 augmented samples, which improved SLU accuracy.


Introduction
Recent advances in speech applications running on smartphones and smart speakers have increased the importance of spoken language understanding (SLU). SLU is the task of predicting an appropriate system function and its arguments, given a user request written or spoken in natural language. Various SLU benchmarks have been proposed: Air Travel Information Services (ATIS) (Dahl et al., 1994), restaurant information navigation, and other speech applications (Hori et al., 2019).
Adaptation of SLU to newly defined tasks is an important problem. The amount of training data directly affects SLU accuracy because most recent SLU systems are based on statistical machine learning approaches. Some existing work has tackled this problem with transfer learning (Wu et al., 2019), which uses a model pre-trained on data from a different domain. However, it is still challenging to make SLU accurate with little or no data. Data augmentation, which generates pseudo training samples, has been applied to solve this problem (Hou et al., 2018; Yoo et al., 2019). However, such methods often generate unnatural training samples that decrease SLU accuracy. Another problem is the ambiguity of user utterances; it is difficult to generate such ambiguous examples with generative approaches. Some existing work has tackled this problem by using paraphrasing models (Saha et al., 2018; Ray et al., 2018).
On the other hand, collecting text data from the Web is a widely used approach to building language models for automatic speech recognition systems (Bulyko et al., 2003; Sarikaya et al., 2005; Ng et al., 2005; Tsiartas et al., 2010). Web texts are expected to be more natural than generated pseudo sentences because most of them are written by people. However, Web texts cover diverse domains; thus, we need criteria for selecting the texts to be used as augmented training data. Test-set perplexity (Misu and Kawahara, 2006) and semantic similarity (Hakkani-Tur and Rahim, 2006; Yoshino et al., 2013) have been widely used as criteria for selecting sentences for training data augmentation. Such a selective approach using large-scale Web data has also been applied to the data augmentation of dialogue systems (Du and Black, 2018; Henderson et al., 2019).
Crowdsourcing is a common way to collect human-annotated data at low cost (Zhao et al., 2011; Mozafari et al., 2014). However, accurate SLU systems based on neural networks require a large-scale dataset. It is not easy to collect a sufficient amount of training data using crowdsourcing alone, even though crowdsourcing costs less than conventional annotation.
In this paper, we utilize two sources of the "wisdom of crowds" to collect a large-scale dataset for training the SLU system. We first collect a small amount of seed data by crowdsourcing and then augment the dataset with similar texts extracted from the Web. As the target Web texts for the extraction, we focus on an online knowledge community website as another source of the "wisdom of crowds." Online knowledge community websites often contain qualified question-style sentences. We choose sentences similar to the seed data from these qualified sentences and use them for SLU training. We conducted experiments to investigate the relationship between the accuracy of SLU systems and training datasets augmented with sentences from the knowledge community website. As a result, over 120,000 sentences were extracted from the Web, and we obtained a 35-point improvement in the domain-selection accuracy of SLU.

Spoken language understanding based on crowdsourcing
Our task is to develop an SLU system for a new domain with no available resources. We build a small amount of training data via crowdsourcing and then use the collected data as the seed for data augmentation. In this section, we describe the task definition of SLU and the seed data collection using crowdsourcing.

Task definition
The task of SLU is defined as the prediction of a dialogue frame $F$ given a user request $X$ consisting of a word sequence $x_1, x_2, \ldots, x_n$. The dialogue frame expresses the user intent in terms of domain, category, and query. The domain indicates a dialogue topic corresponding to a system function. We defined six domains in this work: "video", "weather", "news", "map", "shop" and "recipe". The category is a refined dialogue topic whose members share possible queries. For example, "movie" and "live" are among the categories belonging to the "video" domain. The query contains the slot values required for the predicted domain and category; in our case, the slots are "keyword", "date", "location", "state", "from", "to", "use" and "not use". We defined these domains, categories and queries by selecting typical applications of the Yahoo! JAPAN speech assistant.
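To make the frame structure concrete, the following is a minimal sketch of how such a dialogue frame could be represented. The domain and slot names come from the definition above; the class name `DialogueFrame`, its validation logic, and the example keyword value are illustrative assumptions, not part of the system described in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict

# Domain and slot names are those listed in the task definition.
DOMAINS = ("video", "weather", "news", "map", "shop", "recipe")
SLOTS = ("keyword", "date", "location", "state", "from", "to", "use", "not use")


@dataclass
class DialogueFrame:
    """Hypothetical container for one SLU output frame (domain, category, query)."""

    domain: str
    category: str
    query: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject domains and slots that are outside the defined inventory.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        unknown = set(self.query) - set(SLOTS)
        if unknown:
            raise ValueError(f"unknown slots: {sorted(unknown)}")


# Illustrative frame for a request to watch a cat movie (values are assumed).
frame = DialogueFrame(domain="video", category="movie", query={"keyword": "cats"})
```

Categories are left unvalidated here because the text gives only a partial list of them.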

Seed data collection via crowdsourcing
We used crowdsourcing to collect various user request expressions for a given user intention. To collect diverse variations, we gave crowdworkers a three-sentence description of the intention without any example queries. The instruction and an example intention follow:
Instruction: You will see three sentences as your "intention," which describe your situation. Please input what you will say in that situation.
Intention: You would like to watch cats on video. Your mother has the remote control of the TV in the living room. What will you ask your mother?
We prepared 120 intentions to collect variations on six domains: 24 for "video", 19 for "weather", 18 for "news", 18 for "map", 18 for "shop", and 19 for "recipe" domains. One hundred crowd workers responded to each intention; in total, we collected 12,000 user utterance variations.

[...] spoken language style, which are similar to user queries. It has been reported that such website data is useful for dialogue systems (Yoshino et al., 2013); we expect that the data will also be useful for SLU. We calculate similarities between the seed queries and the question sentences extracted from the knowledge community website to find the best alignment between augmented sentences and user intents.
The $i$-th user intent $f_i$ has corresponding queries $q_{i,j}$ ($1 \le j \le J$) collected through crowdsourcing, where $J$ is the number of queries assigned to intent $f_i$. We calculate the similarity between each $q_{i,j}$ and $c_k$, a question sentence extracted from the knowledge community website, to find sentences $\hat{c}_k$ similar to $q_{i,j}$. Each $\hat{c}_k$ is then assigned as an augmented training sample of $f_i$.
Converting sentences to vector representations is a common approach to calculating similarities: examples include the vector space model (Salton et al., 1975), means of distributed word representations (Mikolov et al., 2013; Le and Mikolov, 2014), and bi-directional long short-term memory neural networks (Bi-LSTM) (Cross and Huang, 2016; Yang et al., 2019). More recently, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which is based on masked word prediction in surrounding sentences, has become known as a better sentence encoder.
We used the BERT model trained on Japanese Wikipedia (Sakata et al., 2019) with the masked word prediction task because we would like to extract sentences that are semantically similar to the seed queries. The masked word prediction task is based on the distributional hypothesis (Harris, 1954); thus, a model trained on this task can embed semantically similar sentences into nearby points in the latent space. We denote the vector of sentence $q_{i,j}$ as $\mathbf{q}_{i,j}$ and that of $c_k$ as $\mathbf{c}_k$. Because the vectors $\mathbf{q}_{i,j}$ and $\mathbf{c}_k$ have the same dimensionality, we define their similarity as the cosine between them:

$$\mathrm{sim}(q_{i,j}, c_k) = \cos(\mathbf{q}_{i,j}, \mathbf{c}_k) = \frac{\mathbf{q}_{i,j} \cdot \mathbf{c}_k}{\|\mathbf{q}_{i,j}\| \, \|\mathbf{c}_k\|}$$
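The cosine-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: sentence vectors are taken as given (in the paper they come from a BERT encoder, which is not reproduced here), a candidate is assigned to the intent of its most similar seed query, and it is kept only if that similarity is at least a threshold `th`. The function names, the `>=` comparison, and the two-dimensional toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of the same size."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_u * norm_v)


def assign_intent(
    candidate: Vector,
    seeds: Dict[str, List[Vector]],
    th: float,
) -> Optional[Tuple[str, float]]:
    """Return (intent, score) for the most similar seed query, or None if below th."""
    best_intent: Optional[str] = None
    best_score = -1.0
    for intent, vectors in seeds.items():
        for q in vectors:
            score = cosine(q, candidate)
            if score > best_score:
                best_intent, best_score = intent, score
    if best_intent is None or best_score < th:
        return None
    return best_intent, best_score


# Toy seed vectors standing in for encoded crowdsourced queries (made up).
seeds = {"video": [[1.0, 0.0], [0.9, 0.1]], "weather": [[0.0, 1.0]]}
kept = assign_intent([0.8, 0.2], seeds, th=0.84)      # close to the "video" seeds
rejected = assign_intent([1.0, 1.0], seeds, th=0.84)  # not close enough to any seed
```

Lowering `th` admits more candidates, at the cost of accepting sentences that are less similar to any seed query.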

Experiments in spoken language understanding
In this section, we investigate the effect of each data augmentation method through experiments. We describe the SLU system that we used, the experimental setting, and the results.

Spoken language understanding system
As described in Section 2.1, the task is to estimate the domain, category, and query that compose the SLU output frame, given a user request. We used an incremental dialogue state tracker (Coman et al., 2019), which showed good performance in the DSTC2 shared task. We trained this incremental dialogue state tracker on our dataset and then used the final results of the tracker as our SLU results. Note that our SLU task is defined hierarchically; however, the dialogue state tracker we used predicts domain, category, and slot independently.

Table 3: Examples of data augmentation by each method. Score means $\mathrm{argmax}_j \, \mathrm{sim}(q_{i,j}, c_k)$.

Experimental setting
We compared three settings: without augmentation (a single utterance sample is assigned to each user intent), using crowdsourced data, and using data augmented from the knowledge community website. We used the Yahoo! QA website as the target knowledge community website. We divided our dataset into three portions: training, development, and test sets. Note that samples with the same intent are placed in the same set. We used 2,000 samples as our development set and 2,600 samples as our test set, which were collected through crowdsourcing. Following existing work (Coman et al., 2019), we used two metrics: Accuracy and L2. Accuracy is the accuracy of the predicted labels (larger is better), and L2 is the squared error from the one-hot representation of the correct answer (smaller is better). Both were calculated separately for domain, category and query.
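The two metrics can be sketched as follows, assuming the tracker outputs a probability distribution over labels for each sample. Averaging L2 over samples, treating missing labels as probability zero, and the function names are assumptions made for this illustration; the toy predictions are made up.

```python
from typing import Dict, List


def accuracy(preds: List[Dict[str, float]], gold: List[str]) -> float:
    """Fraction of samples whose highest-probability label equals the gold label."""
    correct = sum(
        1 for dist, label in zip(preds, gold) if max(dist, key=dist.get) == label
    )
    return correct / len(gold)


def l2(preds: List[Dict[str, float]], gold: List[str]) -> float:
    """Mean squared error between each distribution and the one-hot gold vector."""
    total = 0.0
    for dist, label in zip(preds, gold):
        labels = set(dist) | {label}  # labels absent from dist count as 0.0
        total += sum(
            (dist.get(name, 0.0) - (1.0 if name == label else 0.0)) ** 2
            for name in labels
        )
    return total / len(gold)


# Toy domain predictions for two samples, both with gold label "video".
preds = [{"video": 0.7, "news": 0.3}, {"video": 0.4, "weather": 0.6}]
gold = ["video", "video"]
acc = accuracy(preds, gold)  # first sample correct, second wrong
err = l2(preds, gold)        # (0.18 + 0.72) / 2
```

L2 rewards well-calibrated distributions even when the top label is the same, which is why the two metrics can diverge.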

Experimental results
First, we show the data size for each setting in Table 1: w/o augmentation, using crowdsourced data (+crowd), and additionally using queries extracted from the online knowledge community (+crowd+Web). #t indicates the number of samples, and th indicates the sample selection threshold. We also show accuracy and L2 for each setting in Table 2. The results show that scores improved when crowdsourced data was used, and that there were additional improvements when data augmented from the knowledge community website was also used. Compared with using crowdsourced data only, data augmentation from the Web yielded large improvements: 19 points on the development set and 35 points on the test set. The threshold for selecting data from the knowledge community website is important. We investigated the threshold by grid search, as shown in Figure 1. These results indicate that we can select the threshold th = 0.84 using the development set, and this threshold also achieved the best scores on the test set.
When we compare the scores at each threshold with the score of the model trained only on crowdsourced data (w/o), the scores decreased in some cases. According to Table 1, scores improved when the method could extract more samples from the Web (th ≤ 0.86). This result indicates that some queries augmented from the knowledge community website do not contribute to SLU accuracy; however, the method can improve scores by increasing the training data size even if some of the added samples are noisy. We show some augmentation examples in Table 3. We can see that crowdsourcing (+Crowd) collected variations of a query that have the same meaning but different expressions. In contrast, the data augmentation results (+Web) contain some examples that have similar expressions but different dialogue frames or different entities. This result suggests that crowdsourcing is useful for collecting query variations for the SLU system, and that data augmentation from the knowledge community website can improve the system through the amount of data. This result also indicates that not all of the data on the Web is useful for training statistical models; we have to establish a method for selecting appropriate training data, as mentioned in existing work (Misu and Kawahara, 2006; Yoshino et al., 2013; Akama et al., 2020).
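The threshold search on the development set can be sketched as a simple grid search. This is only an outline under stated assumptions: `dev_score` is a hypothetical callable that, for a given threshold, would build the augmented training set, train the SLU model, and return development-set accuracy. Here it is replaced by a toy function whose values are placeholders and not results from this paper.

```python
from typing import Callable, Iterable, Tuple


def select_threshold(
    thresholds: Iterable[float],
    dev_score: Callable[[float], float],
) -> Tuple[float, float]:
    """Return the (threshold, score) pair with the highest development-set score."""
    best_th = None
    best_score = float("-inf")
    for th in thresholds:
        score = dev_score(th)
        if score > best_score:  # ties keep the earliest threshold
            best_th, best_score = th, score
    if best_th is None:
        raise ValueError("no thresholds were given")
    return best_th, best_score


def toy_dev_score(th: float) -> float:
    """Placeholder for 'augment with th, train, evaluate on dev' (made-up curve)."""
    return 1.0 - (th - 0.5) ** 2


best_th, best_score = select_threshold([0.1, 0.5, 0.9], toy_dev_score)
```

Selecting the threshold on the development set only, and reporting test scores afterwards, keeps the test set out of the tuning loop.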

Conclusion
In this paper, we proposed to use two sources of the "wisdom of crowds," crowdsourcing and a knowledge community website, to augment the training data for SLU. We used BERT as a converter from utterances to vectors in order to calculate similarities between seed queries and queries extracted from a knowledge community website. Experimental results showed that using both crowdsourcing and a knowledge community website for data augmentation improves the accuracy and robustness of the SLU system at low cost.
The proposed architecture was evaluated only on written texts collected by crowdsourcing; thus, experimental evaluation with real recognized speech is essential future work. In recent end-to-end SLU systems, acoustic features are also important; however, it is difficult to collect such acoustic features from Web texts. One way to solve this problem is to use a machine speech chain (Tjandra et al., 2020), which can generate pseudo acoustic features for the augmented query texts. Disambiguating queries that have several meanings and using other embedding methods such as RoBERTa (Liu et al., 2019) are also left for future work.