Evaluation of Crowdsourced User Input Data for Spoken Dialog Systems

Using the Internet to collect data is quite common these days. This process, called crowdsourcing, enables the collection of large amounts of data at reasonable cost. While inexpensive, the resulting data is typically of lower quality, so the data sets must be filtered. The errors that occur can be classified into two groups: technical issues and human errors. For speech recordings, a technical issue could be a noisy background; human errors arise when the task is misunderstood. We employ several techniques for recognizing errors and eliminating faulty data sets in user input data for a Spoken Dialog System (SDS). Furthermore, we compare three different kinds of questionnaires (QNRs) for a given set of seven tasks. We analyze the characteristics of the resulting data sets and give a recommendation as to which type of QNR might be the most suitable for a given purpose.


Introduction
Similar to research in other areas, Automatic Speech Recognition (ASR) systems and SDSs face the challenge of obtaining new training data, e. g., when new domains need to be covered. Until several years ago, a common procedure was to record the required audio samples in an anechoic chamber and have experts (e. g., linguistics students) create the transcriptions. Although data collected via this method is of high quality and can serve as a gold standard, researchers found the approach very time-consuming, yielding rather little data relative to the effort.
A few years ago, platforms like Amazon Mechanical Turk started to offer so-called crowdsourcing approaches, in which Human Intelligence Tasks (HITs) are performed by a group of non-experts. These tasks are open calls that are assigned to the individual crowdworkers. Especially in industrial contexts, crowdsourcing seems to be the means of choice because development cycles are short and large amounts of data for ASR or SDS development can be generated right when needed, although the collected data must be checked for quality (Snow et al., 2008).
Our work analyzes crowdsourced data collected by the company Clickworker (Eskenazi et al., 2013, ch. 9.3.4). The data consists of user input to an in-car SDS, where the crowdworkers had to speak one German utterance for each of seven tasks and afterwards transcribe the utterance themselves. This procedure was conducted for three different types of QNRs: pictures, semantics, and text. We show the differences among these QNRs as well as an overall quality evaluation of the collected data. For this, we make use of Natural Language Processing (NLP) tools.

Collection of Speech Data via Crowdsourcing
Crowdsourcing is nowadays a common means of collecting speech data. Eskénazi defines the crowd as "a group of non-experts who have answered an open call to perform a given task" (Eskenazi et al., 2013). Such a call is advertised via dedicated platforms on the Internet. Even though the participants are called "non-experts", they are skilled enough to perform these tasks. For collecting speech data, recording audio from a variety of speakers helps to build better systems: different speakers have different backgrounds, which is reflected in their speaking style and choice of words (Hofmann et al., 2012). These aspects are key for training a speaker-independent system, and the choice of participants should reflect the target audience of the system. Using untrained workers is also cheaper than hiring experts.

Using ASR to Improve the Quality
Using an ASR system is an integral part of collecting annotated speech data; such systems are used to optimize the collection methods. Williams et al. (2011) have shown how to process HITs for difficult speech data efficiently. One approach is to first create a transcription automatically and let crowdworkers correct it. Since humans are optimistic about their ability to correct errors (Audhkhasi et al., 2012), a two-step approach was proposed by Parent and Eskenazi (2010): let the workers first rate the quality/correctness of transcriptions and perform the corrections in a separate step. Another approach (van Dalen et al., 2015) deals with the combination of automatic and manual transcriptions. The errors produced are orthogonal: while humans tend to introduce spelling errors or skip words, automatic transcriptions feature wrong, additional, or even missing words. The usual method for combining multiple transcriptions is ROVER (Fiscus, 1997), which requires an odd number (typically three) of different transcriptions to be merged so that ties can be broken. By using an ASR system, van Dalen et al. have shown that two manual transcriptions are sufficient to produce high quality.
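The tie-breaking idea behind ROVER can be illustrated with a deliberately simplified sketch: a per-position majority vote over an odd number of transcriptions. Note that this assumes the hypotheses are already token-aligned; the real ROVER first aligns them into a word transition network via iterative dynamic-programming alignment, which this toy version omits.

```python
from collections import Counter

def rover_vote(transcriptions):
    """Combine an odd number of token-aligned transcriptions by
    per-position majority vote (a simplified, ROVER-style scheme;
    real ROVER aligns hypotheses into a word transition network
    before voting)."""
    assert len(transcriptions) % 2 == 1, "odd count needed to break ties"
    tokenized = [t.split() for t in transcriptions]
    length = max(len(t) for t in tokenized)
    # Pad shorter hypotheses with an explicit "no word" symbol.
    padded = [t + ["<eps>"] * (length - len(t)) for t in tokenized]
    merged = []
    for position in zip(*padded):
        word, _ = Counter(position).most_common(1)[0]
        if word != "<eps>":  # majority says "no word here"
            merged.append(word)
    return " ".join(merged)
```

With three hypotheses, a single transcriber's spelling error ("stiglitzweg") is outvoted by the other two: `rover_vote(["navigate to stieglitzweg", "navigate to stiglitzweg", "navigate to stieglitzweg"])` yields `"navigate to stieglitzweg"`.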

Analysis of Crowdsourced User Input Data for Spoken Dialog Systems
In this section, we describe our approach to analyze the given corpus containing crowdsourced user input data for a goal-oriented in-car SDS.

The Corpus
The underlying German utterances for our analysis were collected by the German company Clickworker (http://www.clickworker.com/en). The participants were asked to invoke seven specific actions of an imaginary SDS deployed in a car. First, they received a task description; then they recorded their input via a browser-based application on their own PC (including microphone) at home. Afterwards, the subjects were asked to transcribe their own utterance without hearing or seeing it again.

In the following, we describe tasks 1, 4, and 5 as examples: In task 1, the imaginary user tells the system that he/she wants to listen to a certain radio station. Task 4 comprises the navigation to the address "Stieglitzweg 23, Berlin". In task 5, the user should call Barack Obama on his cell phone.

There were three different QNRs, each asking for all seven tasks named above. The QNRs differed in how the tasks were presented to the subject: by means of pictures, text, or semantics (see Figure 1). In the pictures QNR, the participants were shown one or more pictures depicting the task they should perform. Without any written text, this type of task description does not suggest the use of specific terms. For the text type, the participants were presented a few lines describing the situation they are in and the actions they should perform. This textual representation influences the participants more strongly towards the use of specific terms. In the semantics QNR, the participants are influenced the most, as they are presented a few keywords, which does not favor the use of different words.

Each participant answered all seven tasks but was presented only one type of task description across them. Each type of QNR was assigned to approximately 1,080 users, resulting in 22,680 utterances (34.7 hours) in total, i. e., roughly 7,560 per QNR.
90% of the subjects were between 18 and 35 years old, and 8% between 36 and 55; the smallest group, aged over 55, accounted for 2% of the data. 60% of the participants were men and 40% women.

Evaluation of Self-Entered Transcripts
To assess the overall quality of the underlying corpus, we also had to analyze the self-entered transcripts. For this purpose, we developed an NLP analysis chain that, apart from the actual analysis, contains a large preprocessing part (i. e., mainly cleaning the text). For preprocessing, we first applied a basic tokenizer to split punctuation marks from the rest of the text. Second, we went over the transcripts with the spell checker LanguageTool (https://www.languagetool.org/). For every misspelled word, we checked whether it equals one of the predefined, special keywords that should be entered for the current task (e. g., "Michael Jackson", "Stieglitzweg"). If such a keyword was found, we proceeded to the next word; if not, we used the Levenshtein distance to check which of the correct alternatives proposed by LanguageTool is most similar to one of the words on our "synonymously used words" list. Third, after deciding on the most appropriate spelling for each word, we stored the corrected utterances and used them for further analysis.

[Figure 1: Instructions for task 4 in the form of pictures, text, and semantic entities]

The further analysis included Part-of-Speech (POS) tagging with the TreeTagger (Schmid, 1994) to investigate which and how many different POS patterns, i. e., types of sentence patterns, occur in the corpus and how the QNRs differ from each other on this level. Further, we investigated the most frequent words used in each task, and how many words in total are used in a specific task and in a specific QNR.

With our analysis, we provide answers to the following questions: (a) How large is the linguistic variation in the data set (on sentence and word level)? (b) Which pros and cons do the presented QNRs have? (c) Which QNR is the right one for a certain purpose? We present the results in Section 4.2.
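The candidate-selection step of the preprocessing chain can be sketched as follows. This is an illustration of the described logic, not LanguageTool's actual API: the distance function is the classic dynamic-programming edit distance, and `pick_correction` chooses, among the checker's proposals, the one closest to any word on the synonym list.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pick_correction(proposals, synonym_list):
    """From the spell checker's proposed corrections, pick the one
    closest (by Levenshtein distance) to any word on the
    "synonymously used words" list."""
    return min(proposals,
               key=lambda p: min(levenshtein(p, s) for s in synonym_list))
```

For example, given the proposals `["Sondern", "Sender"]` and the synonym list `["Sender", "Station"]`, the function selects `"Sender"` (distance 0 to a list entry).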

Evaluation of Self-Recorded Audio Data
To determine the usability of the recordings, we compared the lengths of the recordings and analyzed them using an ASR system. Generally, we assume that most recordings are done appropriately and that their quality resembles a normal distribution. We conducted our analysis using the Janus Recognition Toolkit (JRTk) (Woszczyna et al., 1994), which features the IBIS decoder (Soltau et al., 2001). For each task, a certain answer length is expected. This length may vary, but a significantly shorter or longer audio file indicates an error. Whether caused by a technically faulty recording setup or by a misunderstanding of the task description, in both cases the recording needs to be discarded. Even if the length is within a suitable range, the transcription of the audio might be wrong. To see if the transcription matches the spoken words, we use JRTk to perform a forced alignment. We use a GMM/HMM-based recognizer for German with 6,000 context-dependent quintphone states for aligning a phoneme sequence to the audio using forced Viterbi alignment. If there is a mismatch between audio and transcription, there will be phonemes covering unusually long or short parts of the audio.
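Under the normal-distribution assumption above, the length check amounts to flagging outliers relative to each task's mean duration. A minimal sketch (the threshold `k` is illustrative; the paper does not state which cutoff was used):

```python
import statistics

def flag_length_outliers(durations, k=2.5):
    """Return indices of recordings whose duration deviates more than
    k standard deviations from the mean, under the assumption that
    durations within a task are roughly normally distributed."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    return [i for i, d in enumerate(durations)
            if abs(d - mean) > k * stdev]
```

For instance, in a batch of ten roughly 3-second recordings, a 30-second file (e.g., a microphone left running) is flagged for review.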

Results of the Audio Data Analysis
We divided the recordings into 21 different sets, as there are 3 different QNRs with 7 tasks each. Table 1 shows a detailed overview of the recording lengths for the different tasks. While task 4 produces the longest recordings, the semantics QNR produces the shortest recordings.
We also performed a forced Viterbi alignment: Figure 2 shows a histogram of the length of the longest phoneme per utterance, which we use to indicate whether recording and transcription fit together. Since we do not have multiple transcriptions per utterance, we could not determine an optimal parameter set for identifying mismatched cases. However, our preliminary results indicate that the longer the longest phoneme is, the more likely a mismatch becomes.
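The longest-phoneme heuristic can be expressed compactly. The alignment format below (phoneme, start, end in seconds) and the 1.0 s threshold are assumptions for illustration; JRTk's actual output format differs, and the paper reports no optimal threshold.

```python
def longest_phoneme(alignment):
    """Longest single-phoneme duration in a forced alignment, given
    as a list of (phoneme, start_sec, end_sec) tuples (an assumed
    format, not JRTk's native one)."""
    return max(end - start for _, start, end in alignment)

def likely_mismatch(alignment, threshold=1.0):
    """Heuristic: an implausibly long phoneme suggests that audio
    and transcription do not match (e.g., the transcript covers only
    part of what was spoken). Threshold is illustrative."""
    return longest_phoneme(alignment) > threshold
```

An utterance whose final vowel is stretched over more than a second of audio by the aligner would be flagged as a probable transcript/audio mismatch.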

Results of the Transcript Analysis
Aiming to answer the questions posed in Section 3.2, we present the results of the transcript analysis in the following, together with a short discussion. Tables 2 and 3 show the total number of utterances in the respective QNR data sets. The second line of each table displays how many utterances named the obligatory semantic entities, i. e., the two main content words (nouns in many tasks), like "Sender, SWR3". The third line displays the number of utterances that are insufficient according to this criterion. Similarly, lines four and five show how many utterances named all entities that were actually asked for, i. e., all three (or four) items, and how many utterances were dismissed accordingly. As shown, the pictures QNR leads to the most dismissals, while the semantics QNR leads to the fewest; the values of the text QNR lie in between. In total over all QNRs, we have dismissal rates of 17% and 37%, respectively.

Table 4 displays the variance of words used for all three QNRs and across tasks 1-7. For all tasks, the semantics QNR has the lowest number of different words. This is probably caused by displaying the three exact semantic items, which inevitably become the words used. For tasks 1-3 and 7, the text QNR has the highest number of different words, while the pictures QNR leads in tasks 4-6.
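The dismissal criterion described above boils down to checking entity coverage per transcript. A minimal sketch; the entity lists here are illustrative examples, not the paper's exact keyword sets, and the simple word-level match ignores inflection and multi-word entities.

```python
def entity_coverage(transcript, required_entities):
    """Count how many required semantic entities (e.g. "Sender",
    "SWR3" for task 1) occur as words in a transcript."""
    words = transcript.lower().split()
    return sum(1 for e in required_entities if e.lower() in words)

def dismiss(transcript, required_entities):
    """An utterance is dismissed if any required entity is missing."""
    return entity_coverage(transcript, required_entities) < len(required_entities)
```

For example, "Ich möchte den Sender SWR3 hören" covers both obligatory entities for task 1 and is kept, whereas "Radio bitte einschalten" names neither and is dismissed.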
The analysis of the most frequent POS sequences per QNR showed that in the semantics QNR, most people used a polite modal construction such as "Ich möchte den Sender SWR hören" ("I would like to listen to the station SWR"; PPER VMFIN ART NN NN VVINF). In the other QNRs, "Radio SWR3" (NN NN) is the most common pattern among finite and infinite constructions. Table 5 displays the most common sentence for each task. As can be seen, there is a wide variety of linguistic patterns in each task.
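Counting the most frequent POS patterns is a simple aggregation once tagging has happened upstream. A sketch, assuming one list of STTS tags per utterance as TreeTagger would produce (the tagging step itself is not shown):

```python
from collections import Counter

def pos_pattern_counts(tagged_utterances):
    """Count POS-tag sequences across utterances to find the most
    frequent sentence patterns; input is one list of STTS tags per
    utterance, e.g. from TreeTagger."""
    return Counter(" ".join(tags) for tags in tagged_utterances)
```

On a toy input of two "Radio SWR3"-style utterances and one polite modal construction, `most_common(1)` returns the "NN NN" pattern with count 2.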

Conclusion
We have presented various methods for evaluating the collected data set and shown that different types of QNRs lead to different styles in performing the tasks. With respect to the actual application scenario, the way of presenting the task to the participants has to be chosen accordingly.

[Table (recovered from layout): total number of utterances: 7,546; obligatory entities named: 5,420 (72%); insufficient utterances: 2,126 (28%); all asked-for entities named: 3,033 (40%); insufficient utterances: 4,513 (60%)]

The semantics QNR is precise in using three semantic items and is the best choice for generating exact phrases; it leads to very few utterance dismissals. At the same time, however, it displays the target words themselves. To avoid their mere repetition, one approach for future studies would be to display the semantic items in English; the QNR would then also be easily reusable for generating data in other languages.
The pictures QNR is optimal to generate a very high linguistic variance in the data. The downside of this approach is the high dismissal rate, if one aims at generating specific utterances.
The text QNR is a good compromise between the latter two QNRs. Looking at the data analyzed in this work, the text QNR has a lower priming effect on formulations than the semantics QNR.