From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding

The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.


Introduction
Digital conversational assistants have become an integral part of everyday life, available, e.g., as standalone smart home devices or in smartphones. Key steps in such task-oriented conversational systems are recognizing the intent of a user's utterance and detecting its main arguments, also called slots. Figure 1 illustrates these key Natural Language Understanding (NLU), or Spoken Language Understanding (SLU), tasks for an utterance like "Add reminder to swim at 11am tomorrow". As slots depend on the intent type, leading models typically adopt joint solutions (Chen et al., 2019; Qin et al., 2020).
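Concretely, a joint model predicts one intent label per utterance and one slot tag per token. A minimal sketch of the target representation for the Figure 1 example, assuming BIO-encoded slots (the exact label strings here are illustrative):

```python
# Sketch of the joint target representation for the Figure 1 example,
# assuming BIO-encoded slot labels (label names are illustrative).
utterance = "Add reminder to swim at 11am tomorrow".split()

example = {
    "intent": "add_reminder",  # one label for the whole utterance
    "slots": [                 # one BIO tag per token
        "O", "O", "O", "B-reminder/todo",
        "O", "B-datetime", "I-datetime",
    ],
}

# slot tags and tokens must be aligned one-to-one
assert len(example["slots"]) == len(utterance)
```

Note that, following the guidelines in Appendix G, function words such as 'to' and 'at' are left outside the slot spans.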
Despite advances in neural modeling for slot and intent detection (§ 6), datasets for SLU remain limited, hampering progress toward providing SLU for many language varieties. Most available datasets either support only a specific domain (like air traffic systems) (Xu et al., 2020), or are broader but limited to English and a few other languages (Schuster et al., 2019; Coucke et al., 2018). We release XSID, a new benchmark intended for SLU evaluation in low-resource scenarios. XSID contains evaluation data for 13 languages from six language families, including a very low-resource dialect. It homogenizes annotation styles of two recent datasets (Schuster et al., 2019; Coucke et al., 2018) and provides the broadest public multilingual evaluation data for modern digital assistants.

[Figure 1: English example from XSID, "Add reminder to swim at 11am tomorrow", annotated with intent (add reminder) and slots (todo, datetime). The full set of languages is shown in Table 2.]
Most previous efforts in multilingual SLU focus on translation or on transfer via multilingual embeddings. In this work, we propose an orthogonal approach and study non-English auxiliary tasks for transfer. We hypothesize that jointly training on target-language auxiliary tasks helps the model learn properties of the target language while simultaneously learning a related task. We expect this to refine the multilingual representations for better SLU transfer to a new language. We evaluate a broad range of auxiliary tasks not studied before in such a combination, exploiting raw data, syntax in Universal Dependencies (UD), and parallel data.
Our contributions: i) We provide XSID, a new cross-lingual SLU evaluation dataset covering Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st). ii) We study non-English auxiliary tasks for joint cross-lingual transfer on slots and intents: UD parsing, machine translation (MT), and masked language modeling. iii) We compare our proposed models to strong baselines based on the multilingual pre-trained language models mBERT (Devlin et al., 2019) and xlm-mlm-tlm-xnli15-1024 (Conneau et al., 2020) (henceforth XLM15); the former was pre-trained on 12 of our 13 languages, and XLM15 on 5 of our 13 languages, thereby simulating a low-resource scenario. We also compare to a strong machine translation model (Qin et al., 2020). The remainder of this paper is structured as follows: we first give an overview of existing datasets and introduce XSID (§ 2), then discuss our baselines and proposed extensions (§ 3). After this, we discuss the performance of these models (§ 4) and provide an analysis (§ 5), before ending with related work on cross-lingual SLU (§ 6) and the conclusion (§ 7).

Other SLU Datasets
An overview of existing datasets is shown in Table 1. It should be noted that we started the creation of XSID at the end of 2019, when less variety was available. We chose the Snips (Coucke et al., 2018) and Facebook (Schuster et al., 2019) data as a starting point.
Most existing datasets are English-only (all datasets in Table 1 include English), and they differ in the domains they cover. For example, Atis (Hemphill et al., 1990) is focused on airline-related queries, CSTOP (Einolghozati et al., 2021) contains queries about weather and devices, and other datasets cover multiple domains.
Extensions of Atis to new languages form a main direction. These include translations to Chinese (He et al., 2013), Italian (Bellomaria et al., 2019), Hindi and Turkish (Upadhyay et al., 2018), and, very recently, the MultiAtis++ corpus (Xu et al., 2020) with 9 languages from 4 language families. To the best of our knowledge, the latter is the broadest publicly available SLU corpus to date in terms of the number of languages, yet the data itself is less varied. Almost simultaneously, Schuster et al. (2019) provided a dataset for three new topics (alarm, reminder, weather) in three languages (English, Spanish and Thai). English utterances for a given intent were first solicited from the crowd, then translated into Spanish and Thai, and manually annotated for slots. We follow these approaches, but depart from the Snips (Coucke et al., 2018) and Facebook (Schuster et al., 2019) datasets to create a more varied resource covering 13 languages, while homogenizing the annotations. XSID is a cross-lingual SLU evaluation dataset covering 13 languages from six language families, with English training data. In what follows, we provide details on the creation of XSID (§ 2.2), including the homogenization of annotation guidelines, and on the English source training data (§ 2.3). For the data statement and guidelines, we refer the reader to Sections E, F and G in the Appendix.

XSID
As a starting point, we extract 400 random English utterances from the Snips data (Coucke et al., 2018) and 400 from the Facebook data (Schuster et al., 2019), in both cases consisting of 250 utterances from the test split and 150 from the dev split. We maintain the splits from the original data. We then translate this sample into all of our target languages. It should be noted that some duplicates occur in the random sample of the Facebook data. Since these instances naturally occur more often, we decided to retain them to give a higher weight to common queries in the final evaluation. XSID includes Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st). With 13 evaluation languages and 800 sentences per language, this results in a final dataset of 10,400 sentences. The language selection is based on the availability of translators/annotators (most of them co-authors of this paper, i.e., highly educated with a background in NLP). We favor this setup over crowd-sourcing for quality and breadth in annotation and languages, and because for some languages crowd-sourcing is not an option. For more information on the data and annotators, we refer to the data statement in Appendix E.
The first step of the dataset creation was the translation. For this, the goal was to provide a fluent translation which was as close as possible to the original meaning. Because the data consists of simple, short utterances, we consider our annotator pool to be adequate for this task (even though they are not professional translators). The intents could easily be transferred from the English data, but the slots needed to be re-annotated, which was done by the same annotators.
Unfortunately, we were unable to retrieve annotation guidelines from the earlier efforts. Hence, as a first step and as part of annotator training, we derived annotation guidelines by jointly re-annotating the dev and test portions of the English parts of the two data sources. These guidelines were revised multiple times in the process to derive the final guidelines for the whole dataset. Ultimately, the data collection proceeded in two steps: translation of the data from English, and slot annotation in the target language. The aim of the guidelines was to generalize labels to make them more broadly applicable to other intent subtypes, and to remove within-corpus annotation variation (see Appendix G for details). We calculated inter-annotator agreement on the guidelines: three annotators native in Dutch annotated 100 samples and reached a Fleiss' kappa (Fleiss, 1971) of 0.924, which is very high agreement. Common mistakes included the annotation of question words, the inclusion of locations in reminders, and the inclusion of function words in the spans. We updated the guidelines after the agreement study.
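For reference, Fleiss' kappa can be computed from a per-item count matrix as follows; this is a generic sketch of the metric, not the exact script used for the agreement study.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators assigning category j to item i;
    every item must be rated by the same number of annotators r."""
    n_items = len(counts)
    r = sum(counts[0])
    # mean per-item agreement P-bar
    p_bar = sum(
        (sum(c * c for c in row) - r) / (r * (r - 1)) for row in counts
    ) / n_items
    # chance agreement P_e from the marginal category proportions
    total = n_items * r
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(counts[0]))
    )
    return (p_bar - p_e) / (1 - p_e)

# 3 annotators, perfect agreement on two items over two labels -> kappa = 1
assert fleiss_kappa([[3, 0], [0, 3]]) == 1.0
```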
After these annotation rounds, we finalized the guidelines, which are provided in Appendix G and form the basis for the provided data. Table 2 provides example annotations in all 13 languages for the sentence "I'd like to see the showtimes for Silly Movie 2.0 at the movie house". These example translations illustrate not only the differences in scripts, but also differences in word order and span length, confirming the distances between the languages.

English Training Data
Because of our revised guidelines for the Facebook data and mismatches in the granularity of labels between the Snips and Facebook data, we homogenize the original training data from both sources and include it in our release. For the Facebook data, this includes rule-based fixing of spans and recognition of the REFERENCE and RECURRING TIME labels (for details on this procedure, see scripts/0.fixOrigAnnotation.py in the repository). For the Snips data, we convert the variety of labels that describe a location to the LOCATION label used in the Facebook data, and labels describing a point or range in time to DATETIME. After this process, we simply concatenate the two resulting datasets and shuffle them before training. The resulting training data has 43,605 sentences.
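Such a rule-based homogenization step can be sketched as a simple tag-mapping pass over the BIO labels; the specific Snips label names mapped below are illustrative, not the complete list used in the release.

```python
# Sketch of the label homogenization step; the Snips label names shown
# here are illustrative examples, not the full mapping used in XSID.
SNIPS_TO_XSID = {
    "city": "location",
    "country": "location",
    "geographic_poi": "location",
    "timeRange": "datetime",
}

def homogenize(bio_tag):
    """Map a BIO tag like 'B-city' onto the homogenized label set."""
    if bio_tag == "O":
        return bio_tag
    prefix, label = bio_tag.split("-", 1)
    return f"{prefix}-{SNIPS_TO_XSID.get(label, label)}"

assert homogenize("B-city") == "B-location"
assert homogenize("I-timeRange") == "I-datetime"
assert homogenize("O") == "O"
```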

Models
Our main hypothesis is that we can improve zero-shot transfer with target-language auxiliary tasks. We hypothesize that this helps the multilingual pre-trained base model learn peculiarities of the target language while it learns the target task. To this end, we use three (sets of) tasks with varying degrees of complexity and availability: 1) Masked Language Modeling (MLM), which is similar in spirit to pre-training on another domain (Gururangan et al., 2020); however, we learn it jointly with the target task to avoid catastrophic forgetting (McCloskey and Cohen, 1989); 2) Neural Machine Translation (NMT), where we learn English SLU as well as translation from English to the target language; and 3) Universal Dependency (UD) parsing, to insert linguistic knowledge into the shared parameter space by learning from syntax alongside the SLU task.
In the following subsections, we first describe the implementation of our baseline model and the machine-translation-based model, and then describe the implementation of all auxiliary tasks (and the data used to train them). Auxiliary tasks are sorted by dataset availability (MLM, NMT, UD): the first can be used with any raw text, the second needs parallel data (readily available for many languages as a by-product of multilingual data sources), and the last requires explicit human annotation. For South Tyrolean, a German dialect, no labeled target data of any sort is available; we use the German task data instead. We provide more details of data sources and sizes in Appendix B.

Baseline
All our models are implemented in MaChAmp v0.2 (van der Goot et al., 2021), an AllenNLP-based (Gardner et al., 2018) multi-task learning toolkit. It uses contextual embeddings and fine-tunes them during training. In the multi-task setup, the encoder is shared and each task has its own decoder. For slot prediction, greedy decoding with a softmax layer is used; for intents, a linear classification layer over the [CLS] token (see Figure 2). The data for each task is split into batches, which are then shuffled. For all experiments, we use the default hyperparameters of MaChAmp, which were optimized on a wide variety of tasks (van der Goot et al., 2021). The following models are extensions of this baseline.
In the nmt-transfer model (§ 3.2), the training data is translated before it is passed to the model. For the auxiliary models (§ 3.3, 3.4 and 3.5), we simply add another decoder next to the intent and slot decoders. The losses are summed, each weighted (multiplied) by a factor given in the corresponding subsection. We enable the proportional sampling option of MaChAmp (multinomial sampling with α = 0.5) in all multi-task experiments, to avoid overfitting to the auxiliary task.
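Multinomial sampling with a smoothing exponent rebalances datasets of very different sizes; a minimal sketch (the dataset sizes below are illustrative):

```python
# Sketch of multinomial dataset sampling with a smoothing exponent alpha,
# used to keep a large auxiliary task from dominating the main task.
def sampling_probs(dataset_sizes, alpha=0.5):
    scaled = [n ** alpha for n in dataset_sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

# With alpha = 0.5, a 100k-sentence auxiliary set no longer dwarfs
# a 10k-sentence main task:
probs = sampling_probs([10_000, 100_000])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > 10_000 / 110_000  # main task is upsampled vs. raw proportion
```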

Neural Machine Translation with Attention (nmt-transfer)
For comparison, we trained an NMT model to translate the SLU training data into the target language and map the annotations using attention. As opposed to most previous work using this method, we train our own translation model (with fairseq, see § 5.4) rather than relying on a black-box commercial system.
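The attention-based annotation mapping can be sketched as follows: each target token inherits the slot tag of the source token it attends to most. This is a simplification (real pipelines additionally repair inconsistent BIO sequences), and the toy attention weights are invented.

```python
def project_slots(src_tags, attention):
    """attention[t][s]: weight of source token s for target token t.
    Each target token inherits the tag of its most-attended source token;
    a simplified sketch of attention-based label projection."""
    tgt_tags = []
    for row in attention:
        best_src = max(range(len(row)), key=lambda s: row[s])
        tgt_tags.append(src_tags[best_src])
    return tgt_tags

src = ["O", "B-datetime", "I-datetime"]
att = [
    [0.8, 0.1, 0.1],  # target token 0 aligns to source token 0
    [0.1, 0.2, 0.7],  # target token 1 aligns to source token 2
]
assert project_slots(src, att) == ["O", "I-datetime"]
```

Note that the projected sequence can start a span with an I- tag, which is why a BIO-repair step is usually applied afterwards.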

Masked Language Modeling (aux-mlm)
Previous work has shown that continuing to train a language model with an MLM objective on raw data close to the target domain leads to performance improvements (Gururangan et al., 2020). In our setup, however, the task-specific training data and the target data are from different languages. Therefore, in order to learn to combine the language and the task in a cross-lingual way, we train the model jointly with the MLM objective on the target language and the task-specific classification objectives on the training language. We apply the original BERT masking strategy and, following Liu et al. (2019a), we do not include next sentence prediction. For computational efficiency, we limit the number of input sentences to 100,000 and use a loss weight of 0.01 for MLM training.
Data For our masked language modeling objective, we use the target language machine translation data described above.
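The original BERT masking strategy mentioned above can be sketched as follows; the 15% selection rate and 80/10/10 split are the standard BERT proportions, and the -100 label convention for unselected positions is an implementation detail borrowed from common toolkits, not from MaChAmp specifically.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, p=0.15, rng=None):
    """Original BERT masking: each position is selected with probability p;
    a selected position becomes [MASK] 80% of the time, a random token 10%,
    and stays unchanged 10%. Returns (inputs, labels); labels are -100 for
    positions that do not contribute to the MLM loss."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < p:
            labels[i] = tok  # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token as input
    return inputs, labels

inputs, labels = mask_tokens(list(range(1000)), vocab_size=30000, mask_id=103)
masked = sum(l != -100 for l in labels)
assert 100 < masked < 200  # roughly 15% of 1000 positions are selected
```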

Machine Translation (aux-nmt)
To jointly learn to transfer linguistic knowledge from English to the target language together with the target task, we implement an NMT decoder on top of the shared encoder. We use a sequence-to-sequence model (Sutskever et al., 2014) with a recurrent neural network decoder, which suits the auto-regressive nature of the machine translation task (Cho et al., 2014), and an attention mechanism to avoid compressing the whole source sentence into a fixed-length vector (Bahdanau et al., 2015). We found that fine-tuning the shared encoder achieves good performance on our machine translation datasets (Conneau and Lample, 2019; Clinchant et al., 2019), alleviating the need to freeze its parameters during training to avoid catastrophic forgetting (Imamura and Sumita, 2019; Goodfellow et al., 2014). As for MLM, we use 100,000 sentences and a weight of 0.01.

Data For this auxiliary task, we use the same data as for nmt-transfer, described in detail above.
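The additive attention of Bahdanau et al. (2015) scores each encoder state against the current decoder state with a small feed-forward network and normalizes with a softmax; a dependency-free sketch (the toy dimensions and weights are invented):

```python
import math

def additive_attention(query, keys, W_q, W_k, v):
    """Bahdanau-style additive attention over encoder states (keys):
    score(q, k) = v . tanh(W_q q + W_k k), softmax-normalized."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    q_proj = matvec(W_q, query)
    scores = []
    for k in keys:
        h = [math.tanh(a + b) for a, b in zip(q_proj, matvec(W_k, k))]
        scores.append(sum(vi * hi for vi, hi in zip(v, h)))
    # stable softmax over the scores
    exps = [math.exp(s - max(scores)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# 1-d toy example: identical keys receive identical (uniform) weights
w = additive_attention([1.0], [[1.0], [1.0]], [[1.0]], [[1.0]], [1.0])
assert abs(sum(w) - 1.0) < 1e-9 and abs(w[0] - 0.5) < 1e-9
```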

Universal Dependencies (aux-ud)
Using syntax in hierarchical multi-task learning has previously been shown to be beneficial (Hashimoto et al., 2017; Godwin et al., 2016). We here use full Universal Dependency (UD) parsing, i.e., part-of-speech (POS) tagging, lemmatization, morphological tagging and dependency parsing, as joint auxiliary tasks.

Data For each language, we manually picked a matching UD treebank from version 2.6 (Nivre et al., 2020) (details in the Appendix). Whenever available, we picked an in-language treebank; otherwise we chose a related language, using size, annotation quality, and domain as criteria.
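UD treebanks are distributed in the CoNLL-U format; a minimal reader covering only the columns relevant to the tasks above (form, lemma, UPOS, head, dependency relation) might look like the sketch below.

```python
# Minimal sketch of reading UD annotations from CoNLL-U; only the columns
# used by the auxiliary tasks above are kept.
def read_conllu(lines):
    sents, sent = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            continue  # sentence-level metadata
        if not line:
            if sent:
                sents.append(sent)
                sent = []
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word token ranges and empty nodes
        sent.append({"form": cols[1], "lemma": cols[2], "upos": cols[3],
                     "head": int(cols[6]), "deprel": cols[7]})
    if sent:
        sents.append(sent)
    return sents

sample = [
    "# text = Add reminders",
    "1\tAdd\tadd\tVERB\t_\t_\t0\troot\t_\t_",
    "2\treminders\treminder\tNOUN\t_\t_\t1\tobj\t_\t_",
    "",
]
parsed = read_conllu(sample)
assert parsed[0][1]["lemma"] == "reminder" and parsed[0][1]["head"] == 1
```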

Experimental Setup
We target a low-resource setup, and hence all our experiments assume no target-language training or development data for the target task. For all our experiments we use the English training data from the Facebook and Snips datasets, and their English development sets (all converted to match our guidelines, see § 2). As is standard for these tasks, we use strict-span F1 for slots (where both span and label must match exactly) and accuracy for intents as the main evaluation metrics. All reported results (including on analysis and test data) are averages over 5 runs with different random seeds.
To choose the final model, we use the scores on the English development data. We are aware that this was recently shown to be sub-optimal in some settings (Keung et al., 2020), however there is no clear solution on how to circumvent this in a pure zero-shot cross-lingual setup (i.e. without assuming any target language target task annotation data).
We use multilingual BERT (mBERT) as the contextual encoder for our experiments. As we are also interested in low-resource setups, and all of our languages (except the de-st dialect) are included in the pre-training of mBERT, we additionally study XLM15 (XLM-MLM-TLM-XNLI15-1024), whose pre-training covers only 5 of the 13 XSID languages, to further simulate a real low-resource setup.

Results

Table 3 reports the scores on the 13 XSID languages, for 2 tasks (slot and intent prediction) and 2 pre-trained language models. Languages are ordered by language distance, whenever available. Below we discuss the main findings per task.

[Figure 3: Performance increase over the baseline for each auxiliary task with respect to the language distance (lang2vec) to English, for mBERT (a) and XLM15 (b). Note that the lines carry no meaning (i.e., performance cannot be concluded from language distance alone); they are only shown to make trends visible.]
Slots For slot filling, auxiliary tasks are beneficial for the majority of the languages, and the best-performing multi-task model (aux-mlm) achieves average improvements over the baseline of +1.3 for mBERT and +7.7 for XLM15. Comparing mBERT and XLM15, we see significant performance drops for languages not seen during XLM15 pre-training, e.g., Danish (da) and Indonesian (id). This confirms that having a language in pre-training has a large impact on cross-lingual transfer for this task. For the other languages involved in pre-training, both aux-mlm and aux-ud beat the baseline model. This supports our hypothesis that, after multilingual pre-training, auxiliary tasks (with token-level prediction, both self-supervised and supervised) help the model learn the target language and a better latent alignment for cross-lingual slot filling.
Intents For intent classification, the nmt-transfer model is very strong, as it uses explicit translations, especially for languages not seen during pre-training. Using NMT as an auxiliary task does not come close; however, it should be noted that it uses only a fraction of the data and computational cost (see § 5.4).

Test Data
Our main findings are confirmed on the test data (Table 4), where we also evaluate on MultiAtis++.
The nmt-transfer model performs best on intents, whereas its performance on slots is worse. The best auxiliary setups are aux-mlm, followed by aux-ud. The most significant gains with auxiliary tasks are obtained for languages not included in pre-training (XLM15). We believe there is a bug for aux-nmt with XLM15 (see also the results in Appendix C), which we unfortunately could not resolve before submission time. Furthermore, we believe more tuning of the machine translation could increase its viability as an auxiliary task. In general, our results on MultiAtis++ are lower than those of Xu et al. (2020), probably because they used a black-box translation model.

Effect of Language Distance
In Figure 3a we plot, for mBERT, the performance increase over the baseline for each auxiliary task with respect to the language distance. The results confirm that aux-mlm is the most promising auxiliary model and clearly show that it is most beneficial for languages at a large distance from English. Figure 3b shows the same plot for the XLM15 models, where the trends are quite different. First, aux-ud as well as aux-mlm are beneficial even for close languages. Second, the aux-ud model also performs better for the more distant languages.

Slot Detection Versus Classification
To evaluate whether the bottleneck is the detection of slot spans or the classification of their labels, we experiment with two variants of the F1 score. For the first variant, we ignore the label and consider only whether the span is correct; we refer to this as unlabeled F1. For the second variant, we allow partial span matches (but with the same label), which count towards true positives for precision and recall; we refer to this as loose F1.
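The strict and unlabeled variants can be sketched as follows (loose F1, which additionally credits partial overlaps carrying the same label, is omitted for brevity):

```python
def spans(tags):
    """Extract (start, end, label) spans from a BIO tag sequence."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):
        if start is not None and not t.startswith("I-"):
            out.append((start, i, tags[start][2:]))
            start = None
        if t.startswith("B-"):
            start = i
    return out

def span_f1(gold, pred, labeled=True):
    """Strict-span F1 when labeled=True; unlabeled F1 otherwise."""
    g = {(s, e, l if labeled else "") for s, e, l in spans(gold)}
    p = {(s, e, l if labeled else "") for s, e, l in spans(pred)}
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["O", "B-datetime", "I-datetime"]
pred = ["O", "B-location", "I-location"]
assert span_f1(gold, pred, labeled=True) == 0.0   # strict: label must match
assert span_f1(gold, pred, labeled=False) == 1.0  # unlabeled: span suffices
```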
Average scores for all three F1 variants and both pre-trained embeddings are plotted in Figure 4. One of the main findings is that nmt-transfer does very well on the loose F1 metric, which means that it is poor at finding exact spans rather than at labeling them.

[Figure 5: Pearson correlations between target-task performance (average of slots/intents) and 1) language distance as estimated by lang2vec, and 2) auxiliary task performance. For nmt-transfer, the auxiliary task score is the BLEU score of the machine translation; the baseline has no auxiliary task.]
For the other models, the difference between strict and unlabeled F1 is smaller, and both error types account for approximately 5-10% absolute score. The only other large difference is for aux-nmt with XLM15, which makes more errors in labeling (its unlabeled F1 is higher). An analysis of the per-language results shows that this is mainly due to errors on the Kazakh dataset.

Correlation Auxiliary Task Performance
In Figure 5 we plot the absolute Pearson correlations between auxiliary task performance (reported in Appendix C) and target-task performance, as well as between target-task performance and language distance (from lang2vec, see Table 3). Here we use the average of slots/intents as the target-task score. The results show that when using only languages included in the pre-trained language model (i.e., mBERT), language distance and auxiliary task performance are comparably good predictors, whereas when new languages are also considered (XLM15), auxiliary task performance is clearly the stronger predictor.
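The correlations are plain Pearson coefficients over per-language scores; as a sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy examples: a perfect linear relation gives |r| = 1.
assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9
assert abs(pearson([1, 2, 3], [6, 4, 2]) + 1.0) < 1e-9
```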

Computational Costs
All experiments are executed on a single Nvidia V100 GPU. To compare computational costs, Table 5 reports the average training time over all languages for each of the models. The training time for nmt-transfer is the highest, followed by aux-nmt; then come the leaner auxiliary tasks. The inference time of all models on the SLU tasks is highly similar due to the similar architecture (except for nmt-transfer, which additionally requires running fairseq a priori).

Table 5: Average training time per model.

Model          Time (minutes)
base                        3
nmt-transfer            5,145
aux-mlm                   220
aux-nmt                   464
aux-ud                     57

Case Study: Improving on de-st
Our lowest-resource language variety, de-st, is not included in either set of embeddings, and performance on it is generally low. To mitigate this, we investigate whether a small amount of raw data can improve the aux-mlm model. We scraped 23,572 tweets and 6,583 comments from ask.fm, manually identified by a native speaker, and used these as auxiliary data in the aux-mlm model. Although this data is difficult to obtain and contains a mix of varieties including standard German, it resulted in an increase from 49.9 to 56.2 slot F1 and from 68.0 to 68.7 intent accuracy, compared to using the German data in aux-mlm, thereby largely outperforming the baseline. This shows that even small amounts of data are highly beneficial in auxiliary training, confirming the results of Muller et al. (2021).

Related Work
For related datasets, we refer to § 2.1; in this section, we discuss different approaches to cross-lingual SLU. Work on cross-lingual SLU can broadly be divided into two streams, depending on whether it is based mainly on parallel data or on multilingual representations. The first stream of research focuses on generating training data in the target language with machine translation and mapping the slot labels through attention or an external word aligner. The translation-based approach can be further improved by filtering the resulting training data. We propose a third, orthogonal line of research: joint target-language auxiliary task learning. We hypothesize that jointly training on target-language auxiliary tasks helps to learn properties of the target language while learning a related task simultaneously. We frame masked language modeling, Universal Dependency parsing and machine translation as new auxiliary tasks for SLU. Some work on SLU has shown that syntax in graph convolutional networks is beneficial for slots (Qin et al., 2020). Contemporary work shows that high-resource English data helps target-language modeling in sequential transfer setups (Phang et al., 2020). We instead focus on non-English target data for joint SLU in a single cross-lingual multi-task model.

Conclusions
We introduced XSID, a multilingual dataset for spoken language understanding with 13 languages from 6 language families, including an unstudied German dialect. XSID includes a wide variety of intent types and homogenized annotations. We proposed non-English multi-task setups for zero-shot transfer to learn the target language: masked language modeling, neural machine translation and UD parsing. We compared the effect of these auxiliary tasks in two settings. Our results showed that masked language modeling led to the most stable performance improvements; however, when a language is not seen during pre-training, UD parsing led to an even larger performance increase. On intents, generating target-language training data using machine translation outperformed all our proposed models, albeit at a much higher computational cost. Our analysis further shows that nmt-transfer struggles with span detection. Given the trade-off between training time and data availability, MLM multi-tasking is a viable approach for SLU.

B Auxiliary Task Data

Table 7 reports the data sources for the treebanks and the dataset sizes in numbers of words and sentences, for both the treebanks and the parallel data.

C Scores on auxiliary tasks
Even though it was not our goal to improve the auxiliary tasks, performance on these can still be relevant to analyze whether there is any correlation to performance on the XSID tasks. In Table 8, we report the full results for all tasks. These are the scores the correlations of Figure 5 are based on.

D Standard Deviations
Standard deviations of our main results (Table 3) are shown in Table 9.

E XSID Data Statement
Following Bender and Friedman (2018), we outline the data statement for XSID. A. CURATION RATIONALE Collection of utterances intended for use with digital assistants, generated by crowd-workers. We selected a random sample from two much larger sets (Coucke et al., 2018; Schuster et al., 2019), which we translated and annotated for slots and intents for the cross-lingual study of SLU.
C. SPEAKER DEMOGRAPHIC The original data is generated by crowd-workers and their demographics are unknown.
D. ANNOTATOR DEMOGRAPHIC Translators and annotators are the same people. Their ages range from 20 to 57, with the majority below 30; almost all annotators have a background in NLP (except for Chinese, and one inter-annotator agreement annotator for Dutch). Most annotators are currently doing a PhD; one is a postdoc and two are faculty.
E. SPEECH SITUATION The original data was generated in June 2017 (Coucke et al., 2018) and probably in 2019 (Schuster et al., 2019). The crowd-workers were asked to type sentences as they would speak them to a digital assistant, given a topic (intent).
F. TEXT CHARACTERISTICS The genre of the data is determined by the set of supported intents:

F Translation Guidelines
We aim to provide a fluent translation that is as similar in meaning as possible to the original. In some cases translations naturally have more distance; e.g., '7 pm' might translate to '7 in the evening' for languages that have no equivalent of 'pm'. The goal is to obtain sentences as they could plausibly be used in the target language. Some general guidelines:

• In general, named entities are not translated, with the exception of place names, like cities and countries. So names of playlists, persons, etc. stay the same, as do things mentioned between quotes. In languages where names are often transcribed differently (e.g., Serbian), this is done during annotation.
• In case of grammatical mistakes, they are kept (if possible) in the target translation.
• We keep capitalization and punctuation as in the original data (if they exist in the target language).
• Abbreviations not common in fluent discourse are expanded (e.g., Wed → "mercoledì", meds → "medicin"); likewise, words that do not exist in the target language are paraphrased: 'whats the high tomorrow' → 'whats the maximum temperature tomorrow'.
• Some things cannot be translated directly. For example, the phrase 'play me X' does not exist in many languages; in such cases, e.g., 'me' might be left untranslated.
• For languages in which words are not separated by whitespace (e.g., Japanese and Chinese), we ask the translator to insert whitespace at word boundaries to simplify the annotation of the slots.

G Annotation Guidelines
This section describes our annotation guidelines. Their aim was to make annotations homogeneous across the earlier efforts, for which no guidelines were available. Two major changes compared to the original annotations are: i) we generalize labels to make them more broadly applicable to other intent subtypes (an example is the recurring datetime event from the Facebook data (Schuster et al., 2019), which was originally only applied to reminders and not to alarms, as discussed below); ii) we drop annotations of nouns as slots when they are directly inferable from the intent label (e.g., the 'reminder/noun' label was only applicable to nouns, but the same notion was sometimes expressed as a verb, so annotations were missing; since these are already captured by the sentence-level intent labels, we drop such obvious slots).
For the annotation, we use Brat (Stenetorp et al., 2012), and provided the annotators with the gold English annotation (see Figure 6). English annotation was conducted by three annotators who discussed and resolved any initial disagreements. For annotation of the other languages, annotators were instructed to follow the English annotation when possible to maintain consistency.
Because no annotation guidelines were released with the original data, we provide guidelines for our re-annotation of the slots below. Examples are shown in Figure 7.
Spans We exclude function words in the beginning of an NP or VP, like 'for', 'from' in the examples above. An exception is when it is in the middle of a span as contiguous slots are preferred, like in example 2. This is different from previous releases of the data (Schuster et al., 2019), where datetime included 'for', 'at', 'to' and 'on'. We decided to drop them to make the annotations more homogeneous across slot labels, while capturing the core ('head spans') of the slots.
When two words of the same type occur sequentially, we annotate them as one span. This happens both for datetime (example 2, [5 to 6 am]) as well as reference (example 6, [all my]). Furthermore, we keep the annotation on the word-level to simplify processing. If only a part of a word belongs to a label, we annotate the whole word with that label.
Slot labels After our adaptations of the original labels, we annotate the following labels: • datetime: Indicating a date or a time.
Only concrete times are annotated; times relative to other events are included (e.g., 'after work', 'later'), while non-concrete times, like 'until deleted' (example 9), 'what time' or 'when', are excluded.
• recurring datetime: a recurring event; can be used for alarms and reminders. This category takes priority over datetime: if at least one recurring datetime exists in an instance, all datetimes in it should be annotated as recurring datetime, even if they are in different spans (see example 9). Example: 'make alarm for [weekdays at 7 am]'.
• location: describes a location; can be a proper noun (like 'New York') or a nominal or adjective referring to a location ('my area' , 'out (outside)'). If a location is part of a reminder item, it is annotated as reminder/todo instead.
• reference: modifies the scope of an alarm or reminder, usually 'my' or 'all' used in front of the word 'reminder(s)' or 'alarm(s)'. Multiple sequential references are annotated as one span ('cancel [all my] alarms').
• reminder/todo: the item to be reminded of; the word 'to' should be excluded.
In special cases, we also apply this for alarms (see example 8).
• weather/attribute: A property that describes an aspect of the weather; e.g. 'cold', 'rain', 'temperature', 'severe'. Also includes weather-related items like 'coat' and 'umbrella' if used in relation to the weather.