Marrying Up Regular Expressions with Neural Networks: A Case Study for Spoken Language Understanding

The success of many natural language processing (NLP) tasks is bound by the number and quality of annotated data, but there is often a shortage of such training data. In this paper, we ask the question: “Can we combine a neural network (NN) with regular expressions (RE) to improve supervised learning for NLP?”. In answer, we develop novel methods to exploit the rich expressiveness of REs at different levels within a NN, showing that the combination significantly enhances the learning effectiveness when a small number of training examples are available. We evaluate our approach by applying it to spoken language understanding for intent detection and slot filling. Experimental results show that our approach is highly effective in exploiting the available training data, giving a clear boost to the RE-unaware NN.


Introduction
Regular expressions (REs) are widely used in various natural language processing (NLP) tasks like pattern matching, sentence classification, sequence labeling, etc. (Chang and Manning, 2014). As a technique based on human-crafted rules, it is concise, interpretable, tunable, and does not rely on much training data to generate. As such, it is commonly used in industry, especially when the available training examples are limited -a problem known as few-shot learning (GC et al., 2015).
While powerful, REs have a poor generalization ability because all synonyms and variations in a RE must be explicitly specified. As a result, REs are often ensembled with data-driven methods, such as neural network (NN) based techniques, where a set of carefully-written REs are used to handle certain cases with high precision, leaving the rest for data-driven methods.
We believe the use of REs can go beyond simple pattern matching. In addition to being a separate classifier to be ensembled, a RE also encodes a developer's knowledge for the problem domain. The knowledge could be, for example, the informative words (clue words) within a RE's surface form. We argue that such information can be utilized by data-driven methods to achieve better prediction results, especially in few-shot learning.
This work investigates the use of REs to improve NNs -a learning framework that is widely used in many NLP tasks (Goldberg, 2017). The combination of REs and a NN allows us to exploit the conciseness and effectiveness of REs and the strong generalization ability of NNs. This also provides us an opportunity to learn from various kinds of REs, since NNs are known to be good at tolerating noises (Xie et al., 2016). This paper presents novel approaches to combine REs with a NN at different levels. At the input layer, we propose to use the evaluation outcome of REs as the input features of a NN (Sec.3.2). At the network module level, we show how to exploit the knowledge encoded in REs to guide the attention mechanism of a NN (Sec. 3.3). At the output layer, we combine the evaluation outcome of a RE with the NN output in a learnable manner (Sec. 3.4).
We evaluate our approach by applying it to two spoken language understanding (SLU) tasks, namely intent detection and slot filling, which respectively correspond to two fundamental NLP tasks: sentence classification and sequence labeling. To demonstrate the usefulness of REs in realworld scenarios where the available number of annotated data can vary, we explore both the fewshot learning setting and the one with full training data. Experimental results show that our approach is highly effective in utilizing the available Figure 1: A sentence from the ATIS dataset. REs can be used to detect the intent and label slots. annotated data, yielding significantly better learning performance over the RE-unaware method.
Our contributions are as follows.
(1) We present the first work to systematically investigate methods for combining REs with NNs.
(2) The proposed methods are shown to clearly improve the NN performance in both the few-shot learning and the full annotation settings. (3) We provide a set of guidance on how to combine REs with NNs and RE annotation.

Typesetting
In this paper, we use italic for emphasis like intent detection, the Courier typeface for abbreviations like RE, bold italic for the first appearance of a concept like clue words, Courier surrounded by / for regular expressions like /list( the)? AIRLINE/, and underlined italic for words of sentences in our dataset like Boston.

Problem Definition
Our work targets two SLU tasks: intent detection and slot filling. The former is a sentence classification task where we learn a function to map an input sentence of n words, x = [x 1 , ..., x n ], to a corresponding intent label, c. The latter is a sequence labeling task for which we learn a function to take in an input query sentence of n words, x = [x 1 , ..., x n ], to produce a corresponding labeling sequence, y = [y 1 , ..., y n ], where y i is the slot label of the corresponding word, x i . Take the sentence in Fig. 1 as an example. A successful intent detector would suggest the intent of the sentence as flight, i.e., querying about flight-related information. A slot filler, on the other hand, should identify the slots fromloc.city and toloc.city by labeling Boston and Miami, respectively, using the begin-inside-outside (BIO) scheme.

The Use of Regular Expressions
In this work, a RE defines a mapping from a text pattern to several REtags which are the same as or related to the target labels (i.e., intent and slot labels). A search function takes in a RE, applies it to all sentences, and returns any texts that match the pattern. We then assign the REtag (s) (that are associated with the matching RE) to either the matched sentence (for intent detection) or some matched phrases (for slot filling).
Specifically, our REtags for intent detection are the same as the intent labels. For example, in Fig. 1, we get a REtag of flight that is the same as the intent label flight.
For slot filling, we use two different sets of REs. Given the group functionality of RE, we can assign REtags to our interested RE groups (i.e., the expressions defined inside parentheses). The translation from REtags to slot labels depends on how the corresponding REs are used. (1) When REs are used at the network module level (Sec. 3.3), the corresponding REtags are the same as the target slot labels. For instance, the slot RE in Fig. 1 will assign fromloc.city to the first RE group and toloc.city to the second one. Here, CITY is a list of city names, which can be replaced with a RE string like /Boston|Miami|LA|.../. (2) If REs are used in the input (Sec. 3.2) and the output layers (Sec. 3.4) of a NN, the corresponding REtag would be different from the target slot labels. In this context, the two RE groups in Fig. 1 would be simply tagged as city to capture the commonality of three related target slot labels: fromloc.city, toloc.city, stoploc.city. Note that we could use the target slot labels as REtags for all the settings. The purpose of abstracting REtags to a simplified version of the target slot labels here is to show that REs can still be useful when their evaluation outcome does not exactly match our learning objective. Further, as shown in Sec. 4.2, using simplified REtags can also make the development of REs easier in our tasks.
Intuitively, complicated REs can lead to better performance but require more efforts to generate. Generally, there are two aspects affecting RE complexity most: the number of RE groups 1 and or clauses (i.e., expressions separated by the disjunction operator |) in a RE group. Having a larger number of RE groups often leads to better precision but lower coverage on pattern matching, while a larger number of or clauses usually gives a higher coverage but slightly lower precision.

Our Approach
As depicted in Fig. 2, we propose to combine NNs and REs from three different angles.

Base Models
We use the Bi-directional LSTM (BLSTM) as our base NN model because it is effective in both intent detection and slot filling (Liu and Lane, 2016).
Intent Detection. As shown in Fig. 2, the BLSTM takes as input the word embeddings [x 1 , ..., x n ] of a n-word sentence, and produces a vector h i for each word i. A self-attention layer then takes in the vectors produced by the BLSTM to compute the sentence embedding s: where α i is the attention for word i, c is a randomly initialized trainable vector used to select informative words for classification, and W is a weight matrix. Finally, s is fed to a softmax classifier for intent classification.
Slot Filling. The model for slot filling is straightforward -the slot label prediction is generated by a softmax classier which takes in the BLSTM's output h i and produces the slot label of word i. Note that attention aggregation in Fig. 2 is only employed by the network module level method presented in Sec. 3.3.

Using REs at the Input Level
At the input level, we use the evaluation outcomes of REs as features which are fed to NN models.
Intent Detection. Our REtag for intent detection is the same as our target intent label. Because real-world REs are unlikely to be perfect, one sentence may be matched by more than one RE. This may result in several REtags that are conflict with each other. For instance, the sentence list the Delta airlines flights to Miami can match a RE: /list( the)? AIRLINE/ that outputs tag airline, and another RE: /list( \w+){0,3} flights?/ that outputs tag flight.
To resolve the conflicting situations illustrated above, we average the randomly initialized trainable tag embeddings to form an aggregated embedding as the NN input. There are two ways to use the aggregated embedding. We can append the aggregated embedding to either the embedding of every input word, or the input of the softmax classifier (see 1 in Fig. 2(a)). To determine which strategy works best, we perform a pilot study. We found that the first method causes the tag embedding to be copied many times; consequently, the NN tends to heavily rely on the REtags, and the resulting performance is similar to the one given by using REs alone in few-shot settings. Thus, we adopt the second approach.
Slot Filling. Since the evaluation outcomes of slot REs are word-level tags, we can simply embed and average the REtags into a vector f i for each word, and append it to the corresponding word embedding w i (as shown in 1 in Fig. 2(b)). Note that we also extend the slot REtags into the BIO format, e.g., the REtags of phrase New York are B-city and I-city if its original tag is city.

Using REs at the Network Module Level
At the network module level, we explore ways to utilize the clue words in the surface form of a RE (bold blue arrows and words in 2 of Fig. 2) to guide the attention module in NNs.
Intent Detection. Taking the sentence in Fig. 1 for example, the RE: /ˆflights? from/ that leads to intent flight means that flights from are the key words to decide the intent flight. Therefore, the attention module in NNs should leverage these two words to get the correct prediction. To this end, we extend the base intent model by making two changes to incorporate the guidance from REs.
First, since each intent has its own clue words, using a single sentence embedding for all intent labels would make the attention less focused. Therefore, we let each intent label k use different attention a k , which is then used to generate the sentence embedding s k for that intent: where c k is a trainable vector for intent k which is used to compute attention a k , h i is the BLSTM output for word i, and W a is a weight matrix. The probability p k that the input sentence expresses intent k is computed by: where w k , logit k , b k are weight vector, logit, and bias for intent k, respectively.
Second, apart from indicating a sentence for intent k (positive REs), a RE can also indicate that a sentence does not express intent k (negative REs). We thus use a new set of attention (negative attentions, in contrast to positive attentions), to compute another set of logits for each intent with Eqs. 2 and 3. We denote the logits computed by positive attentions as logit pk , and those by negative attentions as logit nk , the final logit for intent k can then be calculated as: To use REs to guide attention, we add an attention loss to the final loss: where t ki is set to 0 when none of the matched REs (that leads to intent k) marks word i as a clue word -otherwise t ki is set to 1/l k , where l k is the number of clue words for intent k (if no matched RE leads to intent k, then t k * = 0). We use Eq. 5 to compute the positive attention loss, loss att p , for positive REs and negative attention loss, loss att n , for negative ones. The final loss is computed as: loss = loss c + β p loss att p + β n loss att n (6) where loss c is the original classification loss, β p and β n are weights for the two attention losses.
Slot Filling. The two-side attention (positive and negative attention) mechanism introduced for intent prediction is unsuitable for slot filling. Because for slot filling, we need to compute attention for each word, which demands more compu-tational and memory resources than doing that for intent detection 2 .
Because of the aforementioned reason, we use a simplified version of the two-side attention, where all the slot labels share the same set of positive and negative attention. Specifically, to predict the slot label of word i, we use the following equations, which are similar to Eq. 1, to generate a sentence embedding s pi with regard to word i from positive attention: where h i and h j are the BLSTM outputs for word i and j respectively, W sp is a weight matrix, and α pij is the positive attention value for word j with respect to word i. Further, by replacing W sp with W sn , we use Eq. 7 again to compute negative attention and generate the corresponding sentence embedding s ni .
Finally, the prediction p i for word i can be calculated as: where W p , W n , b p , b n are weight matrices and bias vectors for positive and negative attention, respectively. Here we append the BLSTM output h i to s pi and s ni because the word i itself also plays a crucial part in identifying its slot label.

Using REs at the Output Level
At the output level, REs are used to amend the output of NNs. At this level, we take the same approach used for intent detection and slot filling (see 3 in Fig. 2). As mentioned in Sec. 2.3, the slot REs used in the output level only produce a simplified version of target slot labels, for which we can further annotate their corresponding target slot labels. For instance, a RE that outputs city can lead to three slot labels: fromloc.city, toloc.city, stoploc.city.
Let z k be a 0-1 indicator of whether there is at least one matched RE that leads to target label k (intent or slot label), the final logits of label k for a sentence (or a specific word for slot filling) is: where logit k is the logit produced by the original NN, and w k is a trainable weight indicating the overall confidence for REs that lead to target label k. Here we do not assign a trainable weight for each RE because it is often that only a few sentences match a RE.
We modify the logit instead of the final probability because a logit is an unconstrained real value, which matches the property of w k z k better than probability. Actually, when performing model ensemble, ensembling with logits is often empirically better than with the final probability 3 . This is also the reason why we choose to operate on logits in Sec. 3.3.

Evaluation Methodology
Our experiments aim to answer three questions: Q1: Does the use of REs enhance the learning quality when the number of annotated instances is small? Q2: Does the use of REs still help when using the full training data? Q3: How can we choose from different combination methods?

Datasets
We use the ATIS dataset (Hemphill et al., 1990) to evaluate our approach. This dataset is widely used in SLU research. It includes queries of flights, meal, etc. We follow the setup of Liu and Lane (2016) by using 4,978 queries for training and 893 for testing, with 18 intent labels and 127 slot labels. We also split words like Miami's into Miami 's during the tokenization phase to reduce the number of words that do not have a pre-trained word embedding. This strategy is useful for fewshot learning.
To answer Q1 , we also exploit the full few-shot learning setting. Specifically, for intent detection, we randomly select 5, 10, 20 training instances for each intent to form the few-shot training set; and for slot filling, we also explore 5, 10, 20 shots settings. However, since a sentence typically contains multiple slots, the number of mentions of frequent slot labels may inevitably exceeds the target shot count. To better approximate the target shot count, we select sentences for each slot label in ascending order of label frequencies. That is k 1 -shot dataset will contain k 2 -shot dataset if k 1 > k 2 . All settings use the original test set.
Since most existing few-shot learning methods require either many few-shot classes or some classes with enough data for training, we also explore the partial few-shot learning setting for intent detection to provide a fair comparison for existing few-shot learning methods. Specifically, we let the 3 most frequent intents have 300 training instances, and the rest remains untouched. This is also a common scenario in real world, where we often have several frequent classes and many classes with limited data. As for slot filling, however, since the number of mentions of frequent slot labels already exceeds the target shot count, the original slot filling few-shot dataset can be directly used to train existing few-shot learning methods. Therefore, we do not distinguish full and partial few-shot learning for slot filling.

Preparing REs
We use the syntax of REs in Perl in this work. Our REs are written by a paid annotator who is familiar with the domain. It took the annotator in total less than 10 hours to develop all the REs, while a domain expert can accomplish the task faster. We use the 20-shot training data to develop the REs, but word lists like cities are obtained from the full training set. The development of REs is considered completed when the REs can cover most of the cases in the 20-shot training data with resonable precision. After that, the REs are fixed throughout the experiments.
The majority of the time for writing the REs is proportional to the number of RE groups. It took about 1.5 hours to write the 54 intent REs with on average 2.2 groups per RE. It is straightforward to write the slot REs for the input and output level methods, for which it took around 1 hour to write the 60 REs with 1.7 groups on average. By con-trast, writing slot REs to guide attention requires more efforts as the annotator needs to carefully select clue words and annotate the full slot label. As a result, it took about 5.5 hours to generate 115 REs with on average 3.3 groups. The performance of the REs can be found in the last line of Table 1. In practice, a positive RE for intent (or slot) k can often be treated as negative REs for other intents (or slots). As such, we use the positive REs for intent (or slot) k as the negative REs for other intents (or slots) in our experiments.

Experimental Setup
Hyper-parameters. Our hyper-parameters for the BLSTM are similar to the ones used by Liu and Lane (2016). Specifically, we use batch size 16, dropout probability 0.5, and BLSTM cell size 100. The attention loss weight is 16 (both positive and negative) for full few-shot learning settings and 1 for other settings. We use the 100d GloVe word vectors (Pennington et al., 2014) pre-trained on Wikipedia and Gigaword (Parker et al., 2011), and the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001.
Evaluation Metrics. We report accuracy and macro-F1 for intent detection, and micro/macro-F1 for slot filling. Micro/macro-F1 are the harmonic mean of micro/macro precision and recall. Macro-precision/recall are calculated by averaging precision/recall of each label, and microprecision/recall are averaged over each prediction.
Competitors and Naming Conventions. Here, a bold Courier typeface like BLSTM denotes the notations of the models that we will compare in Sec. 5.
Specifically, we compare our methods with the baseline BLSTM model (Sec. 3.1). Since our attention loss method (Sec. 3.3) uses two-side attention, we include the raw two-side attention model without attention loss (+two) for comparison as well. Besides, we also evaluate the RE output (REO), which uses the REtags as prediction directly, to show the quality of the REs that we will use in the experiments. 4 As for our methods for combinging REs with NN, +feat refers to using REtag as input features (Sec. 3.2), +posi and +neg refer to using positive and negative attention loss respectively, +both refers to using both postive and negative attention losses (Sec. 3.3), and +logit means using REtag to modify NN output (Sec. 3.4).
Moverover, since the REs can also be formatted as first-order-logic (FOL) rules, we also compare our methods with the teacher-student framework proposed by Hu et al. (2016a), which is a general framework for distilling knowledge from FOL rules into NN (+hu16). Besides, since we consider few-short learning, we also include the memory module proposed by Kaiser et al. (2017), which performs well in various few-shot datasets (+mem) 5 . Finally, the state-of-art model on the ATIS dataset is also included (L&L16), which jointly models the intent detection and slot filling in a single network (Liu and Lane, 2016).

Full Few-Shot Learning
To answer Q1 , we first explore the full few-shot learning scenario.
Intent Detection. As shown in Table 1, except for 5-shot, all approaches improve the baseline BLSTM. Our network-module-level methods give the best performance because our attention module directly receives signals from the clue words in REs that contain more meaningful information than the REtag itself used by other methods. We also observe that since negative REs are derived from positive REs with some noises, posi performs better than neg when the amount of available data is limited. However, neg is slightly better in 20-shot, possibly because negative REs significantly outnumbers the positive ones. Besides, two alone works better than the BLSTM when there are sufficient data, confirming the advantage of our two-side attention architecture.
As for other proposed methods, the output level method (logit) works generally better than the input level method (feat), except for the 5-shot case. We believe this is due to the fewer number of RE related parameters and the shorter distance that the gradient needs to travel from the loss to these parameters -both make logit easier to train. However, since logit directly modifies the output, the final prediction is more sensitive to the insufficiently trained weights in logit, leading to the inferior results in the 5-shot setting. To compare with existing methods of combining NN and rules, we also implement the teacherstudent network (Hu et al., 2016a). This method lets the NN learn from the posterior label distribution produced by FOL rules in a teacher-student framework, but requires considerable amounts of data. Therefore, although both hu16 and logit operate at the output level, logit still performs better than hu16 in these few-shot settings, since logit is easier to train.
It can also be seen that starting from 10-shot, two+both significantly outperforms pure REO. This suggests that by using our attention loss to connect the distributional representation of the NN and the clue words of REs, we can generalize RE patterns within a NN architecture by using a small amount of annotated data.
Slot Filling. Different from intent detection, as shown in Table 1, our attention loss does not work for slot filling. The reason is that the slot label of a target word (the word for which we are trying to predict a slot label) is decided mainly by the semantic meaning of the word itself, together with 0-3 phrases in the context to provide supplementary information. However, our attention mechanism can only help in recognizing clue words in the context, which is less important than the word itself and have already been captured by the BLSTM, to some extent. Therefore, the attention loss and the attention related parameters are more of a burden than a benefit. As is shown in Fig. 1, the model recognizes Boston as fromloc.city mainly because Boston itself is a city, and its context word from may have already been captured by the BLSTM and our attention mechanism does not help much. By examining the attention values of +two trained on the full dataset, we find that instead of mark-ing informative context words, the attention tends to concentrate on the target word itself. This observation further reinforces our hypothesis on the attention loss.
On the other hand, since the REtags provide extra information, such as type, about words in the sentence, logit and feat generally work better. However, different from intent detection, feat only outperforms logit by a margin. This is because feat can use the REtags of all words to generate better context representations through the NN, while logit can only utilize the REtag of the target word before the final output layer. As a result, feat actually gathers more information from REs and can make better use of them than logit. Again, hu16 is still outperformed by logit, possibly due to the insufficient data support in this few-shot scenario. We also see that even the BLSTM outperforms REO in 5-shot, indicating while it is hard to write high-quality RE patterns, using REs to boost NNs is still feasible.
Summary. The amount of extra information that a NN can utilize from the combined REs significantly affects the resulting performance. Thus, the attention loss methods work best for intent detection and feat works best for slot filling. We also see that the improvements from REs decreases as having more training data. This is not surprising because the implicit knowledge embedded in the REs are likely to have already been captured by a sufficient large annotated dataset and in this scenario using the REs will bring in fewer benefits.

Partial Few-Shot Learning
To better understand the relationship between our approach and existing few-shot learning methods, we also implement the memory network method   Table 3: Results on Full Dataset. The left side of '/' applies for intent, and the right side for slot. (Kaiser et al., 2017) which achieves good results in various few-shot datasets. We adapt their opensource code, and add their memory module (mem) to our BLSTM model. Since the memory module requires to be trained on either many few-shot classes or several classes with extra data, we expand our full few-shot dataset for intent detection, so that the top 3 intent labels have 300 sentences (partial few-shot).
As shown in Table 2, mem works better than BLSTM, and our attention loss can be further combined with the memory module (mem+posi), with even better performance. hu16 also works here, but worse than two+both. Note that, the memory module requires the input sentence to have only one embedding, thus we only use one set of positive attention for combination.
As for slot filling, since we already have extra data for frequent tags in the original few-shot data (see Sec. 4.1), we use them directly to run the memory module. As shown in the bottom of Table  1, mem also improves the base BLSTM, and gains further boost when it is combined with feat 6 .

Full Dataset
To answer Q2, we also evaluate our methods on the full dataset. As seen in Table 3, for intent detection, while two+both still works, feat and logit no longer give improvements. This shows 6 For compactness, we only combine the best method in each task with mem, but others can also be combined.  Table 4: Results on 20-Shot Data with Simple REs. +both refers to +two +both for short.
that since both REtag and annotated data provide intent labels for the input sentence, the value of the extra noisy tag from RE become limited as we have more annotated data. However, as there is no guidance on attention in the annotations, the clue words from REs are still useful. Further, since feat concatenates REtags at the input level, the powerful NN makes it more likely to overfit than logit, therefore feat performs even worse when compared to the BLSTM.
As for slot filling, introducing feat and logit can still bring further improvements. This shows that the word type information contained in the REtags is still hard to be fully learned even when we have more annotated data. Moreover, different from few-shot settings, two+both has a better macro-F1 score than the BLSTM for this task, suggesting that better attention is still useful when the base model is properly trained.
Again, hu16 outperforms the BLSTM in both tasks, showing that although the REtags are noisy, their teacher-student network can still distill useful information. However, hu16 is a general framework to combine FOL rules, which is more indirect in transferring knowledge from rules to NN than our methods. Therefore, it is still inferior to attention loss in intent detection and feat in slot filling, which are designed to combine REs.
Further, mem generally works in this setting, and can receive further improvement by combining our fusion methods. We can also see that two+both works clearly better than the stateof-art method (L&L16) in intent detection, which jointly models the two tasks. And mem+feat is comparative to L&L16 in slot filling.

Impact of the RE Complexity
We now discuss how the RE complexity affects the performance of the combination. We choose to control the RE complexity by modifying the number of groups. Specifically, we reduce the number of groups for existing REs to decrease RE complexity. To mimic the process of writing simple REs from scratch, we try our best to keep the key RE groups. For intent detection, all the REs are reduced to at most 2 groups. As for slot filling, we also reduce the REs to at most 2 groups, and for some simples case, we further reduce them into word-list patterns, e.g., ( CITY).
As shown in Table 4, the simple REs already deliver clear improvements to the base NN models, which shows the effectiveness of our methods, and indicates that simple REs are quite costefficient since these simple REs only contain 1-2 RE groups and thus very easy to produce. We can also see that using complex REs generally leads to better results compared to using simple REs. This indicates that when considering using REs to improve a NN model, we can start with simple REs, and gradually increase the RE complexity to improve the performance over time 7 .

Related Work
Our work builds upon the following techniques, while qualitatively differing from each NN with Rules. On the initialization side,  uses important n-grams to initialize the convolution filters. On the input side, Wang et al. (2017a) uses knowledge base rules to find relevant concepts for short texts to augment input. On the output side, Hu et al. (2016a;2016b) and Guo et al. (2017) use FOL rules to rectify the output probability of NN, and then let NN learn from the rectified distribution in a teacher-student framework. Xiao et al. (2017), on the other hand, modifies the decoding score of NN by multiplying a weight derived from rules. On the loss function side, people modify the loss function to model the relationship between premise and conclusion (Demeester et al., 2016), and fit both human-annotated and rule-annotated labels (Alashkar et al., 2017). Since fusing in initialization or in loss function often require special properties of the task, these approaches are not applicable to our problem. Our work thus offers new ways to exploit RE rules at different levels of a NN.
NNs and REs. As for NNs and REs, previous work has tried to use RE to speed up the decoding phase of a NN (Strauß et al., 2016) and generating REs from natural language specifications of the 7 We do not include results of both for slot filling since its REs are different from feat and logit, and we have already shown that the attention loss method does not work for slot filling.
RE (Locascio et al., 2016). By contrast, our work aims to use REs to improve the prediction ability of a NN.
Few-Shot Learning. Prior work either considers few-shot learning in a metric learning framework (Koch et al., 2015;Vinyals et al., 2016), or stores instances in a memory (Santoro et al., 2016;Kaiser et al., 2017) to match similar instances in the future. Wang et al. (2017b) further uses the semantic meaning of the class name itself to provide extra information for few-shot learning. Unlike these previous studies, we seek to use the humangenerated REs to provide additional information.
Natural Language Understanding. Recurrent neural networks are proven to be effective in both intent detection (Ravuri and Stoicke, 2015) and slot filling (Mesnil et al., 2015). Researchers also find ways to jointly model the two tasks (Liu and Lane, 2016;Zhang and Wang, 2016). However, no work so far has combined REs and NNs to improve intent detection and slot filling.

Conclusions
In this paper, we investigate different ways to combine NNs and REs for solving typical SLU tasks. Our experiments demonstrate that the combination clearly improves the NN performance in both the few-shot learning and the full dataset settings. We show that by exploiting the implicit knowledge encoded within REs, one can significantly improve the learning performance. Specifically, we observe that using REs to guide the attention module works best for intent detection, and using REtags as features is an effective approach for slot filling. We provide interesting insights on how REs of various forms can be employed to improve NNs, showing that while simple REs are very cost-effective, complex REs generally yield better results.