Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks

Neural networks can achieve impressive performance on many natural language processing applications, but they typically need large amounts of labeled data for training and are not easily interpretable. On the other hand, symbolic rules such as regular expressions are interpretable, require no training, and often achieve decent accuracy; but rules cannot benefit from labeled data when available and hence underperform neural networks in rich-resource scenarios. In this paper, we propose a type of recurrent neural networks called FA-RNNs that combine the advantages of neural networks and regular expression rules. An FA-RNN can be converted from regular expressions and deployed in zero-shot and cold-start scenarios. It can also utilize labeled data for training to achieve improved prediction accuracy. After training, an FA-RNN often remains interpretable and can be converted back into regular expressions. We apply FA-RNNs to text classification and observe that FA-RNNs significantly outperform previous neural approaches in both zero-shot and low-resource settings and remain very competitive in rich-resource settings.


Introduction
Over the past several years, neural network approaches have rapidly gained popularity in natural language processing (NLP) because of their impressive performance and flexible modeling capacity. Nevertheless, symbolic rules are still an indispensable tool in various industrial NLP applications. Regular expressions (REs) are one of the most representative and useful forms of symbolic rules and are widely used for solving tasks such as pattern matching (Hosoya and Pierce, 2001; Zhang et al., 2018) and intent classification (Luo et al., 2018). RE-based systems are highly interpretable and therefore support fine-grained human inspection and manipulation. For example, individual RE rules in a system can be easily added, revised, or removed to quickly adapt the system to changes in the task specification. Moreover, RE-based systems do not require a training stage with labeled data and hence can be quickly deployed with decent performance in zero-shot scenarios. However, REs rely on human experts to write and often have high precision but moderate to low recall; RE-based systems cannot evolve by training on labeled data when available and thus usually underperform neural networks in rich-resource scenarios.
How to combine the advantages of symbolic rules and neural networks is an open question and is drawing increasing attention recently. One possible way is to use rules to constrain neural networks, usually in the manner of regularization via knowledge distillation (Hu et al., 2016) and multi-task learning (Awasthi et al., 2020;Xu et al., 2018), or by tuning the output logits of neural networks (Li and Srikumar, 2019;Luo et al., 2018). In this way, information from rules can be injected into neural networks, though the neural networks still require training and remain black boxes that are hard to interpret and manipulate. Another way of utilizing rules is to design novel neural network architectures inspired by rule systems (Schwartz et al., 2018;Graves et al., 2014;Peng et al., 2018;Lin et al., 2019). Models designed based on this idea usually achieve better interpretability, but they must be trained on labeled data and cannot be directly converted from rules or manually specified by human experts because of their structural differences from rule systems.
In this paper, we propose finite-automaton recurrent neural networks (FA-RNNs), a novel type of recurrent neural network designed based on the computation process of weighted finite-state automata.

Table 1: An RE with label [distance] for matching sentences asking about distance: "$* ( how ( far | long ) | distance ) $*", which matches the text "BOS tell me how far is oakland airport from downtown EOS". '$' is the wildcard, '|' is the OR operator, and '*' is the Kleene star operator. We also show the finite automaton converted from the RE; s_2 is the final state.

Because of the equivalence between
REs and finite-state automata, we can convert any REs into an FA-RNN, which can be deployed in zero-shot and cold-start scenarios. When there are labeled data, the FA-RNN can also be trained in the same way as any neural network, which improves its prediction accuracy over the original REs. The FA-RNN has good interpretability. When converted from REs, it is (approximately) equivalent to the REs and is fully interpretable. Even after training, it often remains highly interpretable and can be converted back into REs. The interpretability of FA-RNNs opens the possibility of fine-grained manipulation such as integrating new REs into a trained FA-RNN and disabling old REs that are used to initialize an FA-RNN.
We apply FA-RNNs to the text classification task and compare them with neural network baselines as well as existing approaches of integrating REs and neural networks. Our experiments find that FA-RNNs show clear advantages in both zero-shot and low-resource settings and remain very competitive in rich-resource settings.

Regular Expressions
Regular expressions (REs) are patterns usually used for searching or matching strings and are a succinct way to denote regular languages. We show a simple example RE for matching sentences in Table 1 (the example is taken from the ATIS intent classification dataset).

RE System for Text Classification
The text classification task aims to assign a class label to an input sentence. Let x = x_1, · · · , x_N be a sentence and L = {l_1, · · · , l_k} be the label set. One common and straightforward way to use REs for classification is as follows. First, write m REs R = {r_1, · · · , r_m}, where each RE corresponds to some label in L. Then, for each sentence x, apply these REs to obtain matching results. Finally, aggregate the matching results to produce a final label for sentence x based on a set of propositional logic rules. Each rule specifies a logical expression of matching results that implies a specific label; for example, letting M_i represent whether RE r_i is matched, a rule may state that M_1 ∨ M_2 implies some label l_j. The whole procedure is shown in the top half of Figure 1.
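The procedure above can be sketched with Python's re module. The patterns, labels, and first-match aggregation rule below are purely illustrative stand-ins, not the paper's actual rule set:

```python
import re

# Hypothetical mini RE system: each RE is paired with a label, and a
# simple aggregation rule returns the label of the first matched RE.
RULES = [
    (re.compile(r"\b(how (far|long)|distance)\b"), "distance"),
    (re.compile(r"\b(fare|cost|price)\b"), "airfare"),
]

def classify(sentence, default="other"):
    """Return the label of the first matching RE, or a default label."""
    for pattern, label in RULES:
        if pattern.search(sentence):
            return label
    return default
```

A real system would aggregate the full vector of matching results with propositional rules rather than taking the first match.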

Finite-State Automaton
Finite-state automata (FA) are machines with a finite number of states. An FA can transition from one state to another in response to an input. It has a start state s_0 and a set of final states S_∞. Every RE can be converted into an FA expressing the same language by Thompson's construction algorithm (Thompson, 1968). For a sequence x = x_1, · · · , x_N, an RE matches the sequence if and only if the converted FA starts from s_0 and reaches a final state after consuming x. Table 1 shows an FA converted from the example RE. Further, for every RE, there exists a unique FA with a minimum number of states and deterministic transitions (m-DFA) such that they express the same language (Hopcroft et al., 2001). Deterministic transitions mean that given a current state and an input, there is a unique next state. The m-DFA can be obtained by running the powerset construction algorithm (Rabin and Scott, 1959) and the DFA minimization algorithm (Hopcroft, 1971).
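A hand-written automaton for the RE of Table 1 can be sketched as a transition table. The state names and the wildcard-fallback lookup are illustrative; this is an approximation, not the exact m-DFA produced by the construction algorithms:

```python
# Transition table for (an approximation of) the FA in Table 1.
# '$' is the wildcard edge, used as a fallback for unlisted words.
DFA = {
    ("s0", "how"): "s1",
    ("s0", "distance"): "s2",
    ("s0", "$"): "s0",     # skip words before the pattern
    ("s1", "far"): "s2",
    ("s1", "long"): "s2",
    ("s2", "$"): "s2",     # skip words after the pattern
}
FINAL = {"s2"}

def accepts(words, start="s0"):
    """Run the FA; reject as soon as no transition applies."""
    state = start
    for w in words:
        state = DFA.get((state, w), DFA.get((state, "$")))
        if state is None:
            return False
    return state in FINAL
```

For example, `accepts("tell me how far is oakland airport from downtown".split())` reaches the final state s_2.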
Weighted Finite-State Automaton

A weighted finite-state automaton (WFA) A with K states over a vocabulary of size V is parameterized by Θ = ⟨α_0, T, α_∞⟩:

• α_0 ∈ R^K: a vector of start weights. α_0[i] is the weight of starting at state s_i.

• T ∈ R^{V×K×K}: a tensor of transition weights. T[σ, i, j] is the weight of transiting from s_i to s_j in response to input σ. T[σ] ∈ R^{K×K} denotes the transition matrix of σ.

• α_∞ ∈ R^K: a vector of final weights. α_∞[i] is the weight of staying at state s_i after reading all the inputs.

An FA can be seen as a WFA with 0/1 weights: α_0[i] = 1(s_i ∈ S_0), T[σ, i, j] = 1(s_i transits to s_j on σ), and α_∞[i] = 1(s_i ∈ S_∞), where 1(·) is the indicator function and S_0 denotes the set of start states.
For a sequence x, the score of WFA A accepting x can be calculated using the forward (Baum and Petrie, 1966) and Viterbi (Viterbi, 1967) algorithms. Let a path p = u_1, · · · , u_{N+1} be the sequence of indexes of the states visited while consuming x. The score B(A, p) of path p is

B(A, p) = α_0[u_1] · T[x_1, u_1, u_2] · · · T[x_N, u_N, u_{N+1}] · α_∞[u_{N+1}].    (Eqa.1)

Let π(x) be the set of all paths that start from a start state and reach a final state s_i ∈ S_∞ after consuming sequence x. The forward algorithm computes the sum of path scores:

forward(A, x) = Σ_{p ∈ π(x)} B(A, p) = α_0^T T[x_1] T[x_2] · · · T[x_N] α_∞.    (Eqa.2)
The Viterbi algorithm computes the maximum of path scores.
It can be computed by replacing matrix multiplication in Eqa.2 with the max-plus operator. For an FA A, the forward score is exactly the number of paths in π(x) while the Viterbi score indicates whether π(x) is non-empty.
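The two scoring algorithms can be sketched in NumPy. The three-state automaton below (for a pattern like "$* how far $*") and its vocabulary are illustrative constructions of ours:

```python
import numpy as np

def forward_score(alpha0, T, alpha_inf, x):
    """Sum of path scores: alpha0^T T[x1] ... T[xN] alpha_inf."""
    h = alpha0
    for sigma in x:
        h = T[sigma].T @ h
    return h @ alpha_inf

def viterbi_score(alpha0, T, alpha_inf, x):
    """Best path score: the same recurrence with sums replaced by max
    (max-times here, since all weights are nonnegative)."""
    h = alpha0
    for sigma in x:
        h = np.max(T[sigma] * h[:, None], axis=0)
    return np.max(h * alpha_inf)

# Illustrative 0/1 WFA: states s0 (start), s1, s2 (final); vocabulary
# {how, far, other}; wildcard self-loops at s0 and s2.
HOW, FAR, OTHER = 0, 1, 2
T = np.zeros((3, 3, 3))
T[:, 0, 0] = 1.0          # '$' self-loop at s0 (applies to every word)
T[:, 2, 2] = 1.0          # '$' self-loop at s2
T[HOW, 0, 1] = 1.0        # s0 --how--> s1
T[FAR, 1, 2] = 1.0        # s1 --far--> s2
alpha0 = np.array([1.0, 0.0, 0.0])
alpha_inf = np.array([0.0, 0.0, 1.0])
```

For the accepted sequence (other, how, far, other) both scores are 1, matching the fact that π(x) contains exactly one path; for a rejected sequence both scores are 0.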

Method
We show step-by-step how we can convert REs to a novel type of recurrent neural networks called FA-RNNs.

From REs to Recurrent Neural Networks
RE to FA As mentioned in Sec.2.3, we can convert an RE into an m-DFA. In order to obtain a concise FA with better interpretability and faster computation speed, we treat the wildcard '$' as a special word in the vocabulary and run the algorithms mentioned in Sec.2.3 to obtain a "pseudo" m-DFA A.
FA as RNN As discussed in Sec.2.4, the FA A can be seen as a WFA with 0/1 weights parameterized by Θ = ⟨α_0, T, α_∞⟩. The computation of the WFA forward score (Eqa.2) can be rewritten in a recurrent form. Let h_t ∈ R^K be the forward score vector after consuming t words of x:

h_0 = α_0,    h_t = T[x_t]^T h_{t−1},    (Eqa.4)

and the final forward score is h_N^T α_∞. h_t[i] can be interpreted as the number of paths starting from s_0 and reaching s_i at step t.
The computation of the WFA Viterbi score can be formulated in a similar way. Therefore, we can view a WFA as a form of recurrent neural networks (RNN) parameterized by Θ.

Decomposing the Parameter Tensor
Despite the equivalence to FAs and hence better interpretability, the RNN proposed in Sec.3.1 has many more parameters than a traditional RNN because of the tensor T ∈ R^{V×K×K}. To reduce the number of parameters, we propose to apply tensor rank decomposition (explained in Appendix.A) and decompose T into three matrices E_R ∈ R^{V×r}, D_1 ∈ R^{K×r}, D_2 ∈ R^{K×r}, where r is a hyper-parameter. Note that if r is smaller than the rank of T, then the decomposition is approximate. We empirically find that, for a 100-state FA converted from REs, we can obtain a small decomposition error (≤ 1%) if r ≥ 100. Now the RNN is parameterized by Θ_D = ⟨α_0, α_∞, E_R, D_1, D_2⟩. E_R has a dimension associated with the vocabulary size V and can be viewed as a word embedding matrix containing RE information for each word. Let v_t ∈ R^r be the embedding of word x_t contained in E_R. The recurrent update in Eqa.4 becomes

h_t = D_2 ((D_1^T h_{t−1}) • v_t),    (Eqa.5)

where • denotes the element-wise product. Eqa.5 produces the same result as Eqa.4 with sufficiently large r.
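When the CPD is exact, the equivalence between the full-tensor step (Eqa.4) and the factored step (Eqa.5) can be checked numerically; the shapes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, r = 5, 4, 6
E_R = rng.random((V, r))               # word-embedding-like factor
D1 = rng.random((K, r))
D2 = rng.random((K, r))
# Reconstruct T[s, i, j] = sum_m E_R[s, m] * D1[i, m] * D2[j, m],
# so the decomposition is exact by construction.
T = np.einsum("vm,im,jm->vij", E_R, D1, D2)

h = rng.random(K)                      # previous forward score vector
sigma = 3                              # index of the current word
full_step = T[sigma].T @ h                       # Eqa.4
factored_step = D2 @ ((D1.T @ h) * E_R[sigma])   # Eqa.5
assert np.allclose(full_step, factored_step)
```

The factored step costs O(Kr) per word instead of O(K^2), and only the row E_R[σ] of the large factor is touched.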
Note that the size of h_t is determined by the state number K of the m-DFA converted from the REs. In some cases, K may be too small, resulting in limited representational power of the RNN. A simple method to solve this problem is to concatenate D_1 and D_2 each with a K′ × r zero matrix, hence increasing the hidden state size by K′. Subsequent training (to be introduced later) would update D_1 and D_2 so that these added dimensions can be utilized. This is equivalent to adding K′ isolated states to the FA and relying on training to establish transitions between the old and new states.

Integrating Pretrained Word Embedding
Pretrained word embeddings have been found very useful for bringing external lexical knowledge into neural networks. Let E_w ∈ R^{V×D} be the word embedding matrix and u_t ∈ R^D be the word embedding of x_t in E_w. We introduce another matrix G ∈ R^{D×r} that transforms the D-dimensional word embedding u_t into r dimensions, so that u_t G can replace v_t in the recurrent update of Eqa.5. We initialize G by setting G = E_w^† E_R, where E_w^† is the pseudo-inverse of E_w. In this way, we approximate v_t with u_t G and hence the initialized RNN still tries to mimic the FA. After training, however, the RNN will be able to utilize the additional information contained in the pretrained word embeddings and hence may outperform the original FA.
In practice, we find it beneficial to interpolate the two r-dimensional embeddings v_t and u_t G with a hyper-parameter β ∈ [0, 1]. When β is 1, we only use RE information; as β gets closer to 0, we integrate more external lexical information into the model. The recurrent update formula becomes

h_t = D_2 ((D_1^T h_{t−1}) • (β v_t + (1 − β) u_t G)).    (Eqa.6)

We name this new form of RNNs FA-RNNs, i.e., recurrent neural networks built from finite-state automata.
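The initialization G = E_w^† E_R and the β-interpolation can be sketched as follows. Random matrices stand in for GloVe and the RE-derived embeddings, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D, r = 50, 20, 8
E_w = rng.normal(size=(V, D))      # pretrained word embeddings (stand-in)
E_R = rng.normal(size=(V, r))      # RE-derived word embeddings (stand-in)

G = np.linalg.pinv(E_w) @ E_R      # least-squares fit: E_w @ G ~ E_R

beta = 0.7
t = 3                              # index of the current word x_t
v_t = E_R[t]                       # RE embedding of x_t
u_t = E_w[t]                       # pretrained embedding of x_t
mixed = beta * v_t + (1 - beta) * (u_t @ G)   # interpolated r-dim input
```

With V > D the fit is only in the least-squares sense, which is one reason the initialized FA-RNN mimics the FA approximately rather than exactly.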

Extensions of FA-RNN
Gated Extension (FA-GRU) Inspired by the Gated Recurrent Unit (Chung et al., 2014), we sacrifice some interpretability and add an update gate f_t and a reset gate r_t to the FA-RNN. The update gate determines how much information from the past shall be retained; the reset gate determines whether to reset the previous score vector to h_0. The gates are computed as

f_t = σ(W_f u_t + U_f h_{t−1} + b_f),    r_t = σ(W_r u_t + U_r h_{t−1} + b_r),    (Eqa.7)

where σ is the sigmoid activation function and W_f, W_r, U_f, U_r, b_f, b_r are additional parameters for the gates; the recurrent update then applies the FA-RNN step of Eqa.6 to the reset score vector r_t • h_{t−1} + (1 − r_t) • h_0 and interpolates the result with h_{t−1} according to f_t. Note that when f_t and r_t are close to 1, the FA-GRU degenerates to the FA-RNN. Therefore, we initialize b_f and b_r to a large value and W_f, W_r, U_f, U_r randomly using Xavier initialization (Glorot and Bengio, 2010) to ensure that the initialized FA-GRU is approximately equivalent to the FA-RNN and hence the original REs.
Bidirectional Extension (BiFA-RNN) Our networks can easily be extended to bidirectional variants. For any RE, we can reverse it by simply reversing its word order (e.g., "free $* ( phone | phones ) $*" can be reversed to "$* ( phone | phones ) $* free") and then convert the reversed RE into a WFA and the corresponding FA-RNN. The right-to-left score vector ←h_N can be computed by applying Eqa.6 or Eqa.7 to the reversed input sentence ←x. We then take the average of ←h_N and the left-to-right score vector →h_N to obtain the final score vector.

Aggregation Layer for Text Classification
As introduced in Sec.2.2, an RE system for text classification contains multiple REs whose matching results are aggregated to form a class label prediction. Here we describe how to convert such an RE system into an FA-RNN system for text classification (the bottom half of Figure 1).
For each RE r i in the RE system, we convert it into a WFA A i with K i states, start weights α 0,i , and final weights α ∞,i . We can view these WFAs as a single WFA A with a total number of K = i K i states and multiple start states. We then convert this WFA to an FA-RNN. After we run this FA-RNN on sentence x, the last state vector h N contains the matching information of all the REs.
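The merge of several WFAs into one WFA can be sketched as a block-diagonal construction (a minimal NumPy version; WFAs are (α_0, T, α_∞) triples over a shared vocabulary):

```python
import numpy as np

def union_wfa(wfas):
    """Stack WFAs into one WFA with block-diagonal transitions and
    multiple start states; each input WFA keeps its own states."""
    K = sum(a0.shape[0] for a0, _, _ in wfas)
    V = wfas[0][1].shape[0]
    alpha0, alpha_inf = np.zeros(K), np.zeros(K)
    T = np.zeros((V, K, K))
    off = 0
    for a0, Ti, ainf in wfas:
        k = a0.shape[0]
        alpha0[off:off + k] = a0
        alpha_inf[off:off + k] = ainf
        T[:, off:off + k, off:off + k] = Ti   # no cross-block transitions
        off += k
    return alpha0, T, alpha_inf
```

Because there are no cross-block transitions, the forward score of the merged WFA is the sum of the individual forward scores, and each block of h_N carries the matching information of one RE.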
To predict a class label from h_N, we create a soft aggregation layer. First, we extract the forward or Viterbi score of each RE from h_N: for forward scoring, we follow Eqa.4 and take the score of RE r_i to be the inner product of the block of h_N corresponding to A_i with the final weights α_∞,i (see Table 2). Instead of predicting a single label, the soft aggregation layer outputs the label logits l ∈ R^k. When all the elements of h_N are close to either 0 or 1, the output of the soft aggregation layer is approximately equivalent to that of the RE aggregation layer of Sec.2.2. Since the logical RE aggregation rules can be expressed in conjunctive normal form, we can implement the corresponding soft aggregation layer with a two-layer MLP with ReLU-like activation functions. This is similar to the MLP layer commonly used at the end of traditional neural networks to map the hidden representation to label logits. In practice, we find it sometimes beneficial not to use any activation function in the MLP.
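A minimal sketch of such a soft aggregation layer, with hand-set weights encoding the illustrative disjunction rule "r_1 ∨ r_2 implies l_1" (the rule and weights are ours, not learned parameters):

```python
import numpy as np

def soft_aggregate(c, W1, b1, W2, b2):
    """Map per-RE match scores c of shape (m,) to label logits of
    shape (k,) with a two-layer ReLU MLP."""
    return W2 @ np.maximum(W1 @ c + b1, 0.0) + b2

# Two REs, two labels; label l1 fires when either RE matches.
W1, b1 = np.eye(2), np.zeros(2)
W2 = np.array([[1.0, 1.0],    # logit of l1 = c1 + c2 (a soft OR)
               [0.0, 0.0]])
b2 = np.zeros(2)
```

With near-0/1 match scores, taking the argmax of the logits reproduces the logical rule; training can then soften and refine these weights.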

Training with Labeled Data
So far we have introduced how to initialize an FA-RNN system that is approximately equivalent to an RE classification system. When there are labeled data, the FA-RNN can also be trained to improve its performance. We simply use the output logits l to compute the cross-entropy loss on the training data and use a gradient-based method such as Adam (Kingma and Ba, 2014) to optimize it.

Table 4: Accuracy of zero-shot classification. The RE system and baselines trained on RE-labeled data are included for reference.
Note that we typically fix E R during training because we find that updating E R is not helpful. Therefore, the number of trainable parameters in an FA-RNN is similar to (usually smaller than) that of an RNN. We compare the number of parameters of different models in Appendix.C.

Experiments
We use the forward-score version of FA-RNNs by default in our experiments. We use GloVe (Pennington et al., 2014) as the word embedding and keep it fixed for our methods and all the baselines. We tune the learning rate, the number of additional isolated states K′, and the interpolation coefficient β for our methods on the development set. We provide more details of hyper-parameter tuning for FA-RNNs and all the baselines in Appendix.D.

Datasets
We evaluate the performance of our methods on three text classification datasets that have been used in previous work on integrating REs and neural networks: ATIS (Hemphill et al., 1990), Question Classification (QC) (Li and Roth, 2002), and SMS (Alberto et al., 2015). ATIS is a popular dataset consisting of queries about airline information and services. QC contains questions that can be classified into general categories like LOCATION, ENTITY, etc. SMS is a spam-classification dataset. We write REs for ATIS ourselves and use a modified version of the REs from Awasthi et al. (2020) for QC and SMS. We show dataset statistics and RE examples in Table 3.

Baselines
Basic Networks We compare FA-RNNs with traditional recurrent neural networks including the RNN (Elman, 1990), GRU (Chung et al., 2014), LSTM (Hochreiter and Schmidhuber, 1997), and their bidirectional variants. We also experiment with a 4-layer CNN (Kim, 2014) and a 4-layer DAN (Iyyer et al., 2015), which are also frequently used in text classification. We feed the hidden representation produced by these models into an MLP to obtain the label logits and use the cross-entropy loss as the objective function. We tune the learning rates and the number of hidden states in {50, 100, 150, 200} on the development set for each dataset.

RE-enhanced Basic Networks
We also compare our method with the basic neural networks enhanced by existing methods of combining rules and neural networks. Luo et al. (2018) propose three ways to utilize RE matching results in a neural model: 1) use the results as additional input features; 2) use the results to guide attention; 3) use the results to directly tune the output logits. As our basic networks do not involve attention, we enhance them using 1), 3), or both, denoted as +i, +o, and +io respectively. Another way of utilizing rules is the knowledge distillation framework (Hinton et al., 2015), which treats the RE system as the teacher and its label logits as soft targets, and distills this knowledge into the basic networks. We denote this method as +kd. Hu et al. (2016) combine knowledge distillation with posterior regularization by iteratively projecting the student network into the rule-regularized space. We denote this method as +pr. Finally, in the zero-shot setting, we also enhance these baselines by training them on unlabeled data tagged by regular expressions. We denote this enhancement by +u.

Zero-Shot Classification
We compare our methods with the RE system and RE-enhanced BiGRU in the zero-shot scenario, in which no training data (including the development set) is available. All the methods use or are initialized by exactly the same set of REs. For the +u enhancement, we use the full training data with their labels removed as unlabeled data. We show the results in Table 4.
The results show that our methods are comparable to the RE system. The small differences in accuracy between the RE system and our methods are caused by approximation errors in decomposing the parameter tensor and integrating word embeddings, as well as by the introduction of gates in FA-GRUs. Our methods perform much better than RE-enhanced BiGRUs, because RE-enhanced BiGRUs without training make random guesses, except in the cases of +o and +io, where RE matching results directly influence the outputs and hence improve the predictions. Baselines with the +u enhancement can also match the accuracy of the RE system, but unlike our methods, they require training on sufficient RE-labeled data.
We also report the results of the other baselines in Appendix.E. Without any training, the basic networks not enhanced by RE just perform random guesses. The other RE-enhanced basic networks have similar behaviors to RE-enhanced BiGRUs.

Low-Resource and Full Training
We compare all the methods trained on 1%, 10%, and 100% of the training data. We use the original development set for the 10% and 100% experiments; for the 1% experiment, we sample a smaller development set containing the same amount of data as 1% of the training data to simulate the low-resource setting. Table 5 shows the results. Because of the space limit, for RE-enhanced networks we only report the results of RE-enhanced BiGRUs, which perform the best among all the RE-enhanced networks. The complete results of all the methods with standard deviations can be found in Appendix.E.
From the results we can see that our methods outperform all the other methods in the low-resource settings, especially on 1% training data. With 100% training data, overall our methods are much better than RNN, DAN and CNN, and are either slightly better than or comparable to BiLSTM, BiGRU, and RE-enhanced BiGRUs. RE-enhanced BiGRUs are indeed better than non-enhanced BiGRU in general, and +pr and +kd seem to be more data-hungry than +i, +o and +io.
SMS is a binary spam-detection task, so we regard [spam] as the positive label and calculate the precision and recall of FA-RNN and two baselines, GRU and GRU+io, given different amounts of training data (Fig.2). With no training data, FA-RNN is almost equivalent to the REs and has high precision but moderate recall; with just 3% of the data, its recall is greatly improved while its precision drops only moderately; and with additional data, both its precision and recall improve. For the baselines, the precision and recall of GRU always increase with more data, while the changes in the precision and recall of GRU+io seem less stable.

Analysis
Impact of β  β from Eqa.6 controls the influence of the pretrained word embedding. Fig.3 shows how β impacts the performance of FA-RNN in the zero-shot and fully-trained scenarios. It can be seen that using the pretrained word embedding does not help in the zero-shot scenario but can be helpful in the fully-trained scenario. One possible explanation is that word similarities encoded in the pretrained word embedding may not be compatible with a classification task, but training with data can adapt the model (in particular, by updating G) to better utilize the information contained in the pretrained word embedding.

Ablation Study  Table 6 shows the results of variants of FA-RNN when trained on the full datasets. From the results we conclude that: 1) forward scoring outperforms Viterbi scoring; 2) the tensor decomposition described in Sec.3.2 results in not only fewer parameters but also better overall performance; 3) RE initialization is helpful, as it is much better than random initialization; 4) integrating pretrained word embeddings is beneficial, as performance drops by a large margin with random word embeddings; 5) training E_R does not result in better performance on average, possibly because it introduces too many trainable parameters.

Interpretability
We regard the approximate equivalence between REs/WFAs and our RE-initialized model as an indication of good interpretability for the following two reasons: 1) for people who are familiar with REs and automata, our model is interpretable once converted back into a WFA; 2) for non-experts who are unfamiliar with automata and REs, we may run the RE/WFA of a specific classification label on an input sentence and show which part of the sentence contributes to the (best) match of the RE/WFA with the sentence, which can be easily understood by non-experts.
Note that not only can an FA-RNN be easily converted back into an RE/WFA at initialization, but the conversion can also be done after training. We can use the trained parameters of the FA-RNN, Θ_RE = ⟨Ê_R, D̂_1, D̂_2, Ĝ⟩, and the word embedding matrix E_w to reconstruct the WFA tensor T̂:

T̂(1) = (β Ê_R + (1 − β) E_w Ĝ) (D̂_2 ⊙ D̂_1)^T,

where T̂(1) denotes the mode-1 unfolding of the reconstructed tensor T̂ and ⊙ denotes the Khatri-Rao product. Further, we can use a thresholding function f(x) = 1{x ≥ γ}, where γ is a fixed scalar, to convert the weights into {0, 1} and recover an FA. Similarly, we can round the weights in the soft aggregation layer to reconstruct the logical aggregation layer. In this way, we can convert a trained FA-RNN back into an RE system. In our experiments, we find that although the reconstructed RE systems underperform the corresponding trained FA-RNNs because of the thresholding and rounding during reconstruction, they often outperform the original REs. The reconstructed RE systems achieve 73.6% accuracy on QC (+9.2% compared with the original REs) and 87.45% on ATIS (+0.45% compared with the original REs). On SMS, the reconstructed REs underperform the original ones (−1.2%), probably because the original REs are already good enough. We show an example in Fig.4, in which our model can be seen to learn interesting new patterns such as 'jet' and '737'. We show another example in Appendix.F.

The good interpretability of our models opens the possibility of fine-grained manipulation of the model, e.g., adding new REs without retraining the model. To inject a new set of REs, we convert them into a new FA-RNN with parameters Θ_new = ⟨E_R′, D_1′, D_2′, G′⟩ and merge it into the original trained FA-RNN with parameters Θ_RE by concatenating the parameter matrices: E_R and G are concatenated along the rank dimension, while D_1 and D_2 are concatenated block-diagonally so that the new states are initially disconnected from the old ones. To add new logical aggregation rules, we can update the aggregation layer parameters similarly by concatenation. To disable an RE in an FA-RNN, we reconstruct the WFA, remove all the states of the RE from the WFA except those reachable from states of other REs, and finally convert the WFA back into an FA-RNN.

Related Work

Giles et al. (1999) show the equivalence between WFAs and second-order RNNs. The main differences between our model and theirs include the following.
First, compared with the undecomposed version of our FA-RNN, their RNN model involves nonlinear activation functions that complicate the model. Second, our FA-RNN further decomposes the tensor parameter, integrates word embeddings, and adds the gated and bidirectional extensions. Third, while their work is mostly theoretical, we empirically show the usefulness of our model in text classification.

Conclusion and Future Work
We propose a type of recurrent neural networks called FA-RNNs. An FA-RNN can be initialized from REs and can also learn from data, and hence is applicable to various scenarios including zero-shot, cold-start, low-resource, and rich-resource scenarios. It is also interpretable and can be converted back into REs. Our experiments on text classification show that it outperforms previous neural approaches in both zero-shot and low-resource scenarios and is very competitive in rich-resource scenarios. In the future, we plan to apply FA-RNNs to other tasks and explore other variants of FA-RNNs. We release our data, RE rules, and code at https://github.com/jeffchy/RE2RNN.

A Tensor Rank Decomposition (CPD)
A 3-way tensor T ∈ R^{d_1×d_2×d_3} can be approximated using r rank-1 tensors:

T ≈ Σ_{m=1}^{r} a_m ⊗ b_m ⊗ c_m,  equivalently  T(1) = A (C ⊙ B)^T,

where A ∈ R^{d_1×r}, B ∈ R^{d_2×r}, and C ∈ R^{d_3×r} collect the vectors a_m, b_m, and c_m as columns, T(1) denotes the mode-1 unfolding of T, ⊙ denotes the Khatri-Rao product, and ⊗ denotes the outer product. When the rank of T is less than or equal to r, the decomposition can be made exact.
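The unfolding identity can be verified numerically. Unfolding layouts vary between references; the code below fixes one consistent choice and uses a minimal Khatri-Rao implementation rather than a library call:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x r), (J x r) -> (I*J x r)."""
    I, r = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, r)

rng = np.random.default_rng(0)
d1, d2, d3, r = 3, 4, 5, 2
A = rng.random((d1, r))
B = rng.random((d2, r))
C = rng.random((d3, r))

# T = sum_m a_m (outer) b_m (outer) c_m
T = np.einsum("im,jm,km->ijk", A, B, C)
# mode-1 unfolding with columns indexed by (k, j), j varying fastest
T1 = T.transpose(0, 2, 1).reshape(d1, d3 * d2)
assert np.allclose(T1, A @ khatri_rao(C, B).T)
```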

B Tricks for CPD
Speeding up CPD Decomposing the WFA tensor T ∈ R^{V×K×K} is hard when the vocabulary size V is large. However, if we neglect the wildcard '$', the other words appearing in the REs usually form a small subset Σ′ of the whole vocabulary. Denoting by V_1 the size of Σ′, we can use a wildcard matrix W ∈ R^{K×K} and a much smaller tensor T_1 ∈ R^{V_1×K×K} to represent T. W[i, j] = 1 when the WFA transits from s_i to s_j in response to '$', otherwise 0. Similarly, T_1[σ, i, j] = 1 when the WFA transits from s_i to s_j in response to σ ∈ Σ′, otherwise 0. Because Σ′ is small, it is much easier to decompose T_1 to obtain E_R ∈ R^{V_1×r}, D_1 ∈ R^{K×r}, D_2 ∈ R^{K×r}. After obtaining these matrices, we pad E_R with 0s back into a matrix sized V × r, so that words outside Σ′ have zero embeddings. Letting v_t ∈ R^r be the embedding of word x_t in the padded E_R, the recurrent update of the FA-RNN now becomes

h_t = D_2 ((D_1^T h_{t−1}) • v_t) + W^T h_{t−1}.

We do not train the wildcard matrix W by default. The new recurrent update gives exactly the same result as the one without this trick.
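The wildcard trick can be checked numerically: with T_1 decomposed exactly into E_R, D_1, D_2, the full transition T_1[σ] + W for an in-vocabulary word σ gives the same update as the factored form plus the wildcard term. All matrices below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V1, r = 4, 3, 6
W = (rng.random((K, K)) > 0.6).astype(float)   # wildcard transitions
E_R = rng.random((V1, r))                      # factors of T_1
D1 = rng.random((K, r))
D2 = rng.random((K, r))
T1 = np.einsum("vm,im,jm->vij", E_R, D1, D2)   # exact-by-construction CPD

h = rng.random(K)
s = 1                                          # a word in the RE vocabulary
full = (T1[s] + W).T @ h                       # word + wildcard transitions
factored = D2 @ ((D1.T @ h) * E_R[s]) + W.T @ h
assert np.allclose(full, factored)

oov = W.T @ h   # a word outside the RE vocabulary: wildcard only (v_t = 0)
```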
Normalizing E_R, D_1 and D_2 We find that normalizing E_R, D_1, and D_2 so that they have similar average Frobenius norms results in better performance of our methods. The average Frobenius norm is the Frobenius norm divided by the number of matrix elements. Let a, b, and c denote the average Frobenius norms of E_R, D_1, and D_2 respectively, and let y = (abc)^{1/3}. We normalize E_R, D_1, and D_2 by multiplying them with the factors y/a, y/b, and y/c respectively. Since these three factors multiply to 1, the tensor reconstructed from the normalized matrices is the same as the tensor reconstructed from the original ones.
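A quick numeric check of this normalization (the mismatched scales below are artificial):

```python
import numpy as np

def avg_fro(M):
    """Frobenius norm divided by the number of matrix elements."""
    return np.linalg.norm(M) / M.size

rng = np.random.default_rng(0)
E_R = rng.random((5, 3)) * 10.0   # deliberately mismatched scales
D1 = rng.random((4, 3)) * 0.1
D2 = rng.random((4, 3))

a, b, c = avg_fro(E_R), avg_fro(D1), avg_fro(D2)
y = (a * b * c) ** (1 / 3)
E_Rn, D1n, D2n = E_R * (y / a), D1 * (y / b), D2 * (y / c)

# The three scale factors multiply to 1, so the reconstructed tensor
# is unchanged while the factors now share the same average norm y.
T = np.einsum("vm,im,jm->vij", E_R, D1, D2)
Tn = np.einsum("vm,im,jm->vij", E_Rn, D1n, D2n)
assert np.allclose(T, Tn)
```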

C Number of Parameters
We show the calculation of the numbers of model parameters of our FA-RNNs and traditional recurrent neural networks in Table 7. K is the number of WFA states, r is the tensor decomposition rank, D is the word embedding dimension, and H is the hidden dimension of the recurrent neural networks. In most cases, K and r are smaller than or comparable to H. Table 8 shows the numbers of trainable parameters. The model sizes are tuned and selected using the development set. The parameters associated with the aggregation layer or the MLP layer are also included. The results show that our methods usually have fewer trainable model parameters than the baselines.

D Hyper-parameters

[…, 20, 50]; for +pr and +kd, we select α from [0.3, 0.5, 0.7], which balances between imitating the teacher and predicting the true hard labels. We select the best hyper-parameters for each method based on the averaged development set accuracy.

E Full Results

Tables 9, 10, and 11 show the full experimental results with standard deviations. We run each model under each setting four times with different random seeds. The standard deviations are large in the low-resource scenarios because we also randomly sample the training data.

F Additional Interpretability Example
We present a more complicated example of original and reconstructed REs from the ATIS dataset in Fig.5. The trained RE contains a more sophisticated pattern with more transitions and a slightly different structure.