Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge

This paper explores the task of translating natural language queries into regular expressions which embody their meaning. In contrast to prior work, the proposed neural model does not utilize domain-specific crafting, learning to translate directly from a parallel corpus. To fully explore the potential of neural models, we propose a methodology for collecting a large corpus of regular expression, natural language pairs. Our resulting model achieves a performance gain of 19.6% over previous state-of-the-art models.


Introduction
This paper explores the task of translating natural language text queries into regular expressions which embody their meaning. Regular expressions are built into many application interfaces, yet most users of these applications have difficulty writing them (Friedl, 2002). Thus a system for automatically generating regular expressions from natural language would be useful in many contexts. Furthermore, such technologies can ultimately scale to translate into other formal representations, such as program scripts (Raza et al., 2015).
Prior work has demonstrated the feasibility of this task. Kushman and Barzilay (2013) proposed a model that learns to perform the task from a parallel corpus of regular expressions and their text descriptions. To account for the representational disparity between formal regular expressions and natural language, their model utilizes a domain-specific component which computes the semantic equivalence between two regular expressions. Since their model relies heavily on this component, it cannot be readily applied to other formal representations where such semantic equivalence calculations are not possible.
In this paper, we reexamine the need for such specialized domain knowledge for this task. Given the same parallel corpus used in Kushman and Barzilay (2013), we use an LSTM-based sequence to sequence neural network to perform the mapping. Our model does not utilize semantic equivalence in any form, or make any other special assumptions about the formalism. Despite this and the relatively small size of the original dataset (824 examples), our neural model exhibits a small 0.1% boost in accuracy.
To further explore the power of neural networks, we created a much larger public dataset, NL-RX. Since the creation of regular expressions requires specialized knowledge, standard crowd-sourcing methods are not applicable here. Instead, we employ a two-step generate-and-paraphrase procedure that circumvents this problem. During the generate step, we use a small but expansive manually-crafted grammar that translates regular expressions into natural language. In the paraphrase step, we rely on crowd-sourcing to paraphrase these rigid descriptions into more natural and fluid descriptions. Using this methodology, we have constructed a corpus of 10,000 regular expressions with corresponding verbalizations.
Our results demonstrate that our sequence to sequence model significantly outperforms the domain-specific technique on the larger dataset, reaching a gain of 19.6% over the state-of-the-art technique.

Related Work
Regular Expressions from Natural Language There have been several attempts at generating regular expressions from textual descriptions. Early research into this task used rule-based techniques to create a natural language interface to regular expression writing (Ranta, 1998). Our work, however, is closest to Kushman and Barzilay (2013). They learned a semantic parsing translation model from a parallel dataset of natural language and regular expressions. Their model used a regular expression-specific semantic unification technique to disambiguate the meaning of the natural language descriptions. Our method is similar in that we require only description and regex pairs to learn. However, we treat the problem as a direct translation task without applying any domain-specific knowledge.
Neural Machine Translation Recent advances in neural machine translation (NMT) (Bahdanau et al., 2014; Devlin et al., 2014) using the framework of sequence to sequence learning (Sutskever et al., 2014) have demonstrated the effectiveness of deep learning models at capturing and translating language semantics. In particular, recurrent neural networks augmented with attention mechanisms (Luong et al., 2015) have proved to be successful at handling very long sequences. In light of these successes, we chose to model regular expression generation as a neural translation problem.

Regex Generation as Translation
We use a Recurrent Neural Network (RNN) with attention (Mnih et al., 2014) for both encoding and decoding (Figure 1).
Let $W = w_1, w_2, \ldots, w_m$ be the input text description, where each $w_i$ is a word in the vocabulary. We wish to generate the regex $R = r_1, r_2, \ldots, r_n$, where each $r_i$ is a character in the regex.
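We use Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) cells in our model, the transition equations for which can be summarized as:

$$
\begin{aligned}
i_t &= \sigma(U^{(i)} x_t + V^{(i)} h_{t-1} + b^{(i)}), \\
f_t &= \sigma(U^{(f)} x_t + V^{(f)} h_{t-1} + b^{(f)}), \\
o_t &= \sigma(U^{(o)} x_t + V^{(o)} h_{t-1} + b^{(o)}), \\
z_t &= \tanh(U^{(z)} x_t + V^{(z)} h_{t-1} + b^{(z)}), \\
c_t &= i_t \odot z_t + f_t \odot c_{t-1}, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ represents the sigmoid function and $\odot$ is elementwise multiplication. $i_t$ refers to the input gate, $f_t$ is the forget gate, and $o_t$ is the output gate at each time step. The $U$ and $V$ variables are weight matrices for each gate, while the $b$ variables are the bias parameters. The input $x_t$ is a word ($w_t$) for the encoder and the previously generated character $r_{t-1}$ for the decoder.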
The attention mechanism is essentially a 'soft' weighting over the encoder's hidden states during decoding:

$$
\alpha_t(e) = \frac{\exp(\mathrm{score}(h_t, h_e))}{\sum_{e'} \exp(\mathrm{score}(h_t, h_{e'}))}
$$

where $h_e$ is a hidden state in the encoder, $h_t$ is the current decoder hidden state, and score is the scoring function. We use the general attention matrix weight (as described in Luong et al. (2015)) for our scoring function. The outputs of the decoder $r_t$ are generated using a final softmax layer.
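The soft weighting described above can be illustrated with a minimal pure-Python sketch of one decoder step. It assumes the "general" score takes the bilinear form $h_t^\top W h_e$ from Luong et al. (2015); the function names, the toy vectors, and the identity matrix `W` are illustrative assumptions and not values from the trained model.

```python
import math

def general_score(h_t, W, h_e):
    """Bilinear 'general' score h_t^T W h_e (assumed form, after Luong et al., 2015)."""
    Wh = [sum(W[i][j] * h_e[j] for j in range(len(h_e))) for i in range(len(W))]
    return sum(h_t[i] * Wh[i] for i in range(len(h_t)))

def attend(h_t, encoder_states, W):
    """Return (attention weights, context vector) for one decoder step."""
    scores = [general_score(h_t, W, h_e) for h_e in encoder_states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder hidden states.
    context = [
        sum(w * h_e[k] for w, h_e in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

# Toy example with hypothetical 2-dimensional states.
W = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states, W)
```

In this toy case the first and third encoder states receive equal weight, since both have the same score against the decoder state.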
Our model is six layers deep, with one word embedding layer, two encoder layers, two decoder layers, and one dense output layer. Our encoder and decoder layers use a stacked LSTM architecture with a width of 512 nodes. We use a global attention mechanism (Bahdanau et al., 2014), which considers all hidden states of the encoder when computing the model's context vector. We perform standard dropout during training (Srivastava et al., 2014) after every LSTM layer with dropout probability equal to 0.25. We train for 20 epochs, utilizing a minibatch size of 32 and a learning rate of 1.0. The learning rate is decayed by a factor of 0.5 if evaluation perplexity does not decrease.
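The decay rule can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: the rate is halved whenever evaluation perplexity fails to improve, and the comparison is against the previous epoch rather than the best epoch so far. The function name and the perplexity values are hypothetical.

```python
def decayed_learning_rates(dev_perplexities, initial_lr=1.0, decay=0.5):
    """Return the learning rate used in each epoch.

    After each epoch, the rate is multiplied by `decay` if the evaluation
    perplexity did not improve on the previous epoch (assumed rule).
    """
    lr = initial_lr
    previous = float("inf")
    rates = []
    for perplexity in dev_perplexities:
        rates.append(lr)            # rate used during this epoch
        if perplexity >= previous:  # no improvement: decay for later epochs
            lr *= decay
        previous = perplexity
    return rates

# Hypothetical per-epoch evaluation perplexities.
rates = decayed_learning_rates([10.0, 8.0, 8.5, 7.0, 7.0])
```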

Creating a Large Corpus of Natural Language / Regular Expression Pairs
Previous work in regular expression generation has used fairly small datasets for training and evaluation. In order to fully utilize the power of neural translation models, we create a new large corpus of regular expression, natural language pairs titled NL-RX.

Table 1, non-terminals (regular expression → verbalization):
x & y → x and y
x | y → x or y
∼(x) → not x
.*x.*y → x followed by y
.*x.* → contains x
x{N,} → x, N or more times
x & y & z → x and y and z
x | y | z → x or y or z
x{1,N} → x, at most N times
x.* → starts with x
.*x → ends with x
\b x\b → words with x
(x)+ → x, at least once
(x)* → x, zero or more times
x → only x
The challenge in collecting such corpora is that typical crowdsourcing workers do not possess the specialized knowledge to write regular expressions. To solve this, we employ a two-step generate-and-paraphrase procedure to gather our data. This technique is similar to the methods used by Wang et al. (2015) to create a semantic parsing corpus.
In the generate step, we generate regular expression representations from a small manually-crafted grammar (Table 1). Our grammar includes 15 non-terminal derivations and 6 terminals, covering both basic and high-level operations. We identify these via frequency analysis of smaller datasets from previous work (Kushman and Barzilay, 2013). Every grammar rule has associated verbalizations for both regular expressions and language descriptions. We use this grammar to stochastically generate regular expressions and their corresponding synthetic language descriptions. This generation process is shown in Figure 2.
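The generate step can be illustrated with a short Python sketch that samples a random derivation tree and builds the regular expression and its synthetic description bottom-up. The non-terminal templates are a subset of Table 1; the terminal regexes, the stopping probability, the extra parentheses around children (added to keep nesting unambiguous), the ASCII `~` for negation, and all function names are assumptions made for illustration, not the authors' generator.

```python
import random

# Terminal rules as (regex, verbalization). The regexes are assumed here.
TERMINALS = [
    ("[AEIOUaeiou]", "a vowel"),
    ("[0-9]", "a number"),
    ("[a-z]", "a lowercase letter"),
]

# Subset of the Table 1 non-terminals: (arity, regex template, text template).
NON_TERMINALS = [
    (1, "({0})+", "{0}, at least once"),
    (1, "({0})*", "{0}, zero or more times"),
    (1, ".*({0}).*", "contains {0}"),
    (1, "~({0})", "not {0}"),
    (2, "({0})|({1})", "{0} or {1}"),
    (2, "({0})&({1})", "{0} and {1}"),
    (2, ".*({0}).*({1})", "{0} followed by {1}"),
]

def generate(rng, depth):
    """Sample a (regex, description) pair from a random derivation tree."""
    # Stop at a terminal when the depth budget runs out, or at random.
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    arity, regex_template, text_template = rng.choice(NON_TERMINALS)
    children = [generate(rng, depth - 1) for _ in range(arity)]
    # Apply the rule to the children's strings and pass the result upward.
    regex = regex_template.format(*(child[0] for child in children))
    text = text_template.format(*(child[1] for child in children))
    return regex, text

rng = random.Random(0)
samples = [generate(rng, depth=3) for _ in range(5)]
for regex, text in samples:
    print(f"{regex}\t{text}")
```

Each printed line pairs a regular expression with a rigid description of the kind that would then be handed to a paraphraser.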
While the automatically generated descriptions are semantically correct, they do not exhibit the richness and variability of human-generated descriptions. To obtain natural language (non-synthetic) descriptions, we perform the paraphrase step. In this step, Mechanical Turk (Amazon, 2003) human workers paraphrase the generated synthetic descriptions.

Figure 2: Rules from Table 1 are applied to a node's children and the resulting string is passed to the node's parent.

NL-RX Using the procedure described above, we create a new public dataset (NL-RX) comprising 10,000 regular expressions and their corresponding natural language descriptions. Table 2 shows an example from our dataset.
Our data collection procedure enables us to create a substantially larger and more varied dataset than previously possible. The dataset is larger because employing standard crowdsource workers to paraphrase is more cost-efficient and scalable than employing professional regex programmers. It is more varied because our stochastic generation of regular expressions from a grammar is not subject to the bias of human workers who, in previous work, wrote many duplicate examples (see Results).
Corpora Statistics Our seed regular expression grammar (Table 1) covers approximately 85% of the original KB13 regular expressions. Additionally, NL-RX contains exact matches with 30.1% of the original KB13 dataset regular expressions. This means that 248 of the 824 regular expressions in the KB13 dataset were also in our dataset. The average length of regular expressions in NL-RX is 25.9 characters, while the average in the KB13 dataset is 19.7 characters. We also computed the grammar breakdown of NL-RX. The top 5 occurring terminals in our generated regular expressions are those corresponding with the verbalizations shown in Table 3.

Table 3 (verbalization, frequency):
'the word x', 12.6%
'x before y', 9.1%
'x or y', 7.7%
'x, at least once', 6.2%
'a vowel', 5.3%
Crowdsourcing details We utilize Mechanical Turk for our crowdsource workers. A total of 243 workers completed the 10,000 tasks, with an average task completion time of 101 seconds. The workers proved capable of handling complex and awkward phrasings, such as the example in Table 2, which is one of the most difficult in the set.
We applied several quality assurance measures on the crowdsourced data. Firstly, we ensured that our workers performing the task were of high quality, requiring a record of 97% accuracy over at least 1000 other previous tasks completed on Mechanical Turk. In addition, we ran automatic scripts that filtered out bad submissions (e.g. submissions shorter than 5 characters). In all, we rejected 1.1% of submissions, which were resubmitted for another worker to complete. The combination of these measures ensured a high quality dataset, and we confirmed this by performing a manual check of 100 random examples. This manual check determined that approximately 89% of submissions were a correct interpretation, and 97% were written in fluent English.

Experiments
Datasets We split the 10,000 regexp and description pairs in NL-RX into 65% train, 10% dev, and 25% test sets.
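A split with these proportions can be sketched as below. The text does not say how examples were assigned to each set, so the seeded random shuffle, the function name, and the placeholder pairs are assumptions.

```python
import random

def split_dataset(pairs, train_frac=0.65, dev_frac=0.10, seed=0):
    """Shuffle the pairs and cut them into train, dev, and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder, 25% by default
    return train, dev, test

# Placeholder data standing in for the 10,000 NL-RX pairs.
pairs = [(f"regex_{i}", f"description_{i}") for i in range(10000)]
train, dev, test = split_dataset(pairs)
```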
We also evaluate our model on the dataset used by Kushman and Barzilay (2013) (KB13), although it contains far fewer data points (824). We use the 75/25 train/test split used in their work in order to directly compare our performance to theirs.
Training We perform a hyper-parameter grid search (on the dev set) to determine our model hyper-parameters: learning-rate = 1.0, encoder-depth = 2, decoder-depth = 2, batch size = 32, dropout = 0.25. We use a Torch (Collobert et al., 2002) implementation of attention sequence to sequence networks from Kim (2016). We train our models on the train set for 20 epochs, and choose the model with the best average loss on the dev set.

Evaluation Metric
To accurately evaluate our model, we perform a functional equality check called DFA-Equal. We employ functional equality because there are many ways to write equivalent regular expressions. For example, (a|b) is functionally equivalent to (b|a), despite their string representations differing. We report DFA-Equal accuracy as our model's evaluation metric, using Kushman and Barzilay (2013)'s implementation to directly compare our results.
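To make the idea of functional equality concrete, the sketch below compares two patterns by testing every string over a small alphabet up to a fixed length. This is a bounded approximation and not the DFA-based check used for the reported metric: it can only refute equivalence within the tested strings. It also relies on Python's `re` module, which has no counterpart for the `&` and `∼` operators in our grammar. The function name, alphabet, and length bound are assumptions.

```python
import re
from itertools import product

def bounded_equivalent(regex_a, regex_b, alphabet="ab", max_len=4):
    """Return True if both patterns accept the same strings up to max_len."""
    pattern_a, pattern_b = re.compile(regex_a), re.compile(regex_b)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            accepted_a = pattern_a.fullmatch(candidate) is not None
            accepted_b = pattern_b.fullmatch(candidate) is not None
            if accepted_a != accepted_b:
                return False  # found a string that separates the two patterns
    return True

same = bounded_equivalent("(a|b)", "(b|a)")        # differ only as strings
different = bounded_equivalent("(a|b)", "(a|b)+")  # second also accepts "aa"
```

An exact check would instead convert both expressions to automata and compare the languages they accept, which removes the dependence on the length bound.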
Baselines We compare our model against two baselines:
BoW-NN: BoW-NN is a simple baseline that is a nearest-neighbor classifier using a bag-of-words representation for each natural language description. For a given test example, it finds the nearest training example by cosine similarity and uses the regexp from that example as its prediction.
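The BoW-NN baseline can be sketched in a few lines. The whitespace tokenization, lowercasing, function names, and the three toy training pairs are assumptions for illustration; they are not drawn from NL-RX or KB13.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector as token counts (assumed: lowercase, split on spaces)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(description, training_pairs):
    """Return the regex of the training description most similar to the query."""
    query = bow(description)
    _, best_regex = max(training_pairs, key=lambda pair: cosine(query, bow(pair[0])))
    return best_regex

# Hypothetical training pairs (description, regex).
training_pairs = [
    ("lines that contain a vowel", ".*[AEIOUaeiou].*"),
    ("lines that start with a number", "[0-9].*"),
    ("lines with a lowercase letter at least once", "([a-z])+"),
]
prediction = predict("lines which start with a number", training_pairs)
```

Because the prediction is copied from a training example, this baseline can only be correct when the target regular expression, or an equivalent one, already appears in the training set.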

Results
Our model significantly outperforms the baselines on the NL-RX dataset and achieves comparable performance to Semantic Unify on the KB13 dataset (Table 4). Despite the small size of KB13, our model achieves state-of-the-art results on this very resource-constrained dataset (824 examples). Using NL-RX, we investigate the impact of training data size on our model's accuracy. Figure 3 shows how

Differences in Datasets
Keeping the previous section in mind, a seemingly unusual finding is that the model's accuracy is higher for the smaller dataset, KB13, than for the larger dataset, NL-RX-Turk. On further analysis, we learned that the KB13 dataset is much less varied and complex than NL-RX-Turk. KB13 contains many duplicates, with only 45% of its regular expressions being unique. This makes the translation task easier because over half of the correct test predictions will be exact repetitions from the training set. In contrast, NL-RX-Turk does not suffer from this lack of variety and contains 97% unique regular expressions. The relative easiness of the KB13 dataset is further illustrated by the high performance of the Nearest-Neighbor baselines on the KB13 dataset.

Conclusions
In this paper we demonstrate that generic neural architectures for generating regular expressions outperform customized, heavily engineered models. The results suggest that this technique can be employed to tackle more challenging problems in broader families of formal languages, such as mapping between language descriptions and program scripts. We have also created a large parallel corpus of regular expressions and natural language queries using typical crowd-sourcing workers, which we make publicly available.