Benchmarking Multimodal Regex Synthesis with Complex Structures

Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce StructuredRegex, a new regex synthesis dataset differing from prior ones in three aspects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed from real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of StructuredRegex over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenges presented by our dataset, such as non-local constraints and multi-modal inputs.


Introduction
Regular expressions (regexes) are known for their usefulness and wide applicability, yet they are hard to understand and write, even for many programmers (Friedl, 2006). Recent research has therefore studied how to construct regexes from natural language (NL) descriptions, leading to the emergence of NL-to-regex datasets including KB13 (Kushman and Barzilay, 2013) and NL-TURK (Locascio et al., 2016). However, KB13 is small in size, with only 814 NL-regex pairs and even fewer distinct regexes. Locascio et al. (2016) subsequently employed a generate-and-paraphrase procedure (Wang et al., 2015) to create the larger NL-TURK dataset. However, the regexes in this dataset are very simple, and the descriptions are short, formulaic, and not linguistically diverse because of the paraphrasing annotation procedure (Herzig and Berant, 2019). As a result, even when models achieve credible performance on these datasets, they completely fail when evaluated on the STACKOVERFLOW dataset (Ye et al., 2019), a real-world dataset collected from users seeking help on StackOverflow. The limited size of this dataset (only 62 NL-regex pairs) makes it unsuitable for large-scale training, and critically, the complexity of regexes it features means that regex synthesis systems must leverage the user-provided positive and negative examples (strings that should be matched or rejected by the target regex) in order to do well.

Figure 2: Examples of complex regexes from STACKOVERFLOW. Each regex can be viewed as a set of components composed with a high-level template. Regex (a), for example, can be viewed as the intersection of two constraints specifying the characteristics of the desired regex (rep means repeat).
(a) "I need to validate the next pattern: starts with "C0" and finish with 4 digits exactly." and(startwith(<C0>),endwith(rep(<num>,4)))
(b) "i need regular expression for : one or two digits then "." and one or two digits." concat(reprange(<num>,1,2),concat(<.>,reprange(<num>,1,2)))
(c) "The input will be in the form a colon (:) separated tuple of three values. The first value will be an integer, with the other two values being either numeric or a string." concat(repatleast(<num>,1),rep(concat(<:>,or(repatleast(<let>,1),repatleast(<num>,1))),2))
To enable the development of large-scale neural models in this more realistic regex setting, we present STRUCTUREDREGEX, a new dataset of English language descriptions and positive/negative examples associated with complex regexes. Using a new data collection procedure (Figure 1), our dataset addresses two major limitations in NL-TURK. First, we generate our regexes using a structured probabilistic grammar which includes macro rules defining high-level templates and constructions that involve multiple basic operators. These grammar structures allow us to sample more realistic regexes, with more terminals and operators, while avoiding vacuous regexes. By contrast, the random sampling procedure in NL-TURK leads to simple regexes, and attempting to sample more complex regexes results in atypical regex structures or even contradictory regexes that do not match any string values (Ye et al., 2019). Second, to achieve more realistic language descriptions, we prompt Turkers to write descriptions based on abstract figures that show the desired regexes. We design a set of visual symbols and glyphs to draw a given regex with minimal textual hints. We thereby avoid priming Turkers to a particular way of describing things, hence yielding more linguistically diverse descriptions.
Using this methodology, we collect a total of 3,520 English descriptions, paired with ground truth regexes and associated positive/negative examples. We conduct a comprehensive analysis and demonstrate several linguistic features present in our dataset which do not occur in past datasets. We evaluate a set of baselines, including grammar-based methods and neural models, on our dataset. In addition, we propose a novel decoding algorithm that integrates constrained decoding using positive/negative examples during inference: this demonstrates the potential of our dataset to enable work at the intersection of NLP and program synthesis. The performance of the best existing approach on STRUCTUREDREGEX only reaches 37%, which is far behind 84% on NL-TURK. However, this simple model can nevertheless solve 13% of the STACKOVERFLOW dataset, indicating that further progress on this dataset can be useful for real-world scenarios.

Structured Regex Generation Process
We first describe the structured generative process we adopt to produce the regexes in our dataset. For better readability, we denote regexes using a domain-specific language (DSL) similar to regex DSLs in prior work (Locascio et al., 2016; Ye et al., 2019). Our DSL has the same expressiveness as a standard regular language and can be easily mapped back to standard regular expressions. To collect the NL-TURK dataset, Locascio et al. (2016) sampled regexes using a hand-crafted grammar similar to a standard regex DSL. However, regexes sampled from this process can easily have conflicts (e.g., and(<let>,<num>)) or redundancies (e.g., or(<let>,<low>)). One solution to this problem is rejection sampling, but this still does not yield regexes with compositional, real-world structure.
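Because the DSL maps directly onto standard regular expressions, the translation can be sketched as a small recursive function. The operator subset, tuple-based AST encoding, and exact translation rules below are our own simplifications for illustration, not the authors' implementation:

```python
import re

# Assumed character-class translations for a few DSL terminals.
TERMINALS = {"<num>": "[0-9]", "<let>": "[A-Za-z]", "<low>": "[a-z]",
             "<cap>": "[A-Z]", "<any>": "."}

def to_regex(node):
    """Translate a DSL AST (tuples like ('rep', child, k)) to Python regex syntax."""
    if isinstance(node, str):  # terminal class or a literal like "<.>"
        return TERMINALS.get(node, re.escape(node.strip("<>")))
    op, *args = node
    if op == "concat":  # n-ary here for brevity
        return "".join(to_regex(a) for a in args)
    if op == "or":
        return "(?:" + "|".join(to_regex(a) for a in args) + ")"
    if op == "rep":
        return f"(?:{to_regex(args[0])}){{{args[1]}}}"
    if op == "repatleast":
        return f"(?:{to_regex(args[0])}){{{args[1]},}}"
    if op == "reprange":
        return f"(?:{to_regex(args[0])}){{{args[1]},{args[2]}}}"
    raise ValueError(f"unsupported operator: {op}")

# Regex (b) from Figure 2: one or two digits, then ".", then one or two digits.
ast = ("concat", ("reprange", "<num>", 1, 2), "<.>", ("reprange", "<num>", 1, 2))
assert re.fullmatch(to_regex(ast), "12.3")
assert not re.fullmatch(to_regex(ast), "123.4")
```

Constructs like startwith and endwith can be handled analogously by padding with ".*" on the appropriate side.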

Figure 3: Examples of our top-level templates and how they cover the three regexes in Figure 2, and an overview of the sub-regexes that can be derived from Cons and Comp:

Comp → Literal | or(Literal, Literal, ...)   # literals such as digits, letters, strings, or sets of literals
     | rep(Expr, k) | repatleast(Expr, k) | reprange(Expr, k, k)   # e.g., 3 digits, 2-5 letters
     | optional(Comp)   # components can be optional

Expr as a category here indicates various constrained sets of sub-regexes. More detail about this structure is available in the full grammar in the appendix.
Figure 4: An example description exhibiting an adversative relation: "A string of numbers and digits that must start with a number except "0"."

Regex (c), for instance, is a list of three segments delimited by a constant. We observe that these three templates capture a wide range of possible regex settings. The first, for example, handles password-validation-style settings where we have a series of constraints to apply to a single string. The second and third reflect matching sequences of fields, which may have shared structure (regex (c)) or be more or less independent (regex (b)).

Structured Grammar
To generate realistic regexes in these forms, we rely on a structured hand-crafted grammar. The top level of our grammar specifies three templates distilled from STACKOVERFLOW examples: INTERSECTION, CONCATENATION, and SEPARATION, which mimic patterns of real-world regexes.
In Figure 3, we show how the regexes in Figure 2 can be derived from our templates. The INTERSECTION template (left) intersects several base constraints with the and operator; the CONCATENATION template (middle) concatenates several base components with the concat operator. SEPARATION (right) is a more complex type, generating a list of constant-separated INTERSECTION or CONCATENATION regexes which may be identical or share common components. Across all templates, the components are sub-regexes falling into a few high-level types (notably Cons and Comp), which are depth-limited to control the overall complexity (discussed in Appendix B.2). To make these component regexes more realistic as well, we design several macro rules that expand to more than one operator. The macros are also extracted from real-world examples and capture complex relations like adversative (Figure 4) and conditional (Table 2) relations.
Although our hand-crafted grammar does not cover every possible construction allowed by the regular expression language, it is still highly expressive. Based on manual analysis, our grammar covers 80% of the real-world regexes in STACKOVERFLOW, whereas the grammar of NL-TURK only covers 24% (see Section 4). Note that some constructions apparently omitted by our grammar are equivalent to ones supported by our grammar: e.g., we do not allow a global startwith constraint in the CONCATENATION template, but this constraint can be expressed by having the first component of the concatenation incorporate the desired constraint.

Sampling from the Regex Grammar
Although our structural constraints on the grammar already give rise to more realistic regexes, we still want to impose further control over the generative process to mimic properties of real-world regexes. For example, there are sometimes repeating components in CONCATENATION regexes, such as regex (b) from Figure 2.
We encourage such regexes by dynamically modifying the probability of applying the grammar rules while we are expanding a regex, based on the state of the entire tree induced so far. For example, suppose we are building regex (b) from Figure 2, and suppose we currently have the partial regex concat(reprange(<num>,1,2),concat(<.>,Comp)), where Comp is a non-terminal that needs to be expanded into a sub-regex. Because we already have reprange(<num>,1,2) and <.> in the current tree, we increase the probability of expanding Comp to generate these particular two sub-regexes, allowing the model to copy from what it has generated before. In addition to copying, we also change the sampling distribution when sampling children of certain grammar constructs to control for complexity and encourage sampling of valid regexes. For example, the child of a startwith expression will typically be less complex and compositional than the child of a Comp expression, so we tune the probabilities of sampling compositional AST operators like or appropriately.
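The copy bias can be sketched as reweighting candidate expansions by whether they already appear in the partial tree. The boost factor and data structures below are illustrative assumptions, not the authors' actual parameters:

```python
import random

# Assumed boost factor for re-using an already-generated subtree.
COPY_BOOST = 5.0

def sample_component(base_rules, generated_subtrees, rng=random):
    """base_rules: list of (subtree, weight) candidate expansions;
    generated_subtrees: subtrees already present in the partial derivation."""
    weighted = []
    for tree, weight in base_rules:
        if tree in generated_subtrees:
            weight *= COPY_BOOST  # encourage repetition, as in regex (b)
        weighted.append((tree, weight))
    total = sum(w for _, w in weighted)
    r = rng.uniform(0, total)
    for tree, w in weighted:
        r -= w
        if r <= 0:
            return tree
    return weighted[-1][0]  # numerical edge case

# With the boost, a previously generated component dominates the samples.
rng = random.Random(0)
rules = [("reprange(<num>,1,2)", 1.0), ("<let>", 1.0)]
counts = {name: 0 for name, _ in rules}
for _ in range(1000):
    counts[sample_component(rules, {"reprange(<num>,1,2)"}, rng)] += 1
assert counts["reprange(<num>,1,2)"] > counts["<let>"]
```

The same weighted-sampling hook can host the per-construct adjustments mentioned above (e.g., damping compositional operators under startwith).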

Positive/Negative Example Generation
The STACKOVERFLOW dataset (Ye et al., 2019) shows that programmers often provide both positive and negative examples to fully convey their intent while specifying a complicated regex. Therefore, we augment our dataset with positive and negative examples for each regex. Our model will use these examples to resolve ambiguity present in the natural language descriptions, and the examples also help Turkers better understand the regexes they are describing during the data collection process. We aim to generate diverse and distinguishing examples similar to human-written ones, which often include corner cases that differentiate the ground truth regex from closely related spurious ones. We achieve this by enumerating examples that cover the states in the deterministic finite automaton (DFA) defined by the given regex and reject similar but incorrect regexes; we employ the Automaton Library (Møller, 2017) for this purpose. For negative examples, randomly sampling strings from the negation of a given regex will typically produce obviously wrong examples rather than the distinguishing negative examples we desire. Therefore, we propose an alternative approach, shown in Figure 5, for generating negative examples. We apply minor perturbations to the ground truth regex to cause it to accept a set of strings that does not intersect with the set recognized by the original regex. Negative examples can then be derived by sampling a positive string from one of these "incorrect" regexes.
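A minimal sketch of the perturbation idea, using Python regex syntax as a stand-in for the DSL and a single perturbation operator (shifting a repetition bound) of our own choosing:

```python
import re

# Sketch: perturb a quantifier bound so the mutated regex accepts strings
# the original rejects; those strings become near-miss negative examples.
def perturb_bounds(pattern):
    """Yield variants of a Python-syntax pattern with one {m,n} bound shifted
    past the original range (a stand-in for the paper's perturbations)."""
    for m in re.finditer(r"\{(\d+),(\d+)\}", pattern):
        hi = int(m.group(2))
        yield pattern[:m.start()] + f"{{{hi + 1},{hi + 2}}}" + pattern[m.end():]

def is_distinguishing_negative(original, candidate):
    return re.fullmatch(original, candidate) is None

original = r"[0-9]{1,2}\.[0-9]{1,2}"        # regex (b) from Figure 2
perturbed = next(perturb_bounds(original))   # first bound becomes {3,4}
near_miss = "123.4"                          # accepted by the perturbed pattern
assert re.fullmatch(perturbed, near_miss)
assert is_distinguishing_negative(original, near_miss)
```

Strings sampled from the perturbed pattern are close to the accepted language but provably outside it, which is what makes them distinguishing.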
For each regex in our dataset, we generate 6 positive examples and 6 negative examples. These numbers are comparable to the average number of examples provided by STACKOVERFLOW users.

Figure Generation
As stated previously, we avoid the paradigm of asking users to paraphrase machine-generated regex descriptions, as this methodology can yield formulaic and artificial descriptions. Instead, we ask users to describe regexes based on figures that illustrate how the regex is built. We show one example figure of a SEPARATION regex in Figure 6, with an associated description: "Three comma separated segments. The first segment is 2 digits. The other two consist of digits or letters but must start with a letter and contain "0"." In general, we abstract a given regex as a series of blocks linked with textual descriptions of its content and constraints. For instance, startwith and endwith are denoted by shading the head or tail of a block. By linking multiple blocks to shared textual descriptions, we hope to encourage Turkers to notice the correlation and write descriptions accordingly. Finally, we use different textual hints for the same concept: "contain x" in Figure 6 may appear as "have x" elsewhere. These figures are rendered for each regex in the MTurk interface using JavaScript.

Crowdsourcing
Task We collected the STRUCTUREDREGEX dataset on Amazon Mechanical Turk (MTurk). For each HIT, the Turkers are presented with a regex figure and a set of positive/negative examples. Then, they are asked to write down several sentences describing the regex, as well as one additional positive example that matches the regex.
We only accept a description if the submitted positive example is matched by the ground-truth regex; this helps filter out some cases where the Turker may have misunderstood the regex. We show an example HIT in Appendix C.
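This acceptance check is mechanically simple once the ground-truth regex is compiled to a standard engine. A sketch, under the assumption that the submitted example must also differ from the examples already shown (the pattern below is one illustrative instantiation of regex (a)):

```python
import re

# Sketch of the automatic HIT check: keep a submission only if the Turker's
# extra positive example is new and is matched by the ground-truth regex
# (compiled here to Python syntax).
def accept_submission(ground_truth, new_example, shown_examples):
    if new_example in shown_examples:
        return False              # must be a genuinely new string
    return re.fullmatch(ground_truth, new_example) is not None

# One instantiation of regex (a): "C0" followed by exactly 4 digits.
gt = r"C0[0-9]{4}"
assert accept_submission(gt, "C01234", ["C00000"])
assert not accept_submission(gt, "X1234", ["C00000"])
assert not accept_submission(gt, "C00000", ["C00000"])
```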
In early pilot studies, we explored other ways of abstractly explaining regexes to Turkers, such as providing more examples and an associated set of keywords, yet none of these methods led to users generating sufficiently precise descriptions. By contrast, our figures fully specify the semantics of the regexes while only minimally biasing Turkers towards certain ways of describing them.
We generated 1,200 regexes (400 from each template), assigned each regex to three Turkers, and collected a total of 3,520 descriptions after rejecting HITs. In general, each Turker spent 2 to 3 minutes on each HIT, and we set the reward to $0.35. The total cost of collecting our dataset was $1,512, and the average cost per description is $0.43.

Quality To ensure the quality of collected responses, we require the Turkers to first take a qualification test which simply requires describing one regex that we have specified in advance. We then check that the description for this regex is sufficiently long and that it contains enough of our manually-written correct base regex concepts. We observed from the responses that different Turkers adopted various styles for describing the same type of regex. For instance, given regex (b) in Figure 2, some Turkers tend to enumerate every component in order, describing it as one or two digits followed by a dot followed by one or two digits; other Turkers prefer grouping identical components and describing the components out of order, describing it as the first and third parts are one or two digits, and the second part is a dot. These distinct styles lead to a diversity of linguistic phenomena, which is further analyzed in Section 4. Because we aim for high linguistic diversity in our dataset, we prohibited a single Turker from doing more than 300 HITs.
Furthermore, we found anecdotal evidence that the task was engaging for users, which we took as a positive signal for description quality. We received messages about our HITs from some Turkers telling us that our HIT was "really interesting" and they "enjoyed doing it."

Splitting the Dataset Since our dataset consists of natural language descriptions written by annotators, there is possibly bias introduced by training and testing on the same annotators (Geva et al., 2019). Therefore, in addition to the standard Train/Development/Test splits, we also form a Test-E (excluded) set which consists only of annotations from annotators unseen in the training set. We ensure that Train, Dev, and the two test sets (Test and Test-E) have mutually exclusive regexes from each other (Test and Test-E can have common regexes), and Test-E is annotated entirely by held-out annotators.

Table 2: Frequency of linguistic phenomena in NL-TURK (TURK) and STRUCTUREDREGEX (STREG), with example NL from STREG.

  Phenomenon            TURK   STREG   Example NL from STREG
  multi-sentence        0%     70%     The string has 6 or more characters. The string must start with a digit.
  ambiguity             2.3%   20.6%   The sequence starts with a letter followed by 2 numbers.
  abstraction           0%     13.3%   The first part of a single string consists of 1 or more "0" followed by 2 capital letters. The second part of the string must follow the same rules.
  non-local constraint  0%     16.7%   There are 3 dash separated strings. The first is 1 to 4 "A". The second and third consist of 1 or 2 "x" followed by 1 to 3 numbers and 2 letters.
  coreference           5.1%   29.7%   The string starts with a number. It ends with 1 to 4 lower or capital letters.
  condition relation    0%     3.5%    If there is a capital letter it must be after a digit.
  adversative relation  0%     3.7%    The string start with capital letter but it should not be a "A".

Dataset Analysis
We demonstrate the advantages of our dataset over prior datasets (Kushman and Barzilay, 2013; Locascio et al., 2016) through both quantitative and qualitative analysis. We list the key statistics of our dataset as well as KB13 and NL-TURK for comparison in Table 1. Compared to past synthetic datasets, our dataset has more diverse and sophisticated language. The average NL length of our dataset is twice that of NL-TURK, and the descriptions contain many more unique words even though our dataset contains fewer regexes. In addition, our dataset contains more complex regexes that are closer to the complexity of real-world regexes found on StackOverflow, whereas regexes in previous datasets are significantly simpler.
Manual Analysis We further manually analyze 150 descriptions from past synthetic datasets and our dataset, and find that ours contains many more multi-sentence descriptions and examples with nontrivial coreference (Table 2). The language from our dataset is organic and diverse, since we allow Turkers to compose their own descriptions. We find that macros and complex constraints in our structured grammar can successfully trigger interesting language. For instance, abstraction arises from repetition in concatenation regexes, and the bottom part of Table 2 illustrates language triggered by complex macros. Furthermore, the complex and ambiguous language highlights the necessity of including examples together with language to fully specify a regex. For instance, ambiguity is common in our descriptions. However, many of the ambiguous descriptions can be resolved with the help of examples.

Table 3: Distribution mismatch analysis with respect to STACKOVERFLOW on past datasets and our dataset. Our dataset covers significantly more words and regexes, and is closer to the real-world dataset.
Comparison to STACKOVERFLOW Since our goal was to produce realistic regex data, we analyze how well the real-world STACKOVERFLOW dataset is covered by data from STRUCTUREDREGEX compared to other datasets (Kushman and Barzilay, 2013; Locascio et al., 2016). We ignore 11 of the STACKOVERFLOW examples that involve the high-level decimal concept, which is beyond the scope of our dataset and past synthetic datasets. In addition, we anonymize all constants and integer parameters (e.g., repeat(<x>,9) is anonymized as repeat(const,int)). The statistics (Table 3) suggest that our dataset is much more similar to real-world regexes on StackOverflow, especially in terms of regex distribution.

Methods
We evaluate the accuracy of both existing grammar-based approaches and neural models, as well as a novel method that targets the multimodal nature of our dataset.

Existing Approaches

SEMANTIC-UNIFY (Kushman and Barzilay, 2013) is a grammar-based approach that relies on a probabilistic combinatory categorial grammar to build regexes. DEEPREGEX (Locascio et al., 2016) directly translates natural language descriptions into regexes using a seq-to-seq model enhanced with attention (Luong et al., 2015), without considering examples. We re-implemented DEEPREGEX with slightly different hyperparameters; we refer to our re-implementation as DEEPREGEX (OURS). DEEPREGEX+FILTER (Ye et al., 2019) adapts DEEPREGEX to take examples into account by simply filtering the k-best regexes based on whether a regex accepts all the positive examples and rejects all the negative ones.
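The FILTER step amounts to scanning the model's k-best list for the highest-scoring candidate consistent with all examples. A sketch in Python-regex terms (the fallback behavior is our assumption):

```python
import re

# Sketch of the FILTER step: walk the decoder's k-best list (ordered by model
# score) and return the first candidate consistent with all examples.
def filter_kbest(candidates, positives, negatives):
    for pattern in candidates:
        try:
            ok_pos = all(re.fullmatch(pattern, p) for p in positives)
            ok_neg = all(re.fullmatch(pattern, n) is None for n in negatives)
        except re.error:
            continue              # skip syntactically invalid outputs
        if ok_pos and ok_neg:
            return pattern
    return candidates[0] if candidates else None  # fall back to the 1-best

best = filter_kbest([r"[0-9]{3}", r"[0-9]{1,2}\.[0-9]{1,2}"],
                    positives=["12.3"], negatives=["123"])
assert best == r"[0-9]{1,2}\.[0-9]{1,2}"
```

Note that the examples are consulted only at inference time; the decoder itself is unchanged, which is the limitation the example-guided decoding below addresses.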
Example-Guided Decoding Although DEEPREGEX+FILTER is able to take advantage of positive and negative string examples, these examples are completely isolated from the training and inference phases. We propose to make use of examples during inference with the technique of over- and under-approximation (Lee et al., 2016) used in the program synthesis domain. The core idea of our approach is that, for each partially completed regex during decoding, we use the approximation technique to infer whether the regex can possibly match all positive examples or reject all negative examples. If not, we can prune this partial regex from our search. This approach allows us to more effectively explore the set of plausible regexes without increasing the computational budget or beam size.
As an example, consider the ground truth regex and(startwith(<low>),endwith(<num>)) with one corresponding positive example "x00". Suppose that the decoder has so far generated the incomplete regex and(startwith(<cap>),. To produce a syntactically valid regex, the decoder needs to generate a second argument for the and. By appending star(<any>) as its second argument, we can see that no completion will accept the given positive example, allowing us to reject this regex from the beam. Under-approximation works analogously, completing regexes with maximally restrictive arguments and checking that negative examples are rejected.

Table 4: DFA-equivalent accuracy on prior datasets and our dataset. The performance on our dataset using any model is much lower than the performance on existing datasets.
We integrate the aforementioned technique in the beam decoding process by simply pruning out bad partial derivations at each timestep. We refer to this approach as DEEPREGEX + APPROX.
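The over-approximation side of this check can be sketched as follows: every unfinished hole in a partial derivation is filled with the loosest completion, star(<any>) (".*" in Python syntax), and the beam entry is pruned if even that cannot match every positive example. The token encoding with "?" holes and the lookahead rendering of startwith are our own simplifications:

```python
import re

# Sketch of over-approximation pruning during beam decoding. A partial
# derivation is closed off by filling every unfinished hole with the loosest
# completion; if even this over-approximation rejects a positive example,
# no full completion can accept it, so the entry is pruned.
def over_approximate(partial_tokens):
    return "".join(".*" if tok == "?" else tok for tok in partial_tokens)

def can_match_positives(partial_tokens, positives):
    approx = over_approximate(partial_tokens)
    return all(re.fullmatch(approx, p) for p in positives)

# Partial regexes encoding startwith(<cap>) vs. startwith(<low>) as a
# lookahead prefix plus a hole for the unfinished second argument:
positives = ["a7"]
assert not can_match_positives(["(?=[A-Z])", "?"], positives)  # prune this entry
assert can_match_positives(["(?=[a-z])", "?"], positives)      # keep this entry
```

Under-approximation is the mirror image: fill holes with the most restrictive completion and prune entries that fail to reject some negative example.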

Comparison to Prior Datasets
We evaluate the baseline models on KB13, NL-TURK, and our dataset (Table 4). The results show that our dataset is far more challenging than existing datasets. The traditional grammar-based baseline can scarcely solve our dataset. The best baseline, DEEPREGEX + FILTER, achieves 77.7% on KB13 and 83.8% on NL-TURK when these datasets are augmented with examples, but can only solve 37.2% of our dataset. Additionally, the comparison between DEEPREGEX and DEEPREGEX + FILTER demonstrates that simply filtering the outputs of the neural model leads to a substantial performance boost on all the datasets. This supports the effectiveness of the way we specify regexes, i.e., using both natural language descriptions and examples. Table 5 shows the detailed accuracy on different regex templates on both the Test and Test-E sets. Our DEEPREGEX + APPROX achieves the best accuracy, with 5.6% and 7.9% improvements over DEEPREGEX + FILTER on Test and Test-E, respectively, since it leverages examples more effectively using over- and under-approximations during search.

Detailed Results on STRUCTUREDREGEX
Accuracy varies across different types of regexes. Generally, models perform best on concatenation regexes, slightly worse on intersection regexes, and worst on separation regexes. Concatenation regexes usually have straightforward descriptions that list simple components one by one. Intersection descriptions can be more complicated because of the high-level macros specified by our grammar. Separation descriptions are the most complex, often involving coreference and non-local features. Performance on Test-E is 12% lower than on Test because the models have not been trained on the language patterns of the unseen annotators.

Transferability Results
Finally, we investigate whether a model trained on our dataset can transfer to the STACKOVERFLOW dataset. As in Section 4, we ignore instances requiring the decimal concept and only evaluate on the subset of STACKOVERFLOW with 51 instances. We compare our dataset with NL-TURK for this task. As shown in Table 6, DEEPREGEX trained on NL-TURK completely fails on STACKOVERFLOW and even fails to predict reasonable regexes that are consistent with the examples. This is caused by the fact that the NL-TURK dataset contains formulaic descriptions and shallow regexes that are not representative of real-world tasks. DEEPREGEX trained on our dataset can at least achieve 9.8% accuracy on the STACKOVERFLOW dataset because the English descriptions in our dataset better match the desired task.
Our DEEPREGEX + APPROX model successfully solves 13.7% of the tasks and finds consistent regexes for 38% of them, which is credible given that the performance of the same model on the Test-E set is only 30%. Additional challenges in STACKOVERFLOW include instances involving large numbers of constants and slightly more formal language, since the StackOverflow users are mainly programmers. However, we believe the transfer results show that improved performance on our dataset may transfer to STACKOVERFLOW as well, since some of the same challenges (e.g., unseen language) are also present in our Test-E set.

Human Performance Estimate
It is difficult to hire Turkers to estimate a human performance upper bound, because our task requires reckoning with both the descriptions and the positive/negative examples. Unlike many NLP tasks, where an example with ambiguous language is fundamentally impossible to solve, here the examples may actually still allow a human to determine the correct answer with enough sleuthing. But to perform this task, crowdworkers would minimally need to be trained to understand the DSL constructs and how they compose, which would require an extensive tutorial and qualification test.
To do the task well, Turkers would need a tool to do on-the-fly execution of their proposed regexes on the provided examples. We instead opted for a lighter-weight verification approach to estimate human performance. We adopted a post-editing approach on failure cases from our model, where we compared the model's output with the input description and examples and corrected inconsistencies.
Specifically, we sample 100 failure examples from the test set (Test plus Test-E) and manually assess the failure cases. We find that 78% of failure cases contain descriptions that describe all components of the target regexes, but our seq-to-seq models are insufficient to capture them. There are indeed some mis- or under-specified descriptions, such as those not mentioning the optionality of one component or mistaking "I" for "l" in constants. An additional 9% (out of 100) of the errors could be fixed using the provided examples. This leaves roughly 13% of failure cases that are challenging to solve.
Considering that the model already achieves 43.6% accuracy on the test set, we estimate human performance to be around 90%.

Related Work

Data collection in semantic parsing Collecting large-scale data for semantic parsing and related tasks is a long-standing challenge (Berant et al., 2013; Wang et al., 2015). Wang et al. (2015) proposed the generate-and-paraphrase framework, which has been adopted to collect datasets in various domains (Locascio et al., 2016; Ravichander et al., 2017; Johnson et al., 2017). However, this process often biases annotators towards using formulaic language (Ravichander et al., 2017; Herzig and Berant, 2019).
Similar to our work, past work has sought to elicit linguistically diverse data using visual elements for semantic parsing (Long et al., 2016), natural language generation (Novikova et al., 2016), and visual reasoning (Suhr et al., 2017, 2019). However, for these other tasks, the images used are depictions of an inherently graphical underlying world state; e.g., the NLVR dataset (Suhr et al., 2017) and NLVR2 (Suhr et al., 2019) are based on reasoning over the presented images, and the Tangrams dataset (Long et al., 2016) involves describing shape transformations. By contrast, regexes are typically represented as source code; there is no standard graphical schema for depicting the patterns they recognize. This changes the properties of the generated descriptions, leading to higher levels of compositionality and ambiguity because what is being described is not naturally an image.
Program and regex synthesis Recent research has tackled the problem of program synthesis from examples (Gulwani, 2011; Gulwani and Jain; Alur et al., 2013; Wang et al., 2016; Feng et al., 2018; Devlin et al., 2017; Nye et al., 2019). A closer line of work to ours uses both examples and natural language input (Yaghmazadeh et al., 2017; Ye et al., 2019; Andreas et al., 2018), which involves fundamentally different techniques. However, our work does not rely on the same sort of program synthesizer to build final outputs (Yaghmazadeh et al., 2017; Ye et al., 2019). Moreover, Andreas et al. (2018) only use language at train time, whereas we use NL at both train and test time.

Footnote: In addition, the first author manually wrote regexes for 100 randomly sampled examples and achieved an accuracy of 95% (higher than the estimate). However, the author also has a strong prior over what synthetic regexes are likely to be in the data.

Conclusion
We introduce STRUCTUREDREGEX, a new dataset for regex synthesis from natural language and examples. Our dataset contains compositionally structured regexes paired with linguistically diverse language, and organically includes distinguishing examples. Better methods are needed to solve this dataset; we show that such methods might generalize well to real-world settings.

B.2 Implementation Details
Intersection While building INTERSECTION regexes, we impose context-dependent constraints mainly to avoid combinations of regexes that are redundant or in conflict. Conflicts often occur between a ComposedBy constraint and the other constraints.
A ComposedBy constraint indicates the allowed characters; e.g., repatleast(or(<let>,<spec>),1) means there can only be letters and special characters in the matched string. Therefore, when we already have such a constraint in the tree, we only allow terminals to be selected from the valid subset of <let> and <spec> while expanding the other subtrees.
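The terminal-filtering rule can be sketched as intersecting the terminal vocabulary with the classes licensed by the ComposedBy constraint. The subclass table below is a simplified assumption covering only a few classes:

```python
# Sketch of conflict avoidance under a ComposedBy constraint: once the
# allowed classes are fixed, later subtrees sample terminals only from that
# set. Class names follow the DSL; the subclass table is a simplification.
ALL_TERMINALS = ["<num>", "<let>", "<cap>", "<low>", "<spec>"]

SUBCLASSES = {           # <cap>/<low> remain valid under a <let> constraint
    "<num>": {"<num>"},
    "<let>": {"<let>", "<cap>", "<low>"},
    "<spec>": {"<spec>"},
}

def allowed_terminals(composed_by):
    """composed_by: classes fixed by a ComposedBy constraint, e.g.
    {'<let>', '<spec>'} from repatleast(or(<let>,<spec>),1)."""
    valid = set()
    for cls in composed_by:
        valid |= SUBCLASSES[cls]
    return [t for t in ALL_TERMINALS if t in valid]

assert allowed_terminals({"<let>", "<spec>"}) == ["<let>", "<cap>", "<low>", "<spec>"]
assert allowed_terminals({"<num>"}) == ["<num>"]
```

Filtering the terminal set up front rules out conflicts like and(<let>,<num>) without resorting to rejection sampling.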
Concatenation CONCATENATION regexes are a sequence of simple components. As stated above, our grammar encourages the phenomenon of repetition that commonly occurs in real regexes by copying existing sub-trees.
Separation SEPARATION regexes have several subfields, which can be specified by either INTERSECTION or CONCATENATION regexes, and which are delimited by a constant. The fields of real regexes are often related, i.e., they share common components. For instance, the format of U.S. phone numbers is "xxx-xxx-xxxx" where "x" is a digit. Here the three fields are all digits but differ in length. Similar to the CONCATENATION template, we alter the distribution so as to copy already generated subtrees.
We also allow a class of SEPARATION with an arbitrary number of identical fields separated by a constant (e.g., a list of comma-separated numbers).
Complexity Control We aim to create a collection of complicated regexes, but we do not wish to make them needlessly complex along unrealistic axes. We assess the complexity of generated regexes using a measure we call semantic complexity, which roughly measures how many factors would need to be specified by a user. Generally, each constraint or component counts for one degree of semantic complexity; e.g., not(contain(x)) and repeat(x,4) are of complexity level one. High-level macro constraints are of complexity level two since they need more verbal explanation. We limit the complexity degree of all our generated regexes to be strictly no more than six. More details about the number of nodes and depth of our regexes can be found in Section 4.
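A sketch of the budget check, treating a regex as its list of top-level constraints/components; the macro names and tuple encoding are illustrative assumptions:

```python
# Sketch of the semantic-complexity budget: each top-level constraint or
# component counts one degree, high-level macros count two, and regexes over
# the cap of six are rejected during generation.
MACROS = {"condition", "adversative"}   # assumed macro names

def semantic_complexity(constraints):
    """constraints: list of (kind, ...) top-level constraints/components."""
    return sum(2 if kind in MACROS else 1 for kind, *_ in constraints)

def within_budget(constraints, cap=6):
    return semantic_complexity(constraints) <= cap

# not(contain("x")) and repeat("x", 4) each contribute one degree:
simple = [("not_contain", "x"), ("repeat", "x", 4)]
assert semantic_complexity(simple) == 2 and within_budget(simple)
# one macro (2) plus five plain constraints (5) exceeds the cap of six:
assert not within_budget([("condition", None)] + [("contain", c) for c in "abcde"])
```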

C HIT Example
See Figure 8.

Instructions:
In this task, you will be writing down descriptions of the patterns you see in a group of strings. For each HIT, you'll be given a figure visually specifying a pattern and a few examples of strings following or not following the pattern to help you understand it. Please write a description (generally 1-4 sentences) that describes the pattern. In addition, please write one additional string that follows the pattern. Things to keep in mind:
• Please describe the pattern underlying the string examples, not the sequence of strings itself. Do not write things like "the first line ..., the second line ...."
• Try to be precise about describing the pattern, but also concise. Don't describe the same property of the strings in multiple ways.
• You are not required to use the keywords in the figure. If you can think of another way to express the intent, that's okay.
• Please try to write natural and fluent sentences.
• The additional string example must be different from the examples shown.
Example strings that follow the pattern: a51,B457  a74,B23  a09,849
Example strings that do not follow the pattern: b55,B193  a7,B23  a09,1

Figure 8: HIT prompt for the description writing task. We particularly emphasize in the instructions that Turkers should use precise and original language.