Regular Expression Guided Entity Mention Mining from Noisy Web Data

Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Due to the vast diversity of web documents and the ways in which they are generated, even seemingly straightforward tasks, such as identifying date mentions in a document, become very challenging. It is reasonable to claim that it is impossible to create an RE that is capable of identifying such entities in web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing REs for a particular entity type. Those REs are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.


Introduction
Named Entity Recognition (NER) is the task of automatically locating, extracting, and classifying contiguous pieces of strings, which represent entities of interest, in text. Classification (or typing) seeks to assign pre-defined categories (e.g., person, organization, location, expressions of time, monetary values, and emails) to each extracted piece of text. NER is a subtask of the broader problem of Information Extraction (IE) from text (Chang et al., 2006; Etzioni et al., 2005; Finkel and Manning, 2009; Nadeau and Sekine, 2007; Shen et al., 2015). Named entities usually refer to entity names that describe unique identifiers of people, locations, movies, events, and organizations. There is a large class of entities that are not "named," such as expressions of time, emails, and course identifiers. Their main characteristic is that they often follow an underlying syntactical pattern, which can be fully described or well approximated by Regular Expressions (REs).
Despite being the workhorse of many entity recognition tasks, REs have a number of drawbacks. The construction of highly accurate REs is difficult and requires specific technical skills. Even for a seemingly simple task such as recognizing emails, 361 REs have been proposed on RegExLib.com. Moreover, REs are brittle and difficult to maintain. These obstacles have motivated work on automatic inference of REs (Banko et al., 2007; Li et al., 2008; Bartoli et al., 2018), where the objective is to develop approaches that are fast and deployable in real time. However, the existing approaches tend to require a large number of examples to cover both the alphabet and the possible syntactic patterns. Moreover, they often produce overly complicated or long REs or combinations of REs (Bartoli et al., 2016). One of the most complete REs for emails has nearly 6,500 characters (Millner, 2008)! Web documents are an important domain for data extraction. Importantly, the Web is not a place where data follows an "underlying syntactical pattern" at scale. A datetime RE for New York Times news articles may not work for articles from Le Figaro or Al Jazeera. Small typos throw off REs, which then match nothing. A concrete Web example is data-timestamp="Thu Oct 05 2017 10:33:05 -0500">, which contains an unexpected space before "-0500". This small typo can easily render a complex and painstakingly constructed datetime RE useless. At such scale, any attempt to fully understand an RE and debug it when it fails is futile. New automated or semiautomated tools are needed to either supersede REs or to work in tandem with them. In this work, we focus on the latter.
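The brittleness described above can be illustrated with a minimal sketch. The two patterns below are hypothetical and are not taken from the paper or from RegExLib.com; they only assume a timestamp layout similar to the example, with and without tolerance for whitespace before the UTC offset.

```python
import re

# Hypothetical strict pattern (an assumption, not one of the paper's REs):
# weekday, month, day, year, time, then a UTC offset glued to the seconds.
STRICT = re.compile(
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{4} \d{2}:\d{2}:\d{2}[+-]\d{4}"
)

# Hypothetical tolerant variant: optional whitespace before the offset.
TOLERANT = re.compile(
    r"[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{4} \d{2}:\d{2}:\d{2}\s*[+-]\d{4}"
)

# The first string is the layout the strict pattern was written for;
# the second is the Web example with the unexpected space.
expected = 'data-timestamp="Thu Oct 05 2017 10:33:05-0500">'
observed = 'data-timestamp="Thu Oct 05 2017 10:33:05 -0500">'

strict_hits = [bool(STRICT.search(s)) for s in (expected, observed)]
tolerant_hits = [bool(TOLERANT.search(s)) for s in (expected, observed)]

print(strict_hits)    # the strict pattern misses the string with the extra space
print(tolerant_hits)  # the tolerant pattern matches both strings
```

Patching the pattern fixes this one variation, but each new source tends to introduce another, which is the maintenance burden the paragraph above refers to.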
In this paper, we target the problem of detecting the presence of entity mentions that follow or closely resemble patterns that can be described by REs. Unlike much of the previous body of work on this topic, we do not focus on learning or inferring highly accurate REs for entity identification (Prasse et al., 2012; Bui and Zeng-Treitler, 2014; Li et al., 2008; Banko et al., 2007; Brauer et al., 2011; Bartoli et al., 2016). We aim instead to show that deep learning can leverage imperfect REs and achieve very high accuracy while requiring only modest human involvement.
Suppose the goal is to recognize datetime string expressions. We use some reasonable REs R for datetime to generate a weakly labeled training dataset from a large corpus of Web documents, e.g., news articles. We train a deep neural network on this data. Denote this model M RE . To our surprise, M RE is already capable of recognizing the presence of datetime expressions beyond those recognized by R. Furthermore, with the addition of a very small number of training samples (between 20 and 50 instances) from a human labeler, we obtain a model M RE+human that is superior to M RE by a significant margin. In general, complex systems do not generalize easily with the addition of new data, because the amount of labeled data required to provide good coverage grows exponentially with the complexity of the problem (Chiu and Nichols, 2015; Lample et al., 2016; Huang et al., 2015; Mahajan et al., 2018). We show that there is an opportunity for faster convergence to a generalized recognizer for this class of entities.
The main contributions of this paper are:
• We show how, starting from REs R that recognize a fraction of the entities of a given type E (say, email), we can pretrain a deep neural network (NN) model that can be a richer recognizer of entities of type E than R.
• We show that we can fine-tune the pretrained model to recognize an even larger set of entities of type E with the addition of a small number of labeled instances, as few as 20.
The paper is organized as follows. Section 2 gives an overview of the related work. Section 3 describes our method. Section 4 presents our methodology for parameter learning and experimental setup. Section 5 gives the experimental results. Finally, Section 6 concludes the paper.

Related Work
This section reviews the lines of research that we deem most closely related to our work.
The problem of inducing regular expressions has been an active area of work for more than two decades. One line of work focuses on improving the initial REs by identifying the true or false matches (Li et al., 2008; Murthy et al., 2012; Cetinkaya, 2007; Cochran et al., 2015). Another line of work attempts to directly induce REs from positive and negative sample strings (Fernau, 2009; Denis, 2001). The common approaches include generation of prefix and suffix automatons that represent overlapping syntactical features of the entities on character and token level (Brauer et al., 2011) and the automatic creation of REs based on genetic programming (Bartoli et al., 2012, 2014, 2016, 2018). Constraining NN training to comply with known rules has also been an active research topic. Hu et al. (2016) propose integrating constraints in the form of first-order logic rules into the training of NNs. Alashkar et al. (2017) train an NN by minimizing a joint loss based on prediction of labels and adherence to predefined rules. Locascio et al. (2016) propose training an LSTM NN to generate REs from sample pieces of text. Luo et al. (2018) incorporate knowledge of REs into the training of NNs at three different levels: as input features to NNs, as regularization of the outputs of NN layers, or as a reward/penalty in the loss functions of NNs.
Unlike the aforementioned work, we do not attempt to learn explicit REs and do not force the outputs of NN layers to match predetermined rules. Instead, we leverage REs as a means of generating a large quantity of weak labels from unlabeled data and use such data to pre-train an NN to recognize the provided REs. We then fine-tune this NN on human-labeled data to exceed the accuracy of the REs. A similar approach is effective in other domains. For example, Felbo et al. (2017) proposed pre-training an NN on millions of tweets labeled by emojis before fine-tuning it for sentiment analysis. Mahajan et al. (2018) pre-trained an NN on billions of images labeled by hashtags before fine-tuning it for various computer vision tasks.

Methodology
We describe the proposed framework here.
Problem Definition: Given a text string t and an entity type E, the task is to predict whether t contains an entity mention of type E. We treat this task as binary classification. To build the classifier, we assume that we are given a large corpus of unlabeled text strings T = {t 1 , t 2 , ..., t n }. The challenge is to train the classifier with the minimal human effort. We allow a human expert to help in two ways: (1) construct a new RE or find an RE created by others, and (2) label an unlabeled string. For the purposes of this paper, we assume that one or more REs suitable for entity identification are already available and that human effort refers only to string labeling. The available REs might have an arbitrary precision and recall. We will analyze the impact of the RE quality on classification accuracy in the experimental section.
Solution Framework: An overview of the proposed framework is illustrated in Figure 1.
STEP 1. All of the unlabeled strings from T are fed into an RE annotator. If any of the provided REs recognizes t i , it is weakly labeled as y i = 1; otherwise, it is weakly labeled as y i = 0.
STEP 2. An NN model M RE is trained on the weakly labeled data D RE = {(t i , y i ) | i = 1, 2, ..., n}. Given a large number of sample strings, we expect to be able to train an NN with high accuracy on D RE .
STEP 3. A subset of m strings is sampled randomly from T and a human annotator labels each of the sampled strings. String t i is labeled as y i = 1 if the annotator recognizes an entity of type E in t i , and as y i = 0 if not. We denote the resulting strongly labeled data set as D human = {(t i , y i ) | i = 1, 2, ..., m}, where m ≪ n.
STEP 4. The pre-trained NN M RE is fine-tuned on D human . We call the resulting NN M RE+human . For comparison, we also train a randomly initialized NN directly on D human . We call this NN M human . The expectation is that the pre-trained NN captures very useful information about the entity type E and that fine-tuning is more effective than training a new NN from scratch.
RE Annotator: Let us denote the set of REs available for a specific entity type E as R. R may be either created by human experts or generated automatically by tools such as those of Li et al. (2008) and Bartoli et al. (2018), which require a human-labeled subset of T . Both approaches are human-intensive.
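The weak labeling performed by the RE annotator in STEP 1 can be sketched in a few lines. The function name `weak_label`, the two email patterns, and the toy corpus are all hypothetical; the patterns are deliberately imperfect and are not the REs used in the experiments.

```python
import re
from typing import Iterable, List, Tuple


def weak_label(strings: Iterable[str], patterns: List[str]) -> List[Tuple[str, int]]:
    """Build D_RE: pair each string with 1 if any RE in R matches it, else 0."""
    compiled = [re.compile(p) for p in patterns]
    return [(t, int(any(c.search(t) for c in compiled))) for t in strings]


# Hypothetical set R for the email entity type (an assumption for illustration).
EMAIL_RES = [r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", r"mailto:\S+"]

corpus = [
    "contact: jane.doe@example.com",      # true mention, matched by the first RE
    '<a href="mailto:help">help</a>',     # no real address, but the second RE fires
    "no entity mention here",             # no match
]

d_re = weak_label(corpus, EMAIL_RES)
for text, label in d_re:
    print(label, text)
```

The second string receives a positive weak label even though it contains no email address, which is the kind of label noise the NN trained in STEP 2 has to tolerate.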
Deep Learning Architectures: We do not have a preference for any particular deep learning architecture to train M RE , as long as it can handle character-level inputs and produce binary outputs. We make no strict assumptions about the strings in T . They may contain sentences from a formal news article, pieces of HTML code, or a mixture of formal and informal text. For this reason, we treat t i as a sequence of characters by default. NN architectures that meet our condition include but are not limited to: CNNs (Kim, 2014), BiLSTMs (Lai et al., 2015), and BiLSTMs with self-attention (Lin et al., 2017). For this paper we implemented a BiLSTM architecture. However, we also tested a CNN architecture, reaching similar conclusions. Our architecture contains an embedding layer to project each character into a vector, 2 BiLSTM layers to encode a sequence of character embeddings into a sequence of hidden vectors, and a max pooling layer followed by 2 fully-connected layers to project the hidden vectors into binary labels (Figure 1, on the right).
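A minimal PyTorch sketch of the layer stack described above is given below. It is not the authors' implementation: the class name, the byte-range character encoding, the vocabulary size, the two-logit output, and the default sizes for the BiLSTM and fully-connected layers are assumptions. The embedding size of 100, the tanh activation, and the dropout placement follow the hyperparameters reported in the experimental setup.

```python
import torch
import torch.nn as nn


class CharBiLSTMClassifier(nn.Module):
    """Embedding -> 2 BiLSTM layers -> max pooling over time -> 2 FC layers."""

    def __init__(self, vocab_size=256, emb_dim=100, n_hidden=100, n_fc=100, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, n_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(p_drop)
        self.fc1 = nn.Linear(2 * n_hidden, n_fc)
        self.fc2 = nn.Linear(n_fc, 2)  # two logits: no mention / mention (assumed)

    def forward(self, char_ids):                      # (batch, seq_len)
        x = self.drop(self.embed(char_ids))           # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                           # (batch, seq_len, 2 * n_hidden)
        # Max over the time axis; padded positions are included in this sketch.
        pooled = self.drop(h.max(dim=1).values)       # (batch, 2 * n_hidden)
        z = self.drop(torch.tanh(self.fc1(pooled)))   # (batch, n_fc)
        return self.fc2(z)                            # (batch, 2)


def encode(text, max_len=64):
    """Assumed encoding: code points clipped to 255, index 0 used as padding."""
    ids = [min(ord(c), 255) for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


model = CharBiLSTMClassifier().eval()
batch = torch.tensor([encode("Thu Oct 05 2017 10:33:05 -0500"), encode("defer.js")])
with torch.no_grad():
    logits = model(batch)
print(tuple(logits.shape))
```

A single sigmoid output would serve equally well for binary classification; two logits are used here only so that a standard cross-entropy loss applies directly.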
Fine-tuning: Once we have a model M RE trained on weak labels, there are multiple ways to improve the weak model with human annotations to obtain M RE+human . One common way is to freeze the parameters of all other layers of M RE and fine-tune only the last fully-connected layer (Donahue et al., 2014). Felbo et al. (2017) propose a 'chain-thaw' strategy, which freezes all layers, then sequentially unfreezes and fine-tunes a single layer at a time. We adopt a less costly strategy proposed by Erhan et al. (2010), which uses the weights learned in M RE to initialize M RE+human and starts training M RE+human immediately with the human annotations.
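The initialization-based strategy adopted above can be sketched as follows. The tiny feed-forward network, the random stand-in data, and the choice of the Adam optimizer are assumptions made to keep the example short; only the idea of copying all weights from M RE and then training every layer on the human-labeled data comes from the text.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_net():
    # Tiny stand-in network; the paper's model is a character-level BiLSTM.
    return nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 2))


m_re = make_net()          # pretend this model was already trained on D_RE
m_re_human = make_net()
# Initialize M_RE+human with the weights learned by M_RE.
m_re_human.load_state_dict(copy.deepcopy(m_re.state_dict()))

# Hypothetical strongly labeled set D_human: 20 feature vectors, binary labels.
x_human = torch.randn(20, 8)
y_human = torch.randint(0, 2, (20,))

optimizer = torch.optim.Adam(m_re_human.parameters(), lr=0.001)  # optimizer assumed
loss_fn = nn.CrossEntropyLoss()

first_loss = None
for epoch in range(50):    # no layer is frozen; all parameters are updated
    optimizer.zero_grad()
    loss = loss_fn(m_re_human(x_human), y_human)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()
last_loss = loss.item()
print(first_loss, last_loss)
```

Compared with freezing or chain-thaw, this variant needs no per-layer schedule, at the cost of letting the small labeled set modify every parameter.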

Experiment Design
We aim to answer three research questions in our experiments:
Q1. Is it possible to train an accurate NN classifier with a limited number of human-generated labels?
Q2. What is the difference between REs and an NN pretrained on those REs?
Q3. Does the quality of REs matter?

Data Sets
We use four datasets, which are described in Table 1 (Bartoli et al., 2016). We generate 49,002 chunked strings for this task. Each text instance contains the bill date (and time) and the location (index) of the datetime substring in the text.
• Email Address: The dataset is a collection of publicly available Enron email addresses from (Li et al., 2008; Brauer et al., 2011). It has 29,035 chunked strings in total.
To answer the above 3 questions, we manually label 6,000 random strings from each data set. We create our training and testing data from this sample. The remaining strings are used to create D RE . We list the human effort spent on labeling each data set in Table 1.
20170925144113.js'defer> is an example of a negative instance. We give examples of positive and negative instances in Table 1. The presence of an entity mention of a desired type is highlighted in bold in positive instances. Our human annotated data sets are attached as supplementary materials.

Regular Expression Generation
The number and the source of REs used to train the deep model M RE are listed in Table 2.
For the recognition of Date/Time entity mentions, we use a set of 25 REs. For the task of course number identification, we use four REs, one of which is borrowed from the results learned by ReLIE (Li et al., 2008), while the remaining three are from Murthy et al. (2012). The regular expressions used to extract the entity mentions of date are all from Murthy et al. (2012). For email address, we use the top five REs from the RE Library website (footnote 3).

Experimental Setup
Models in Comparison We compare 5 models on the 4 data sets: (1) Naive, which always predicts 0, because 0 is the majority class on all 4 tasks. (2) RE, which uses the set of REs R designed by experts or tools to weakly label the strings. Although we have multiple REs for each task, R can be any subset of the available REs.
(3) M RE , which is the model pretrained on the weak labels generated by R in model (2). (4) M human , which is the model trained only on human annotations. (5) M RE+human , which is the model M RE fine-tuned on human annotations.
Evaluation Metrics For each data set, we divide the 6,000 strings with human annotations into 5 folds. We leave one fold as our test data. The training data is selected from the other 4 folds. We report four scores for each model: Accuracy (ACC), F1, Precision, and Recall. We report the average results over three random repetitions in Section 5.
3 http://www.regexlib.com/
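For reference, the four reported scores can be computed from binary predictions as in the sketch below, using the standard definitions. The function name and the toy label vectors are hypothetical; the last call reproduces the Naive baseline, which always predicts 0.

```python
from typing import Dict, List


def binary_scores(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


# Toy test fold (made-up labels): 3 positives out of 10 strings.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

scores = binary_scores(y_true, y_pred)
naive = binary_scores(y_true, [0] * len(y_true))  # Naive model: always predict 0
print(scores)
print(naive)
```

The Naive baseline shows why accuracy alone is uninformative when mentions are rare: it scores well on accuracy while its F1 is zero.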
Hyperparameters We use 100 dimensions in the embedding layer. We set the activation function of the first fully-connected layer to tanh. The batch size is set to 300. To avoid overfitting, we also add dropout layers after the embedding layer, the max pooling layer, and the first fully-connected layer, with a dropout rate of 0.5. Our implementation is in PyTorch. We tune the learning rate (lr), the hidden unit size (nhidden) of the BiLSTM layers, and the output size (nfc) of the first fully-connected layer by 5-fold cross validation on a random sample of 6,000 strings from D RE , for the sake of expediency. The ranges of selection are: lr ∈ [0.0002, 0.0005, 0.001, 0.002, 0.004, 0.008, 0.015], nhidden ∈ [50, 75, 100, 125, 150, 200] and nfc ∈ [20, 50, 100, 200, 500]. We use the random search algorithm proposed by Bergstra and Bengio (2012), which has been shown to be more effective than grid search. The hyperparameters used to train M human and M RE+human are identical to those used to train M RE .
We train 2 epochs for M RE on weakly labeled data. M human and M RE+human are trained for 50 epochs on strongly labeled data.
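The random search over the three tuned hyperparameters can be sketched as below. The candidate values are the ranges listed above; the function names, the number of trials, and the toy objective are assumptions. In a real run, `evaluate` would return the mean 5-fold cross-validation score of a model trained with the sampled configuration.

```python
import random

# Candidate values as listed in the hyperparameter description.
SEARCH_SPACE = {
    "lr": [0.0002, 0.0005, 0.001, 0.002, 0.004, 0.008, 0.015],
    "nhidden": [50, 75, 100, 125, 150, 200],
    "nfc": [20, 50, 100, 200, 500],
}


def sample_config(rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials, seed=0):
    """Keep the best-scoring configuration among n_trials random draws."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg):
    # Placeholder score (an assumption); stands in for a cross-validated F1.
    return -abs(cfg["lr"] - 0.002) - abs(cfg["nhidden"] - 100) / 1000


best_cfg, best_score = random_search(toy_objective, n_trials=20)
print(best_cfg, best_score)
```

The full grid here has 7 × 6 × 5 = 210 configurations, so a fixed budget of random trials evaluates only a fraction of them.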

Experimental Results
In this section, we evaluate the proposed framework with extensive experiments on the 4 entity recognition tasks. We use the empirical results to understand how the quality of initial REs impacts our conclusions.

Entity Mention Detection with Limited Human Annotations
We report the comparisons of the 5 models in Table 3    annotations greatly increase the coverage of the initial REs for entity mentions. The Bill Date task is hard, since the initial REs already achieve Precision = 98% and entity mentions are really rare (ACC = 93% in Naive model). We still are able to improve the Recall by 15%, but at the expense of reduced Precision. This only gives a slight improvement in F1 (4.5%) and unchanged Accuracy. The Email Address task is even harder, with 90% Precision and 100% Recall from the initial REs. We fail to improve the accuracy with only 20 human annotations in this task.
To summarize, the answer to Q1 is that it is possible to train accurate NNs with a limited number of human-generated labels and a large number of automatically generated weak labels.

Effect of the Number of Human Annotations
In Figure 2 we show how F1 varies with the size of the strongly labeled dataset, |D human | = [20, 50, 100, 200, 500, 1000, 2000, 4800]. task, and 2,000 in Email Address task.

Case Studies
From Table 3, we observe that M RE achieves higher F1 scores than RE by about 2.3% on the Date Time task and 1.6% on the Course Number task. This is a seemingly surprising outcome: an NN trained on weak labels does better on human-labeled test data than the REs used to generate the weak labels. To provide some insight, in Table 4 we show example strings on which RE and M RE disagree. Group 1 consists of 19 strings where RE is correct and M RE is not. Group 2 consists of 48 strings where RE is incorrect and M RE is correct. Looking at the examples in the first and last 2 rows of the table, we find strings that match an RE with the YYYYMMDDhhmmss format. This is an unusual form and it is to be expected that it might not always correspond to a datetime entity. We hypothesize that the neural network encountered many negative strings of a similar form in the weakly labeled data and learned that this form is not a reliable predictor of datetime.
Rows 6-8 in Table 4 illustrate the resilience of M RE to small variations from the patterns captured by the original REs. For example, despite an extra space in row 6, a missing colon in row 7, and a mixture of spoken language in row 8, M RE is able to detect those entities, while the REs are not. Rows 3-5 show cases where M RE makes mistakes, indicating that the NN is not able to learn the underlying REs with 100% accuracy.
To summarize, the answer to Q2 is that NNs appear to be more noise resilient than REs.
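The two disagreement groups used in this case study can be extracted mechanically once the human labels and both sets of predictions are available. The sketch below uses a hypothetical function name and made-up label vectors; it does not reproduce the strings in Table 4.

```python
from typing import List, Tuple


def disagreement_groups(y_true: List[int], y_re: List[int],
                        y_model: List[int]) -> Tuple[List[int], List[int]]:
    """Return indices where RE and the model disagree, split by who is correct.

    Group 1: RE is correct and the model is not.
    Group 2: RE is incorrect and the model is correct.
    """
    group1, group2 = [], []
    for i, (t, r, m) in enumerate(zip(y_true, y_re, y_model)):
        if r == m:
            continue  # both agree (both right or both wrong), not a disagreement
        (group1 if r == t else group2).append(i)
    return group1, group2


# Made-up labels for five test strings.
y_true = [1, 1, 0, 0, 1]
y_re = [1, 0, 1, 0, 0]
y_model = [1, 1, 0, 1, 0]

group1, group2 = disagreement_groups(y_true, y_re, y_model)
print(group1, group2)
```

Because the labels are binary, exactly one of the two predictors is correct whenever they disagree, so the two groups partition all disagreements.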

Impact of the Initial REs
In this subsection, we investigate the impact of the choice of REs R on the accuracy of NN models. Since Naive and M human models are not affected by R, we compare only three models in this subsection: RE, M RE and M RE+human . We select 4 out of the 25 REs in the datetime task for this study. The 4 REs are of different quality. We list the 4 REs in Figure 3. An example pattern that RE1 matches is 20180503101212, for RE2 it is 2018-05-03 10:12:12, for RE3 it is 2018-05-03T10:12:12Z in UTC time zone and for RE4 it is 2018-05-03 10:12:12+00:00 or 2018-05-03 10:12:12-00:00. We also consider the quality of NNs trained on weak labels from the whole set of 25 REs, denoted as All.
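To make the four example formats concrete, the sketch below defines one hypothetical pattern per format and checks it against the example strings quoted above. These patterns are assumptions written for illustration and are not the REs shown in the paper's figure.

```python
import re

# Hypothetical stand-ins for RE1-RE4 (not the paper's actual expressions).
HYPOTHETICAL_RES = {
    "RE1": r"\d{14}",
    "RE2": r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}",
    "RE3": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z",
    "RE4": r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}",
}

# Example strings quoted in the text for each RE.
EXAMPLES = {
    "RE1": ["20180503101212"],
    "RE2": ["2018-05-03 10:12:12"],
    "RE3": ["2018-05-03T10:12:12Z"],
    "RE4": ["2018-05-03 10:12:12+00:00", "2018-05-03 10:12:12-00:00"],
}

# Each pattern should fully match its own examples.
own_matches = {
    name: all(re.fullmatch(pattern, s) for s in EXAMPLES[name])
    for name, pattern in HYPOTHETICAL_RES.items()
}

# A bare 14-digit pattern also fires inside a script file name, which is a
# negative instance for the datetime task.
false_positive = re.search(HYPOTHETICAL_RES["RE1"], "20170925144113.js'defer>")

print(own_matches)
print(false_positive.group(0) if false_positive else None)
```

The last check is consistent with RE1 being the weakest individual RE: a run of 14 digits is easy to match but is often not a datetime mention.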
In Table 5 we compare the performance of RE and M RE for the 5 different selections of REs for weak labeling. It can be observed that RE1 is the weakest individual RE in the group with F1 = 4.83, while RE4 is the strongest with F1 = 41.86. Using all 25 REs gives the highest accuracy of F1 = 76.77. We can see that M RE closely follows the performance of RE and it is interesting to observe that M RE becomes visibly superior only with good REs.
In Figure 3, we plot the 4 accuracy metrics for model M RE+human , which was pretrained on weakly labeled data generated by 5 different choices of REs, with varying initial RE sets and sizes of human annotations. We observe a significant influence of REs on accuracy. We can also observe that as the number of strong labels grows, the impact of RE choice decreases. When the number of strong labels exceeds 1,000, the impact of the RE choice becomes negligible. In summary, to answer Q3, there is a tradeoff between creating more REs and creating more strong labels: (1) If designing a comprehensive RE takes a lot of time, a good strategy may be to take some time to construct one moderately good RE and spend more time on data labeling. (2) If the pattern is easy to describe by an RE, it may be a good strategy to spend time on creating a better set of REs and spend less time on labeling.

Conclusions
The main premise of this work is that it is practically impossible to create REs capable of identifying entities with perfect precision and recall at web scale. This paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a novel integrated framework for entity recognition in web data. The framework starts by creating or collecting existing REs for a particular entity type (e.g., emails). Those REs are then applied to a large document corpus to collect weak labels for the entity mentions, and an NN is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.
Web sources often change in ways that prevent the induced REs from extracting data correctly. At web scale, we require automated tools to maintain them. One direction of future work is to use our framework to diagnose when an RE is broken over a text stream.