Low-resource Deep Entity Resolution with Transfer and Active Learning

Entity resolution (ER) is the task of identifying different representations of the same real-world entities across databases. It is a key step for knowledge base creation and text mining. Recent adaptation of deep learning methods for ER mitigates the need for dataset-specific feature engineering by constructing distributed representations of entity records. While these methods achieve state-of-the-art performance over benchmark data, they require large amounts of labeled data, which are typically unavailable in realistic ER applications. In this paper, we develop a deep learning-based method that targets low-resource settings for ER through a novel combination of transfer learning and active learning. We design an architecture that allows us to learn a transferable model from a high-resource setting to a low-resource one. To further adapt to the target dataset, we incorporate active learning that carefully selects a few informative examples to fine-tune the transferred model. Empirical evaluation demonstrates that our method achieves comparable, if not better, performance compared to state-of-the-art learning-based methods while using an order of magnitude fewer labels.


Introduction
Entity Resolution (ER), also known as entity matching, record linkage (Fellegi and Sunter, 1969), reference reconciliation (Dong et al., 2005), and merge-purge (Hernández and Stolfo, 1995), identifies and links different representations of the same real-world entities. ER yields a unified and consistent view of data and serves as a crucial step in downstream applications, including knowledge base creation, text mining (Zhao et al., 2014), and social media analysis (Campbell et al., 2016). For instance, seen in Table 1 are citation data records from two databases, DBLP and Google Scholar. If one intends to build a system that analyzes citation networks of publications, it is essential to recognize publication overlaps across the databases and to integrate the data records (Pasula et al., 2002).
Recent work demonstrated that deep learning (DL) models with distributed representations of words are viable alternatives to other machine learning algorithms, including support vector machines and decision trees, for performing ER (Ebraheem et al., 2018; Mudgal et al., 2018). The DL models provide a universal solution to ER across all kinds of datasets that alleviates the necessity of expensive feature engineering, in which a human designer explicitly defines matching functions for every single ER scenario. However, DL is well known to be data hungry; in fact, the DL models proposed in Ebraheem et al. (2018) and Mudgal et al. (2018) achieve state-of-the-art performance by learning from thousands of labels. Unfortunately, realistic ER tasks have limited access to labeled data and would require substantial labeling effort upfront, before the actual learning of the ER models. Creating a representative training set is especially challenging in ER problems due to the data distribution, which is heavily skewed towards negative pairs (i.e., non-matches) as opposed to positive pairs (i.e., matches).
This problem limits the applicability of DL methods in low-resource ER scenarios. Indeed, we will show in a later section that the performance of DL models degrades significantly as compared to other machine learning algorithms when only a limited amount of labeled data is available. To address this issue, we propose a DL-based method that combines transfer learning and active learning.

Background and Related Work

Entity Resolution
Let $D_1$ and $D_2$ be two collections of entity records. The task of ER is to classify each entity record pair $(e_1, e_2)$, $\forall e_1 \in D_1, e_2 \in D_2$, into a match or a non-match. This is accomplished by comparing entity record $e_1$ to $e_2$ on their corresponding attributes. In this paper, we assume records in $D_1$ and $D_2$ share the same schema (set of attributes). In cases where they have different attributes, one can use schema matching techniques (Rahm and Bernstein, 2001) to first align the schemas, followed by data exchange techniques (Fagin et al., 2009). Each attribute value is a sequence of words. Table 1 shows examples of data records from an ER scenario in the citation genre, DBLP-Scholar (Köpcke et al., 2010), and illustrates the kind of datasets we assume in this paper.
Since the entire Cartesian product $D_1 \times D_2$ often becomes large and it is infeasible to run a high-recall classifier directly, we typically decompose the problem into two steps: blocking and matching. Blocking filters out obvious non-matches from the Cartesian product to obtain a candidate set. Attribute-level or record-level tf-idf and Jaccard similarity can be used as blocking criteria. For example, in the DBLP-Scholar scenario, one blocking condition could be based on applying equality on "Year". Hence, two publications in different years will be considered as obvious non-matches and filtered out from the candidate set. Then, the subsequent matching phase classifies the candidate set into matches and non-matches.
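As a minimal sketch of attribute-level blocking, the snippet below keeps only the record pairs whose titles reach a Jaccard similarity threshold. The function names, the toy records, the choice of the "title" attribute, and the threshold of 0.3 are all illustrative assumptions; they are not the blocking rules used for the benchmark datasets.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def block(d1, d2, attribute="title", threshold=0.3):
    """Keep pairs whose attribute-level Jaccard similarity reaches the threshold.

    Both the attribute and the threshold are hypothetical choices.
    """
    return [
        (e1, e2)
        for e1, e2 in product(d1, d2)
        if jaccard(e1[attribute], e2[attribute]) >= threshold
    ]


# Toy records (hypothetical, not taken from DBLP-Scholar).
d1 = [
    {"title": "efficient query processing in databases", "year": "1999"},
    {"title": "neural machine translation", "year": "2015"},
]
d2 = [
    {"title": "query processing in databases", "year": "1999"},
    {"title": "a survey of graph algorithms", "year": "2003"},
]

# Four pairs in the Cartesian product, one survivor after blocking.
candidates = block(d1, d2)
```

Enumerating the full Cartesian product, as done here, is only workable for toy inputs; practical blocking relies on indexes so that most pairs are never materialized.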

Learning-based Entity Resolution
As described above, after the blocking step, ER reduces to a binary classification task on candidate pairs of data records. Prior work has proposed learning-based methods that train classifiers, such as support vector machines, naive Bayes, and decision trees, on labeled training data (Christen, 2008; Bilenko and Mooney, 2003). These learning-based methods first extract features for each record pair from the candidate set across attributes in the schema, and use them to train a binary classifier. The process of selecting appropriate classification features is often called feature engineering, and it involves substantial human effort in each ER scenario. Recently, Ebraheem et al. (2018) and Mudgal et al. (2018) have proposed deep learning models that use distributed representations of entity record pairs for classification. These models benefit from distributed representations of words and learn complex features automatically without the need for dataset-specific feature engineering.

Deep ER Model Architecture
We describe the architecture of our DL model that classifies each record pair in the candidate set into a match or a non-match. As shown in Fig. 1, our model encompasses a sequence of steps that computes attribute representations, attribute similarity, and finally the record similarity for each input pair $(e_1, e_2)$. A matching classifier uses the record similarity representation to classify the pair. For an extensive list of the hyperparameters and training details we chose, see the appendix.
Input Representations. For each entity record pair $(e_1, e_2)$, we tokenize the attribute values and vectorize the words by external word embeddings to obtain input representations ($W$s in Fig. 1). We use the 300-dimensional fastText embeddings (Bojanowski et al., 2017), which capture subword information by producing word vectors via character n-grams. This vectorization has the benefit of representing well the out-of-vocabulary words (Bojanowski et al., 2017) that frequently appear in ER attributes. For instance, venue names SIGMOD and ACL are out of vocabulary in the publicly available GloVe vectors (Pennington et al., 2014), but we clearly need to distinguish them.
Attribute Representations. We build a universal bidirectional RNN on the word input representations of each attribute value and obtain attribute vectors ($\mathrm{attr}_1$ and $\mathrm{attr}_2$ in Fig. 1) by concatenating the last hidden units from both directions. Crucially, the universal RNN allows for transfer learning between datasets of different schemas without error-prone schema mapping. We found that gated recurrent units (GRUs, Cho et al. (2014)) yielded the best performance on the dev set as compared to simple recurrent neural networks (SRNNs, Elman (1990)) and Long Short-Term Memory networks (LSTMs, Hochreiter and Schmidhuber (1997)). We also found that using Bi-GRUs with multiple layers did not help, so we use one-layer Bi-GRUs with 150 hidden units throughout the experiments below.
Attribute Similarity. The resultant attribute representations are then used to compare attributes of each entity record pair. In particular, we compute the element-wise absolute difference between the two attribute vectors for each attribute and construct attribute similarity vectors ($\mathrm{sim}_1$ and $\mathrm{sim}_2$ in Fig. 1). We also considered other comparison mechanisms such as concatenation and element-wise multiplication, but we found that absolute difference performed best on the development set, and we report results from absolute difference.
Record Similarity. Given the attribute similarity vectors, we now combine those vectors to represent the similarity between the input entity record pair. Here, we take a simple but effective approach of adding all attribute similarity vectors (sim in Fig. 1). This way of combining vectors ensures that the final similarity vector is of the same dimensionality regardless of the number of attributes and facilitates transfer of all the subsequent parameters. For instance, the DBLP-Scholar and Cora datasets have four and eight attributes respectively, but the networks can share all weights and biases between the two. We also tried methods such as max pooling and average pooling, but none of them outperformed the simple addition method.
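The two similarity steps can be illustrated with plain Python lists. In this sketch the attribute vectors are hand-written 3-dimensional toy values standing in for the Bi-GRU outputs, and the function names are our own; the point is only that summation yields a vector of the same size for a four-attribute and an eight-attribute schema.

```python
def attribute_similarity(attr1, attr2):
    """Element-wise absolute difference between two attribute vectors."""
    return [abs(a - b) for a, b in zip(attr1, attr2)]


def record_similarity(record1, record2):
    """Sum the attribute similarity vectors of two records.

    Each record is a list of attribute vectors, aligned by attribute.
    """
    sims = [attribute_similarity(a1, a2) for a1, a2 in zip(record1, record2)]
    return [sum(components) for components in zip(*sims)]


# Toy attribute vectors (hypothetical stand-ins for Bi-GRU outputs).
four_attrs_1 = [[1.0, 0.0, 2.0]] * 4
four_attrs_2 = [[0.5, 1.0, 2.0]] * 4
eight_attrs_1 = four_attrs_1 * 2
eight_attrs_2 = four_attrs_2 * 2

sim_four = record_similarity(four_attrs_1, four_attrs_2)
sim_eight = record_similarity(eight_attrs_1, eight_attrs_2)
```

Both `sim_four` and `sim_eight` have three components, so a classifier that consumes one can consume the other without any change to its input layer.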
Matching Classification. We finally feed the similarity vector for the two records to a two-layer multilayer perceptron (MLP) with highway connections (Srivastava et al., 2015) and classify the pair into a match or a non-match ("Matching Classifier" in Fig. 1). The output from the final layer of the MLP is a two-dimensional vector, and we normalize it by the softmax function to obtain a probability distribution. We will discuss dataset adaptation for transfer learning in the next section.
Training Objectives. We train the networks to minimize the negative log-likelihood loss. We use the Adam optimization algorithm (Kingma and Ba, 2015) with batch size 16 and an initial learning rate of 0.001, and after each epoch we evaluate our model on the dev set. Training terminates after 20 epochs, and we choose the model that yields the best F1 score on the dev set and evaluate the model on the test data.
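For concreteness, the softmax normalization and the negative log-likelihood of a single pair can be computed as follows. This is a standalone sketch, not the training code: the index convention (0 for non-match, 1 for match) and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(logits, label):
    """Negative log-likelihood of the gold label.

    Assumed convention: index 0 is non-match, index 1 is match.
    """
    return -math.log(softmax(logits)[label])


# Equal logits give a uniform distribution and a loss of log(2).
probs = softmax([0.0, 0.0])
loss_uncertain = nll_loss([0.0, 0.0], 1)

# A confident, correct prediction gives a loss close to zero.
loss_confident = nll_loss([-2.0, 2.0], 1)
```

Averaging this quantity over a batch gives the loss that the optimizer minimizes.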

Deep Transfer Active Learning for ER
We introduce two orthogonal frameworks for our deep ER models in low-resource settings: transfer learning and active learning. We also introduce the notion of likely false positives and likely false negatives, and provide a principled active labeling method in the context of deep ER models, which contributes to stable and high performance.

Adversarial Transfer Learning
The architecture described above allows for simple transfer learning: we can train all parameters in the network on source data and use them to classify a target dataset. However, this method of transfer learning can suffer from dataset-specific properties. For example, the author attribute in the DBLP-ACM dataset contains first names while that in the DBLP-Scholar dataset only has first initials. In such situations, it becomes crucial to construct network representations that are invariant with respect to idiosyncratic properties of datasets. To this end, we apply the technique of dataset (domain) adaptation developed in image recognition (Ganin and Lempitsky, 2015). In particular, we build a dataset classifier with the same architecture as the matching classifier ("Dataset Classifier" in Fig. 1) that predicts which dataset the input pair comes from. We replace the training objective by the sum of the negative log-likelihood losses from the two classifiers. We add a gradient reversal layer between the similarity vector and the dataset classifier so that the parameters in the dataset classifier are trained to predict the dataset while the rest of the network is trained to mislead the dataset classifier, thereby developing dataset-independent internal representations. Crucially, with dataset adaptation, we feed pairs from the target dataset as well as the source to the network. For the pairs from the target, we disregard the loss from the matching classifier.
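One way to write down the effect of the gradient reversal layer is sketched below. The notation is ours and is not taken from the paper: $\theta_f$ denotes the parameters that produce the similarity vector, $\theta_m$ and $\theta_d$ those of the matching and dataset classifiers, $\mathcal{L}_m$ and $\mathcal{L}_d$ their negative log-likelihood losses, and $\eta$ a step size. The updates are shown as plain gradient steps for readability (the experiments use Adam), and the reversal weight is assumed to be 1.

```latex
\begin{aligned}
\mathcal{L}(\theta_f, \theta_m, \theta_d)
  &= \mathcal{L}_m(\theta_f, \theta_m) + \mathcal{L}_d(\theta_f, \theta_d), \\
\theta_m &\leftarrow \theta_m - \eta \, \frac{\partial \mathcal{L}_m}{\partial \theta_m},
\qquad
\theta_d \leftarrow \theta_d - \eta \, \frac{\partial \mathcal{L}_d}{\partial \theta_d}, \\
\theta_f &\leftarrow \theta_f - \eta \left(
  \frac{\partial \mathcal{L}_m}{\partial \theta_f}
  - \frac{\partial \mathcal{L}_d}{\partial \theta_f}
\right).
\end{aligned}
```

The sign flip in the last line is what the reversal layer contributes: the dataset classifier descends on $\mathcal{L}_d$ while the shared layers ascend on it. Under this reading, $\mathcal{L}_m$ is computed on source pairs only, whereas $\mathcal{L}_d$ is computed on pairs from both datasets.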

Active Learning
Since labeling a large number of pairs for each ER scenario clearly does not scale, prior work in ER has adopted active learning as a more guided approach to select examples to label (Tejada et al., 2001; Sarawagi and Bhamidipaty, 2002; Arasu et al., 2010; de Freitas et al., 2010; Isele and Bizer, 2013; Qian et al., 2017).
Designing an effective active learning algorithm for deep ER models is particularly challenging because finding informative examples is very difficult (especially for positive examples due to the extremely low matching ratio in realistic ER tasks), and we need more than a handful of both negative and positive examples in order to tune a deep ER model with many parameters.
To address this issue, we design an iterative active learning algorithm (Algorithm 1) that searches for two different types of examples from unlabeled data in each iteration: (1) uncertain examples including likely false positives and likely false negatives, which will be labeled by human annotators; (2) high-confidence examples including high-confidence positives and high-confidence negatives. We will not label high-confidence examples and use predicted labels as a proxy. We will show below that those carefully selected examples serve different purposes.
Uncertain examples and high-confidence examples are characterized by the entropy of the conditional probability distribution given by the current model. Let $K$ be the sampling size and let the unlabeled dataset consisting of candidate record pairs be $D^U = \{x_i\}$. Denote the probability that record pair $x_i$ is a match according to the current model by $p(x_i)$. Then, the conditional entropy of the pair $H(x_i)$ is computed by:

$$H(x_i) = -p(x_i) \log p(x_i) - (1 - p(x_i)) \log (1 - p(x_i))$$

Uncertain examples and high-confidence examples are associated with high and low entropy, respectively.
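As a small numerical illustration, the entropy of a two-way match/non-match prediction can be computed from the match probability alone. The function name and the example probabilities are ours; natural logarithms are assumed, which only changes the scale of the values.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy of a match/non-match prediction with match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # limit of -p*log(p) as p approaches 0 or 1
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


# A pair the model cannot decide on has maximal entropy, log(2).
h_uncertain = binary_entropy(0.5)

# A pair the model is nearly sure about has entropy close to zero.
h_confident = binary_entropy(0.99)
```

Entropy is symmetric in $p$ and $1 - p$, so by itself it does not say whether an uncertain pair leans toward a match or a non-match.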
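Given this notion of uncertainty and high confidence, one can simply select record pairs with top $K$ entropy as uncertain examples and those with bottom $K$ entropy as high-confidence examples. Namely, take

$$\operatorname*{arg\,max}_{D \subseteq D^U,\ |D| = K} \sum_{x \in D} H(x), \qquad \operatorname*{arg\,min}_{D \subseteq D^U,\ |D| = K} \sum_{x \in D} H(x)$$

as sets of uncertain and high-confidence examples respectively. However, these simple criteria can introduce an unintended bias toward a certain direction, resulting in unstable performance. For example, uncertain examples selected solely on the basis of entropy can sometimes contain substantially more negative examples than positive ones, leading the network to a solution with low recall. To address this instability problem, we propose a partition sampling mechanism. We first partition the unlabeled data $D^U$ into two subsets, $\overline{D^U}$ and $\underline{D^U}$, consisting of pairs that the model predicts as matches and non-matches respectively. Namely,

$$\overline{D^U} = \{x \in D^U \mid p(x) \geq 0.5\}, \qquad \underline{D^U} = \{x \in D^U \mid p(x) < 0.5\}.$$

We then pick $k = K/2$ examples from each subset with respect to entropy. Uncertain examples are now

$$\operatorname*{arg\,max}_{D \subseteq \overline{D^U},\ |D| = k} \sum_{x \in D} H(x), \qquad \operatorname*{arg\,max}_{D \subseteq \underline{D^U},\ |D| = k} \sum_{x \in D} H(x),$$

where the two criteria select likely false positives and likely false negatives respectively. High-confidence examples are identified by

$$\operatorname*{arg\,min}_{D \subseteq \overline{D^U},\ |D| = k} \sum_{x \in D} H(x), \qquad \operatorname*{arg\,min}_{D \subseteq \underline{D^U},\ |D| = k} \sum_{x \in D} H(x),$$

where the two criteria correspond to high-confidence positives and high-confidence negatives respectively. These sampling criteria equally partition uncertain examples and high-confidence examples into different categories. We will show that the partition mechanism contributes to stable and better performance in a later section.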
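The partition sampling idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the model is represented only by a list of match probabilities, a pair counts as a predicted match when its probability is at least 0.5, and `partition_sample` and its dictionary keys are hypothetical names.

```python
import math


def entropy(p):
    """Binary entropy of a match probability (natural log)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def partition_sample(match_probs, k):
    """Return indices of uncertain and high-confidence pairs, k per partition."""
    # Assumed decision threshold of 0.5 for the two-way prediction.
    predicted_pos = [i for i, p in enumerate(match_probs) if p >= 0.5]
    predicted_neg = [i for i, p in enumerate(match_probs) if p < 0.5]

    def by_entropy(indices, reverse):
        ranked = sorted(indices, key=lambda i: entropy(match_probs[i]), reverse=reverse)
        return ranked[:k]

    return {
        # Highest entropy among predicted matches / non-matches: sent to annotators.
        "likely_false_positives": by_entropy(predicted_pos, reverse=True),
        "likely_false_negatives": by_entropy(predicted_neg, reverse=True),
        # Lowest entropy: the predicted label is used as a proxy label.
        "high_confidence_positives": by_entropy(predicted_pos, reverse=False),
        "high_confidence_negatives": by_entropy(predicted_neg, reverse=False),
    }


# Toy match probabilities for six unlabeled pairs (hypothetical values).
probs = [0.98, 0.55, 0.70, 0.45, 0.02, 0.30]
picked = partition_sample(probs, k=1)
```

With these toy probabilities, the pair at 0.55 is the likely false positive and the pair at 0.45 the likely false negative, while the pairs at 0.98 and 0.02 are the high-confidence positive and negative. A fuller version would remove the uncertain pairs from the pool before picking high-confidence ones, so that a small partition cannot contribute the same pair twice.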

Algorithm 1 Deep Transfer Active Learning
Require: Unlabeled data $D^U$, sampling size $K$, batch size $B$, max. iteration number $T$, max. number of epochs $I$.
Select $k$ high-confidence positives and $k$ high-confidence negatives from $D^U$ and add them with positive and negative labels to $D^L$.
5: for

Experimental Setup
For all datasets, we first conduct blocking to reduce the Cartesian product to a candidate set. Then, we randomly split the candidate set into training, development, and test data with a ratio of 3:1:1. For the datasets used in Mudgal et al. (2018) (DBLP-ACM, DBLP-Scholar, Fodors-Zagats, and Amazon-Google), we adopted the same feature-based blocking strategies and random splits to ensure comparability with the state-of-the-art method. The candidate set of Cora was obtained by randomly sampling 50,000 pairs from the result of the Jaccard similarity-based blocking strategy described in Wang et al. (2011). The candidate set of Zomato-Yelp was taken from Das et al. (2016). All dataset statistics are given in Table 2. For evaluation, we compute precision, recall, and F1 score on the test sets. In the active learning experiments, we hold out the test sets a priori and sample solely from the training data to ensure fair comparison with non-active learning methods. The sampling size K for active learning is 20. As preprocessing, we tokenize with NLTK (Bird et al., 2009) and lowercase all attribute values. For every configuration, we run experiments with 5 random initializations and report the average. Our DL models are all implemented using the publicly available deepmatcher library.
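The evaluation metrics are computed with respect to the match class. A minimal sketch follows; the label encoding (1 for match, 0 for non-match) and the toy label lists are assumptions made for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the match class (label 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy test set skewed toward non-matches (hypothetical labels).
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, predicted)

# Predicting "non-match" everywhere is right on 5 of 8 pairs but scores zero F1.
_, _, f1_all_negative = precision_recall_f1(gold, [0] * len(gold))
```

Because candidate sets are skewed toward non-matches, F1 on the match class is more informative than accuracy, as the all-negative example shows.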

Baselines
We establish baselines using a state-of-the-art learning-based ER package, Magellan (Konda et al., 2016). We experimented with the following 6 learning algorithms: Decision Tree, SVM, Random Forest, Naive Bayes, Logistic Regression, and Linear Regression. We use the same feature set as in Mudgal et al. (2018). See the appendix for extensive lists of the features chosen.

Results and Discussions
Model Performance and Data Size. Fig. 2 shows the F1 performance of different models with varying data size on DBLP-ACM. The DL model improves dramatically as the data size increases and achieves the best performance among the 7 models when 7000 training examples are available. In contrast, the other models suffer much less from data scarcity, with the exception of Random Forest. We observed similar patterns in DBLP-Scholar and Cora. These results confirm our hypothesis that deep ER models are data-hungry and require a lot of labeled data to perform well.
Transfer Learning. Table 3 shows results from our transfer learning framework when used in isolation (i.e., without active learning, which we will discuss shortly). Our dataset adaptation method substantially improves performance when the target is DBLP-Scholar (from 41.03 to 53.84 F1 points) or Cora (from 38.3 to 43.13 F1 points) and achieves the same level of performance on DBLP-ACM. Transfer learning with our dataset adaptation technique achieves a certain level of performance without any target labels, but we still observe high variance in performance (e.g., a standard deviation of 6.21 in DBLP-Scholar) and a large gap between transfer learning and training directly on the target dataset. To build a reliable and stable ER model, a certain amount of target labels may be necessary, which leads us to apply our active learning framework.
Active Learning. Fig. 3 shows results from our active learning as well as from the 7 algorithms trained on randomly sampled labeled examples of corresponding size (we average the results over 5 random samplings). Deep transfer active learning (DTAL) initializes the network parameters by transfer learning whereas deep active learning (DAL) starts with a random initialization. We can observe that DTAL models remedy the data scarcity problem as compared to DL models with random sampling in all three datasets. DAL can achieve performance competitive with DTAL, but it converges more slowly.
Table 4 compares the performance of different algorithms in low-resource and high-resource settings. (We only show the SVM results since SVM performed best in each configuration among the 6 non-DL algorithms.) First, DTAL achieves the best performance in the low-resource setting of each dataset. In particular, DTAL outperforms the others to the greatest degree in Cora (97.68 F1 points), probably because Cora is the most complex dataset with 8 attributes in the schema. Non-DL algorithms require many interaction features, which lead to data sparsity. DAL also outperforms SVM and yields comparable performance to DTAL. However, the standard deviations in performance of DAL are substantially higher than those of DTAL (e.g., 4.15 vs. 0.33 in DBLP-ACM), suggesting that transfer learning provides useful initializations for active learning to achieve stable performance.
One can argue that DTAL performs best in the low-resource scenario, but the other algorithms can also boost their low-resource performance by active learning. While there are many approaches to active learning on feature-based (non-DL) ER (e.g. Bellare et al. (2012); Qian et al. (2017)) that yield strong performance under certain conditions, it requires further research to quantify how these methods perform with varying datasets, genres, and blocking functions. It should be noted, however, that in DBLP-Scholar and Cora, DTAL in the low-resource setting even significantly outperforms SVM (and the other 5 algorithms) in the high-resource scenario. These results imply that DTAL would significantly outperform SVM with active learning in the low-resource setting since the performance with the full training data with labels serves as an upper bound. Moreover, we can observe that DTAL with a limited amount of data (less than 6% of training data in all datasets) performs comparably to DL models with full training data. Therefore, we have demonstrated that a deep ER system with our transfer and active learning frameworks can provide a stable and reliable solution to entity resolution with low annotation effort.
Other Genre Results. We present results from the restaurant and software genres. Table 5 shows results of transfer and active learning from Zomato-Yelp to Fodors-Zagats. As in our extensive experiments in the citation genre, the dataset adaptation technique facilitates transfer learning significantly, and only 100 active learning labels are needed to achieve the same performance as the model trained with all target labels (894 labels). Fig. 4 shows low-resource performance in the software genre. The relative performance among the 6 non-DL approaches differs to a great degree, as the best non-DL model is now logistic regression, but deep active learning outperforms the rest with 1200 labeled examples (10.4% of training data). These results illustrate that our low-resource frameworks are effective in other genres as well.
Active Learning Sampling Strategies. As discussed in a previous section, we adopted high-confidence sampling and a partition mechanism for our active learning. Here we analyze the effect of the two methods. Table 6 shows deep transfer active learning performance in DBLP-ACM with varying sampling strategies. We can observe that high-confidence sampling and the partition mechanism contribute to high and stable performance as well as good precision-recall balance. Notice that there is a huge jump in recall by adding partition while precision stays the same (row 4 to row 3). This is due to the fact that the partition mechanism succeeds in finding more false negatives. The breakdown of labeled examples (Table 7) shows that this is indeed the case. It is noteworthy that the partition mechanism lowers the ratio of misclassified examples (FP+FN) in the labeled sample set because partitioning encourages us to choose likely false negatives more aggressively, yet false negatives tend to be more challenging to find in entity resolution due to the skewness toward the negative (Qian et al., 2017). We observed similar patterns in DBLP-Scholar and Cora.

Further Related Work
Transfer learning has proven successful in fields such as computer vision and natural language processing, where networks for a target task are pretrained on a source task with plenty of training data (e.g. image classification (Donahue et al., 2014) and language modeling (Peters et al., 2018)). In this work, we developed a transfer learning framework for a deep ER model. Concurrent work to ours has also proposed transfer learning on top of the features from distributed representations, but they focused on classical machine learning classifiers (e.g., logistic regression, SVMs, decision trees, random forests) and they did not consider active learning. Their distributed representations are computed in a "bag-of-words" fashion, which can make applications to textual attributes more challenging (Mudgal et al., 2018). Moreover, their method breaks attribute boundaries for tuple representations in contrast to our approach that computes a similarity vector for each attribute in an attribute-agnostic manner. In a complex ER scenario, each entity record is represented by a large number of attributes, and comparing tuples as a single string can be infeasible. Other prior work also proposed a transfer learning framework for linear model-based learners in ER (Negahban et al., 2012).

Table 7: Breakdown of 300 labeled samples (uncertain samples) from deep transfer active learning in DBLP-ACM. Part, FP, TP, FN, and TN denote the partition mechanism, false positives, true positives, false negatives, and true negatives respectively.

Conclusion
We presented transfer learning and active learning frameworks for entity resolution with deep learning and demonstrated that our models can achieve competitive, if not better, performance as compared to state-of-the-art learning-based methods while only using an order of magnitude less labeled data. Although our transfer learning alone did not suffice to construct a reliable and stable entity resolution system, it contributed to faster convergence and stable performance when used together with active learning. These results serve as further support for the claim that deep learning can provide a unified data integration method for downstream NLP tasks. Our frameworks of transfer and active learning for deep learning models are potentially applicable to low-resource settings beyond entity resolution.