Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning

In this work, we introduce a bootstrapped, iterative NER model that integrates a PU learning algorithm for recognizing named entities in a low-resource setting. Our approach combines dictionary-based labeling with syntactically-informed label expansion to efficiently enrich the seed dictionaries. Experimental results on a dataset of manually annotated e-commerce product descriptions demonstrate the effectiveness of the proposed framework.


Introduction
The vast majority of existing named entity recognition (NER) methods focus on a small set of prominent entity types, such as persons, organizations, diseases, and genes, for which labeled datasets are readily available (Tjong Kim Sang and De Meulder, 2003; Smith et al., 2008; Weischedel et al., 2011; Li et al., 2016). There is a marked lack of studies in many other domains, such as e-commerce, and for novel entity types, e.g. products and components.
The lack of annotated datasets in the e-commerce domain makes it hard to apply supervised NER methods. An alternative approach is to use dictionaries (Nadeau et al., 2006; Yang et al., 2018), but freely available knowledge resources, e.g. Wikidata (Vrandečic and Krötzsch, 2014) or YAGO (Suchanek et al., 2007), contain only very limited information about e-commerce entities. Manually creating a dictionary of sufficient quality and coverage would be prohibitively expensive. This problem is amplified by the fact that in the e-commerce domain, entities are frequently expressed as complex noun phrases instead of proper names. Product and component category terms are often combined with brand names, model numbers, and attributes ("hard drive" → "SSD hard drive" → "WD Blue 500 GB SSD hard drive"), which are almost impossible to enumerate exhaustively. In such a low-coverage setting, a simple dictionary-based approach would result in very low recall, and would yield very noisy labels when used as a source of labels for a supervised machine learning algorithm. To address the drawbacks of dictionary-based labeling, Peng et al. (2019) propose a positive-unlabeled (PU) NER approach that labels positive instances using a seed dictionary, but makes no label assumptions for the remaining tokens (Bekker and Davis, 2018). The authors validate their approach on the CoNLL, MUC and Twitter datasets for standard entity types, but it is unclear how their approach transfers to the e-commerce domain and its entity types.
Contributions. We adapt the PU algorithm of Peng et al. (2019) to the domain of consumer electronic product descriptions, and evaluate its effectiveness on four entity types: Product, Component, Brand and Attribute. Our algorithm bootstraps NER with a seed dictionary, iteratively labels more data and expands the dictionary, while accounting for accumulated errors from model predictions. During labeling, we utilize dependency parsing to efficiently expand dictionary matches in text. Our experiments on a novel dataset of product descriptions show that this labeling mechanism, combined with a PU learning strategy, consistently improves F1 scores over a standard BiLSTM classifier. Iterative learning quickly expands the dictionary, and further improves model performance. The proposed approach exhibits much better recall than the baseline model, and generalizes better to unseen entities.
Algorithm 1: Iterative Bootstrapping NER
Input: Dictionary D_seed, Corpus C, threshold K, max iterations I
Result: Dictionary D+, Classifier L
D+ ← D_seed; C_dep ← dependency_parse(C); i ← 0;
while not converged(D+) and i < I do
    C_lab ← label(C, D+);
    C_exp ← expand_labels(C_lab, C_dep);
    L ← train_classifier(C_exp);
    C_pred ← predict(C_exp, L);
    for e ← C_pred do
        if e ∉ D+ and freq(e) > K then
            D+ ← add_entity(D+, e);
        end
    end
    i ← i + 1;
end

NER with Positive Unlabeled Learning
In this section, we first describe the iterative bootstrapping process, followed by our approach to positive unlabeled learning for NER (PU-NER).

Iterative Bootstrapping
The goal of iterative bootstrapping is to successively expand a seed dictionary of entities to label an existing training dataset, improving the quality and coverage of labels in each iteration (see Algorithm 1). In the first step, we use the seed dictionary to assign initial labels to each token. We then utilize the dependency parses of sentences to label tokens in a "compound" relation with already labeled tokens (see Figure 1). In the example, "hard drive" is labeled as a Component based on the initial seed dictionary, and according to its dependency parse it has a "compound" relation with "dock", which is therefore also labeled as a Component. We employ an IO label scheme, because dictionary entries are often more generic than the specific matches in text (see the previous example), which would lead to erroneous tags with schemes such as BIO.
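The labeling step described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the hand-written dependency arcs, and the choice to propagate labels across "compound" arcs in both directions until nothing changes are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (dependent index, head index, relation); hypothetical format


def label_with_dictionary(tokens: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Assign IO labels: the entity type for dictionary matches, 'O' elsewhere."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer dictionary entries first so they are not blocked by shorter ones.
    entries = sorted(dictionary.items(), key=lambda kv: -len(kv[0].split()))
    for entry, entity_type in entries:
        words = entry.lower().split()
        n = len(words)
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if lowered[start:start + n] == words and all(labels[j] == "O" for j in span):
                for j in span:
                    labels[j] = entity_type
    return labels


def expand_compounds(labels: List[str], arcs: List[Arc]) -> List[str]:
    """Copy labels across 'compound' arcs until no unlabeled neighbor remains (assumed behavior)."""
    labels = list(labels)
    changed = True
    while changed:
        changed = False
        for dep, head, rel in arcs:
            if rel != "compound":
                continue
            for src, dst in ((dep, head), (head, dep)):
                if labels[src] != "O" and labels[dst] == "O":
                    labels[dst] = labels[src]
                    changed = True
    return labels


# Hand-written example; the arcs are made up and do not come from a real parser.
tokens = ["USB", "hard", "drive", "dock", "included"]
dictionary = {"hard drive": "Component"}
arcs = [(0, 3, "compound"), (1, 2, "compound"), (2, 3, "compound"), (3, 4, "nsubj")]

seed_labels = label_with_dictionary(tokens, dictionary)
expanded_labels = expand_compounds(seed_labels, arcs)
print(seed_labels)
print(expanded_labels)
```

Because the scheme is IO rather than BIO, the expanded span needs no boundary tags: every token in "USB hard drive dock" simply carries the Component label.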
In the second step, we train an NER model on the training dataset with the newly assigned labels. We repeat these steps at most I times, and in each subsequent iteration we use the trained model to predict new token-level labels on the training data. Novel entities predicted more than K times are included in the dictionary for the next labeling step. The threshold K ensures that we do not introduce noise into the dictionary through spurious positively labeled entities.
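The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `toy_train_and_predict` is a hypothetical stand-in for the train-and-predict steps (it "predicts" any token adjacent to a dictionary token), convergence is taken to mean that an iteration adds no new entries, and the corpus is invented.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[str, str]  # (surface form, entity type)


def bootstrap(
    seed: Dict[str, str],
    corpus: List[List[str]],
    train_and_predict: Callable[[List[List[str]], Dict[str, str]], List[Prediction]],
    k: int,
    max_iters: int,
) -> Dict[str, str]:
    """Expand the dictionary with entities predicted more than k times."""
    dictionary = dict(seed)
    for _ in range(max_iters):
        counts = Counter(train_and_predict(corpus, dictionary))
        new_entries = {
            surface: entity_type
            for (surface, entity_type), freq in counts.items()
            if freq > k and surface not in dictionary
        }
        if not new_entries:  # assumed convergence criterion: nothing new was added
            break
        dictionary.update(new_entries)
    return dictionary


def toy_train_and_predict(corpus, dictionary):
    """Hypothetical stand-in for a classifier: predicts tokens next to a dictionary token."""
    predictions = []
    for sentence in corpus:
        for i, token in enumerate(sentence):
            for j in (i - 1, i + 1):
                if 0 <= j < len(sentence) and sentence[j] in dictionary:
                    predictions.append((token, dictionary[sentence[j]]))
    return predictions


corpus = [
    ["ssd", "drive"], ["ssd", "drive"],
    ["ssd", "enclosure"], ["ssd", "enclosure"],
    ["rare", "drive"],
]
result = bootstrap({"drive": "Component"}, corpus, toy_train_and_predict, k=1, max_iters=10)
print(result)
```

In this toy run, "ssd" enters the dictionary in the first iteration and "enclosure" only in the second, once "ssd" is available as a match, while "rare" is predicted only once per iteration and is filtered out by the threshold.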

PU-NER Model
As shown in Figure 2, our model first uses BERT (Devlin et al., 2018) to encode the sub-word tokenized input text into a sequence of contextualized token representations {z_1, ..., z_L}, followed by a bidirectional LSTM (Lample et al., 2016) layer to model further interactions between tokens. Similar to Devlin et al. (2018), we treat NER as a token-level classification task, without using a CRF to model dependencies between entity labels. We use the vector associated with the first sub-word token in each word as the input to the entity classifier, which consists of a feedforward neural network with a single projection layer. We use backpropagation to update the trainable parameters of the Bi-LSTM and the final classifier, without fine-tuning the entire BERT model.
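Selecting the first sub-word vector of each word can be illustrated as follows. The sketch assumes BERT-style WordPiece output in which continuation pieces carry a "##" prefix; the particular split of the example tokens and the two-dimensional vectors are made up, and special tokens such as [CLS] are left out.

```python
from typing import List


def first_subword_indices(word_pieces: List[str]) -> List[int]:
    """Return the positions of pieces that begin a word (no '##' continuation prefix)."""
    return [i for i, piece in enumerate(word_pieces) if not piece.startswith("##")]


# Hypothetical word pieces for "8 Mhz processor" and toy encoder outputs, one vector per piece.
word_pieces = ["8", "mh", "##z", "processor"]
piece_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

indices = first_subword_indices(word_pieces)
word_vectors = [piece_vectors[i] for i in indices]
print(indices)
print(word_vectors)
```

Only one vector per word reaches the classifier, so the representation of the "##z" piece is discarded in this toy example.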
Dictionary-based labeling achieves high precision on the matched entities but low recall. This fits the positive unlabeled setting (Elkan and Noto, 2008), which assumes that a learner only has access to positive examples and unlabeled data. Thus, we consider all tokens matched by the dictionary as positive, and consider all other tokens to be unlabeled. The goal of PU learning is then to estimate the true risk while accounting for the expected number of positive examples remaining in the unlabeled data. We define the empirical risk as $\hat{R}_l = \frac{1}{n} \sum_{i=1}^{n} l(\hat{y}_i, y_i)$ and assume the class priors to be equal to the real distribution of examples in the data, $\pi_p = P(Y = 1)$ and $\pi_n = P(Y = 0)$. As the model tends to predict the positive labels correctly during training, $l(\hat{y}_i^p, 1)$ declines to a small value. We follow Peng et al. (2019) and combine risk estimation with a non-negative constraint.

Dataset
E-commerce covers a wide range of complex entity types. In this work, we focus on electronic products, e.g. personal computers, mobile phones, and related hardware, and define the following entity types:
Products are electronic consumer devices such as mobiles, laptops, and PCs. Products may be preceded by a brand and include some form of model, year, or version specification, e.g. "Galaxy S8" or "Dell Latitude 6400 multimedia notebook".
Components are parts of a product, typically with a physical aspect, e.g. "battery", or "multimedia keyboard". Non-physical product features and software, such as "Toshiba Face Recognition Software" or "Windows 7", are not considered components.
Brand refers to producers of a product or component, e.g. "Samsung", or "Dell".
Attributes are units associated with components, e.g. size ("4 TB"), or weight ("3 kg").
To create our evaluation dataset, we use the Amazon review dataset (McAuley et al., 2015), a collection of product metadata and customer reviews from Amazon. The metadata includes product title, a descriptive text, category information, price, brand, and image features. We use only entries in the Electronics/Computers subcategory and randomly sample product descriptions of length 500-1000 characters, yielding a dataset of 24,272 training documents. We randomly select another 100 product descriptions to form the final test set. These are manually annotated by 2 trained linguists, with disagreements resolved by a third expert annotator. Token-level inter-annotator agreement was

Experiments
To evaluate our proposed model (PU), we compare it against two baselines: (1) dictionary-only labeling (Dictionary), and (2) our model with standard cross-entropy loss instead of the PU learning risk (BiLSTM). The BiLSTM model is trained in a supervised fashion, treating all tokens not matched by the dictionary as negative. The BiLSTM and PU models were implemented using AllenNLP (Gardner et al., 2018). We use spaCy for preprocessing, dependency parsing, and dictionary-based entity labeling. We manually define seed dictionaries for Product (6 entries), Component (60 entries) and Brand (13 entries). For Attributes, we define a set of 8 regular expressions to pre-label the dataset. Following previous work, we evaluate model performance using token-level F1 score.
There are two options to estimate the value of the class prior π_p. One approach is to treat π_p as a hyperparameter which is fixed during training. Another option is suggested by Peng et al. (2019), who specify an initial value for π_p to start bootstrapping, but recalculate π_p after several train-relabel steps based on the predicted entity type distribution. In our work, we treat π_p as a fixed hyperparameter with a value of π_p = 0.01.
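For reference, token-level F1 for a single entity type can be computed as in the sketch below. The gold and predicted label sequences are invented, and how per-type scores are combined into an average is not specified here, so the function only covers the per-type case.

```python
from typing import List


def token_f1(gold: List[str], pred: List[str], entity_type: str) -> float:
    """Token-level F1 for one entity type, computed from aligned IO label sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == entity_type and p == entity_type)
    fp = sum(1 for g, p in pairs if g != entity_type and p == entity_type)
    fn = sum(1 for g, p in pairs if g == entity_type and p != entity_type)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: one Component token and one Product token are missed.
gold = ["Brand", "Product", "Product", "O", "Component", "Component"]
pred = ["Brand", "Product", "O", "O", "Component", "O"]

for entity_type in ("Brand", "Product", "Component", "Attribute"):
    print(entity_type, round(token_f1(gold, pred, entity_type), 4))
```

A missed token lowers recall but not precision, which is the pattern that dictionary-only labeling with few entries tends to produce.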
We run our bootstrapping approach for I = 10 iterations, and report the F1 score of the best iteration. Table 1 shows the F1 scores of several model ablations by entity type on our test dataset. From the table, we can observe: 1) The PU algorithm outperforms the simpler models for most classes, which demonstrates the effectiveness of the PU learning framework for NER in our domain. 2) Dependency parsing is a very effective feature for Component and Product, and it strongly improves the overall F1 score. 3) The iterative training strategy yields a significant improvement for most classes. Even after several iterations, it still finds new entries to expand the dictionaries (Figure 3).

Results and Discussion
Table 1: Token-level F1 scores on the test set. The unmodified PU algorithm achieves an average F1 score of 68.95%. Integrating dependency parsing (Dep) and iterative relabeling (Iter) raises the F1 score to 72.02%, an improvement of 42.08% over a dictionary-only approach, and 3.63% over a BiLSTM baseline.
The Dictionary approach shows poor performance on average, which is due to the low recall caused by the very limited number of entities in the dictionary. PU greatly outperforms the dictionary approach, and has an edge in F1 score over the BiLSTM model. The advantages of PU gradually accumulate with each iteration. For Product, the combination of PU learning, dependency parsing-based labeling, and iterative bootstrapping yields a 7% improvement in F1 score; for Component, the improvement is still 5%.
PU Learning Performance. Figure 3 shows that the PU algorithm especially improves recall over the baseline classifier for Components, Products and Brands. With each iteration step, the PU model is increasingly better able to predict unseen entities, and achieves higher recall scores than the BiLSTM model. While the baseline curve on Brands stays almost flat during iterations, PU consistently improves recall as new entities are added to the dictionary. For Attributes, however, both models exhibit about the same level of recall, which in addition is largely unaffected by the number of iterations. The recall gains on the other entity types suggest that PU learning better estimates the true loss of the model. In a fully supervised setting, a standard classification loss function can accurately describe the loss on positive and negative samples. However, in the positive unlabeled setting, many unlabeled samples may actually be positive, and therefore the computed loss should not strongly push the model towards the negative class. We therefore want to quantify how much the loss is overestimated due to false negative samples, so that we can appropriately reduce this loss using the estimated real class distribution.
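The idea of discounting the loss on unlabeled tokens can be made concrete with a non-negative PU risk estimator. The sketch below is an illustration under assumptions, not the authors' implementation: it uses the commonly cited form π_p · R_p^+ + max(0, R_u^- − π_p · R_p^-), absolute error as the per-token loss, and invented probabilities; the exact loss and weighting used in the experiments may differ.

```python
from typing import List


def mean_loss(probs: List[float], target: float) -> float:
    """Mean absolute error between predicted positive-class probabilities and a target label."""
    return sum(abs(p - target) for p in probs) / len(probs)


def non_negative_pu_risk(pos_probs: List[float], unl_probs: List[float], prior: float) -> float:
    """Non-negative PU risk: prior * R_p^+ + max(0, R_u^- - prior * R_p^-).

    pos_probs: predicted P(positive) for dictionary-matched (positive) tokens.
    unl_probs: predicted P(positive) for unlabeled tokens.
    prior:     assumed class prior pi_p = P(Y = 1).
    """
    risk_pos_as_pos = mean_loss(pos_probs, 1.0)  # R_p^+
    risk_pos_as_neg = mean_loss(pos_probs, 0.0)  # R_p^-
    risk_unl_as_neg = mean_loss(unl_probs, 0.0)  # R_u^-
    # Subtract the share of the unlabeled loss attributable to hidden positives,
    # and clamp at zero so the estimated negative risk cannot become negative.
    negative_part = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + negative_part


pos_probs = [0.9, 0.8]
unl_probs = [0.1, 0.2, 0.7]  # the 0.7 token may be an unlabeled positive

risk_low_prior = non_negative_pu_risk(pos_probs, unl_probs, prior=0.3)
risk_high_prior = non_negative_pu_risk(pos_probs, unl_probs, prior=0.5)
print(round(risk_low_prior, 4), round(risk_high_prior, 4))
```

With the larger prior, the correction term exceeds the unlabeled loss and the clamp sets the negative part to zero, so the model is not penalized for assigning a high probability to the unlabeled token.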
Error Analysis. Both PU and the baseline model in some cases have difficulties predicting Attributes correctly. This can be due to spelling differences between train and test data (e.g. "8 Mhz" vs "8Mhz"), but also because of unclean texts in the source documents. Another source of errors is the fixed word piece vocabulary of the pre-trained BERT model, which often splits unit terms such as "Mhz" into several word pieces. Since we use only the first word piece of a token for prediction, signals important for predicting the Attribute class may get lost. This suggests that for technical domains with very specific vocabulary, tokenization plays an important role in allowing the model to represent the meaning of each word piece.

Related work
Recent work on positive-unlabeled learning in the area of NLP includes deceptive review detection (Ren et al., 2014), keyphrase extraction (Sterckx et al., 2016) and fact check-worthiness detection (Wright and Augenstein, 2020); see also Bekker and Davis (2018) for a survey. Our approach extends the work of Peng et al. (2019) to a novel domain and to challenging entity types. In the area of NER for e-commerce, Putthividhya and Hu (2011) present an approach to extract product attributes and values from product listing titles. Zheng et al. (2018) formulate missing attribute value extraction as a sequence tagging problem, and present a BiLSTM-CRF model with attention. Pazhouhi (2018) studies the problem of product name recognition, but uses a fully supervised approach. In contrast, our method is semi-supervised and uses only very few seed labels.

Conclusion
In this work, we introduce a bootstrapped, iterative NER model that integrates a PU learning algorithm for recognizing named entities in a low-resource setting. Our approach combines dictionary-based labeling with syntactically-informed label expansion to efficiently enrich the seed dictionaries. Experimental results on a dataset of manually annotated e-commerce product descriptions demonstrate the effectiveness of the proposed framework.