Handling Rare Entities for Neural Sequence Labeling

One great challenge in neural sequence labeling is the data sparsity problem for rare entity words and phrases. Most test set entities appear only a few times or are even unseen in the training corpus, yielding a large number of out-of-vocabulary (OOV) and low-frequency (LF) entities during evaluation. In this work, we propose approaches to address this problem. For OOV entities, we introduce local context reconstruction to implicitly incorporate contextual information into their representations. For LF entities, we present delexicalized entity identification to explicitly extract frequency-agnostic and entity-type-specific representations. Extensive experiments on multiple benchmark datasets show that our model significantly outperforms all previous methods and achieves new state-of-the-art results. Notably, our methods surpass models fine-tuned on pre-trained language models without using any external resources.


Introduction
In the context of natural language processing (NLP), the goal of sequence labeling is to assign a categorical label to each entity word or phrase in a text sequence. It is a fundamental area that underlies a range of applications, including slot filling and named entity recognition. Traditional methods use statistical models; recent approaches are based on neural networks (Collobert et al., 2011; Mesnil et al., 2014; Ma and Hovy, 2016; Strubell et al., 2017; Devlin et al., 2018; Liu et al., 2019a; Luo et al., 2020; Xin et al., 2018) and have made great progress on various sequence labeling tasks.
However, a great challenge for neural-network-based approaches comes from the data sparsity problem (Augenstein et al., 2017). Specifically, in the context of sequence labeling, the majority of entities in the test dataset occur only a few times in the training corpus or are absent altogether.

Table 1: Out-of-vocabulary entities are those absent from the training set (Frequency = 0); low-frequency entities are those with fewer than ten occurrences (Frequency < 10). Percentages of entity occurrences are also shown.

In this paper, we refer to this phenomenon as the rare entity problem. It is different from other types of data sparsity problems, such as the lack of training data for low-resource languages (Lin et al., 2018), in that it stems from a mismatch of entity distributions between training and test rather than from the size of the training data. We present an example of the problem in Table 1: less than 5% of test set entities are frequently observed in the training set, and about 65% of test set entities are absent from the training set. Rare entities can be categorized into two types: out-of-vocabulary (OOV) entities, which are not observed in the training set, and low-frequency (LF) entities, which occur only a few times (e.g., fewer than 10) in the training set. Without proper processing, rare entities can incur the following risks when building a neural network. Firstly, OOV terms may act as noise during inference, as they lack lexical information from the training set (Bazzi, 2002). Secondly, it is hard to obtain high-quality representations of LF entities (Gong et al., 2018). Lastly, high occurrences of OOV and LF entities expose a distribution discrepancy between training and test, which mostly leads to poor performance at test time.
In general, there are two existing strategies attempting to mitigate the above issues: external resources and transfer learning. The external resource approach, for example (Huang et al., 2015), uses external knowledge such as part-of-speech tags for NER or additional information from intent detection for slot filling. However, external knowledge such as part-of-speech tags is not always available in practical applications, and open-source taggers may perform poorly for cross-domain annotations. Character or n-gram features are mainly designed to deal with morphologically similar OOV words. The transfer learning approach, such as using ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), fine-tunes pre-trained models on the downstream task (Liu et al., 2019a). Nevertheless, it does not directly address problems such as the entity distribution discrepancy between training and test. Moreover, our proposed methods surpass these methods without resorting to external resources or large pre-trained language models.

This paper proposes novel techniques that enable sequence labeling models to achieve state-of-the-art performance without using external resources or transfer learning. These are
• local context reconstruction (LCR), which is applied to OOV entities, and
• delexicalized entity identification (DEI), which is applied to LF entities.
Local context reconstruction enables OOV entities to be related to their contexts. A key point is applying a variational autoencoder to model this reconstruction, which is inherently a one-to-many generation process. Delexicalized entity identification aims at extracting frequency-agnostic and entity-type-specific representations, therefore reducing the reliance on high-frequency occurrences of entities. It uses a novel adversarial training technique to achieve this goal. Both methods use an effective random entity masking strategy.
We evaluate the methods on sequence labeling tasks on several benchmark datasets. Extensive experiments show that the proposed methods outperform previous models by a large margin. Detailed analysis indicates that the proposed methods indeed alleviate the rare entity problem. Notably, without using any external knowledge or pre-trained models, the proposed methods surpass the model that uses fine-tuned BERT.

Background
Given an input sequence $X = [x_1, x_2, \cdots, x_N]$ with $N$ tokens, the sequence labeling task aims at learning a functional mapping to obtain a target label sequence $Y = [y_1, y_2, \cdots, y_N]$ of equal length. In the following, we briefly introduce a typical method for sequence labeling and review related techniques we use in deriving our model.
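To make the setup concrete, the following toy example is a sketch with illustrative ATIS-style BIO labels (not drawn from the actual corpus); it shows one input sequence X and its label sequence Y of equal length:

```python
# A toy example of the sequence labeling setup in the BIO scheme: one
# categorical label per token. The sentence and slot labels are illustrative
# ATIS-style annotations, not taken from the actual corpus.
X = ["list", "flights", "to", "indianapolis", "on", "monday"]
Y = ["O", "O", "O", "B-toloc.city_name", "O", "B-depart_date.day_name"]
assert len(X) == len(Y)  # the label sequence has the same length N as the input
```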

Bidirectional RNN + CRF
Recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997) have been widely used for sequence labeling. The majority of high-performance models use a bidirectional RNN (Schuster and Paliwal, 1997) to encode the input sequence $X$ and a conditional random field (CRF) (Lafferty et al., 2001) as a decoder to output $Y$.
The bidirectional RNN first embeds the observation $x_i$ at each position $i$ into a continuous space $\mathbf{x}_i$. It then applies forward and backward operations over the whole sequence time-recursively:

$$\overrightarrow{h}_i = \overrightarrow{f}\big(\overrightarrow{h}_{i-1}, \mathbf{x}_i\big), \qquad \overleftarrow{h}_i = \overleftarrow{f}\big(\overleftarrow{h}_{i+1}, \mathbf{x}_i\big) \quad (1)$$

The CRF computes the probability of a label sequence $Y$ given $X$ as

$$p(Y|X) = \frac{\exp\Big(\sum_{i} \big(\mathbf{G}_{y_{i-1}, y_i} + \mathbf{W}_{y_i} h_i\big)\Big)}{\sum_{Y'} \exp\Big(\sum_{i} \big(\mathbf{G}_{y'_{i-1}, y'_i} + \mathbf{W}_{y'_i} h_i\big)\Big)}, \qquad h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i \quad (2)$$

where $\oplus$ denotes the concatenation operation, and $\mathbf{G}$ and $\mathbf{W}$ are learnable matrices. The label sequence with the maximum score is the output of the model, typically obtained using the Viterbi algorithm. We use the bidirectional RNN + CRF model, in particular Bi-LSTM+CRF (Huang et al., 2015), as the baseline model in our framework; it is shown in the bottom part of Figure 1.
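For concreteness, the following is a minimal PyTorch-style sketch of this baseline. The class name, default hyper-parameters, and the simplified scoring routine are our own illustration; the partition function (forward algorithm) and Viterbi decoding needed for Eq. (2) are omitted for brevity.

```python
# Minimal sketch of the Bi-LSTM + CRF baseline of Eqs. (1)-(2); names and
# hyper-parameters are illustrative, not the authors' exact implementation.
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=300, hidden_dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional encoder: forward and backward states are concatenated (Eq. 1).
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_labels)                 # W in Eq. (2)
        self.transition = nn.Parameter(torch.randn(num_labels, num_labels))   # G in Eq. (2)

    def sequence_score(self, x, y):
        """Unnormalised score of a label sequence y (batch, N) for token ids x (batch, N)."""
        h, _ = self.lstm(self.emb(x))                                  # (batch, N, 2*hidden)
        emit = self.emission(h)                                        # (batch, N, num_labels)
        score = emit.gather(-1, y.unsqueeze(-1)).squeeze(-1).sum(-1)   # emission scores
        score = score + self.transition[y[:, :-1], y[:, 1:]].sum(-1)   # transition scores
        # p(Y|X) in Eq. (2) normalises this score over all label sequences.
        return score
```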

Variational Autoencoder
Figure 1: Overall framework using local context reconstruction and delexicalized entity identification for neural sequence labeling, illustrated on the example "[SOS] list flights to indianapolis with fares on monday morning , please . [EOS]". "[SOS]" and "[EOS]" mark the sequence beginning and ending, respectively. Local context reconstruction is applied between any two successive entities, including the special entities. Delexicalized entity identification is applied to all entities except the special entities.

The above model, together with other encoder-decoder models (Sutskever et al., 2014; Bahdanau et al., 2014), learns deterministic and discriminative functional mappings. The variational autoencoder (VAE) (Kingma and Welling, 2015; Rezende et al., 2014; Bowman et al., 2015), on the other hand, is stochastic and generative.
Using VAE, we may assume a sequence $x = [x_1, x_2, \cdots, x_N]$ is generated stochastically from a latent global variable $z$ with the joint probability

$$p(x, z) = p(z)\, p(x|z) \quad (3)$$

where $p(z)$ is the prior probability of $z$, generally a simple Gaussian distribution, which keeps the model from generating $x$ deterministically. $p(x|z)$ represents a generation density, usually modeled with a conditional language model whose initial state is $z$.
Maximum likelihood training of a model for Eq. (3) involves a computationally intractable integration over $z$. To circumvent this, VAE uses variational inference with a variational distribution of $z$ given by the Gaussian density $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, with vector mean $\mu$ and diagonal covariance $\mathrm{diag}(\sigma^2)$ parameterized by neural networks. VAE also uses the reparameterization trick to obtain the latent variable $z$ as follows:

$$z = \mu + \sigma \odot \epsilon \quad (4)$$

where $\epsilon$ is sampled from a standard Gaussian distribution and $\odot$ denotes the element-wise product. The evidence lower bound (ELBO) of the likelihood $p(x)$ is obtained using Jensen's inequality, $\mathbb{E}_{q(z|x)}\big[\log p(x, z) - \log q(z|x)\big] \le \log p(x)$, as follows:

$$\mathrm{ELBO} = -\,\mathrm{KL}\big(q \,\|\, p\big) - \mathrm{CE}\big(q \,|\, p\big) \le \log p(x) \quad (5)$$

where $\mathrm{KL}(q\|p)$ denotes the Kullback-Leibler divergence between $q(z|x)$ and the prior $p(z)$, and $\mathrm{CE}(q|p) = -\mathbb{E}_{q(z|x)}[\log p(x|z)]$ denotes the cross-entropy reconstruction term. The ELBO can be optimized by alternating between optimizations of the parameters of $q(z|x)$ and $p(x|z)$.

We apply VAE for local context reconstruction from slot/entity tags in Figure 1. This generation process is inherently one-to-many. We observe that VAE is superior to a deterministic model (Bahdanau et al., 2014) in learning representations of rare entities.
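The reparameterization trick of Eq. (4) and the per-dimension closed-form KL term used in the ELBO of Eq. (5) can be written compactly; the following is a minimal sketch with illustrative function names:

```python
# Sketch of the reparameterisation trick (Eq. 4) and the closed-form KL term
# between N(mu, diag(sigma^2)) and the standard Gaussian prior N(0, I).
import torch

def reparameterize(mu, log_sigma):
    eps = torch.randn_like(mu)               # eps ~ N(0, I)
    return mu + torch.exp(log_sigma) * eps   # z = mu + sigma ⊙ eps

def kl_to_standard_normal(mu, log_sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    sigma2 = torch.exp(2.0 * log_sigma)
    return 0.5 * (mu.pow(2) + sigma2 - 1.0 - 2.0 * log_sigma).sum(-1)
```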

Adversarial Training
Adversarial training (Goodfellow et al., 2014), originally proposed to improve robustness to noise in images, has later been extended to NLP tasks such as text classification (Miyato et al., 2015, 2016) and learning word representations (Gong et al., 2018).
We apply adversarial training to learn better representations of low-frequency entities via delexicalized entity identification in Figure 1. A discriminator is trained to differentiate the representations of the original low-frequency entities from the representations of the delexicalized entities. Training aims at obtaining representations that fool the discriminator, thereby achieving frequency-agnostic and entity-type-specific representations.

Local Context Reconstruction
Contrary to conventional methods that explicitly provide abundant lexical features from external knowledge, we implicitly enrich word representations with contextual information by training them to reconstruct their local contexts.
Masking Every entity word $x_i$ in $X$, i.e., every word not associated with the non-entity label "O", is first randomly masked with the OOV symbol "[UNK]" as follows:

$$x^u_i = \begin{cases} \text{[UNK]} & \text{if } \epsilon_i \le p \text{ and } x_i \text{ is an entity word} \\ x_i & \text{otherwise} \end{cases} \quad (6)$$

where the constant $p$ is a threshold and $\epsilon_i$ is uniformly sampled between 0 and 1.
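A minimal sketch of this masking step is given below; the threshold value and the function name are illustrative choices, not the authors' code:

```python
# Random OOV masking: every entity word (label != "O") is replaced by "[UNK]"
# with probability p, as in Eq. (6). The default p is illustrative.
import random

def mask_entities(tokens, labels, p=0.5, unk="[UNK]"):
    masked = []
    for tok, lab in zip(tokens, labels):
        if lab != "O" and random.random() < p:   # epsilon_i <= p for an entity word
            masked.append(unk)
        else:
            masked.append(tok)
    return masked
```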

Forward Reconstruction
In the forward reconstruction process, the forward pass of Eq. (1) is first applied to the sequence $X^u = [x^u_1, x^u_2, \cdots, x^u_N]$ to obtain hidden states $\overrightarrow{h}^u_i$. Then, a forward span representation $m^f_{jk}$ of the local context between positions $j$ and $k$ is obtained using the RNN-minus feature (Wang and Chang, 2016):

$$m^f_{jk} = \overrightarrow{h}^u_k - \overrightarrow{h}^u_j \quad (7)$$

To apply VAE to reconstruct the local context, the mean $\mu^f_{jk}$ and log-variance $\log \sigma^f_{jk}$ are first computed from the above representation:

$$\mu^f_{jk} = \mathbf{W}_{\mu} m^f_{jk}, \qquad \log \sigma^f_{jk} = \mathbf{W}_{\sigma} m^f_{jk} \quad (8)$$

where $\mathbf{W}_{\mu}$ and $\mathbf{W}_{\sigma}$ are learnable matrices. Then, the reparameterization trick in Eq. (4) is applied to $\mu^f_{jk}$ and $\sigma^f_{jk} = \exp(\log \sigma^f_{jk})$ to obtain a global latent variable $z^f_{jk}$ for the local context.
To generate the $i$-th word in the local context sequence $[x_{j+1}, x_{j+2}, \cdots, x_{k-1}]$, we apply an RNN decoder whose initial hidden state is the latent variable $z^f_{jk}$ and whose first observation is the embedding of the "[SOS]" symbol, recursively obtaining hidden states $\overrightarrow{r}^f_i$:

$$\overrightarrow{r}^f_i = \overrightarrow{f}\big(\overrightarrow{r}^f_{i-1}, \mathbf{x}_{i-1}\big), \qquad \overrightarrow{r}^f_j = z^f_{jk}, \quad \mathbf{x}_j = \mathbf{x}_{\text{[SOS]}} \quad (9)$$

This RNN decoder shares parameters with the forward-pass RNN encoder in Eq. (1). We then use a softmax to compute the distribution over words at position $l$:

$$p\big(x_l \mid z^f_{jk}, x_{j+1:l-1}\big) = \mathrm{softmax}\big(\mathbf{W}^f_g \overrightarrow{r}^f_l\big) \quad (10)$$

where $\mathbf{W}^f_g$ is a learnable matrix. Lastly, we compute the KL divergence and cross-entropy of Eq. (5) for the length-$L$ local context sequence ($L = k - j - 1$) as follows:

$$\mathrm{KL}^f_{jk} = \sum_{d} \zeta\big(\mu^f_{jk,d}, \sigma^f_{jk,d}\big), \qquad \mathrm{CE}^f_{jk} = -\sum_{l=j+1}^{k-1} \log p\big(x_l \mid z^f_{jk}, x_{j+1:l-1}\big) \quad (11)$$

where $d$ denotes the hidden dimension index and the closed-form KL divergence $\zeta$ is defined as

$$\zeta(\mu, \sigma) = \frac{1}{2}\big(\mu^2 + \sigma^2 - 1 - \log \sigma^2\big) \quad (12)$$

Backward Reconstruction As with the forward reconstruction, the backward reconstruction is applied to non-adjacent successive entities. The backward pass of Eq. (1) is first applied to the entity-masked sequence $X^u$. Once the backward span representation $m^b_{kj}$ of the local context between positions $k$ and $j$ is obtained as $m^b_{kj} = \overleftarrow{h}^u_j - \overleftarrow{h}^u_k$, the same procedure as the forward reconstruction described above is followed, except that the backward RNN encoder $\overleftarrow{f}(\cdot)$ is used in lieu of the forward RNN encoder in Eq. (9).
The objective for local context reconstruction is

$$J_{vae}(\theta_{lcr}, \theta_{rnn}) = \sum_{(j,k)} \Big[ -\mathrm{KL}^f_{jk} - \mathrm{CE}^f_{jk} - \mathrm{KL}^b_{kj} - \mathrm{CE}^b_{kj} \Big] \quad (13)$$

which maximizes the ELBO with respect to the parameters $\theta_{lcr}$ and $\theta_{rnn}$.
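The sketch below illustrates the forward reconstruction pipeline of Eqs. (7)-(11) for a single local context. Module and tensor names are our own, and, unlike the description above, the decoder here does not share parameters with the encoder; this is purely to keep the example self-contained.

```python
# Sketch of the forward local-context reconstruction between two successive
# entities at positions j and k (the tokens x_{j+1}..x_{k-1} are reconstructed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardLCR(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.to_mu = nn.Linear(hidden_dim, hidden_dim)           # W_mu in Eq. (8)
        self.to_log_sigma = nn.Linear(hidden_dim, hidden_dim)    # W_sigma in Eq. (8)
        # Separate decoder cell only for the sake of a self-contained sketch.
        self.decoder = nn.LSTMCell(emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)              # W_g in Eq. (10)

    def forward(self, h_u, j, k, context_ids, sos_id):
        """h_u: (N, hidden) forward states of the masked sequence; context_ids: x_{j+1}..x_{k-1}."""
        m = h_u[k] - h_u[j]                                        # RNN-minus span rep, Eq. (7)
        mu, log_sigma = self.to_mu(m), self.to_log_sigma(m)
        z = mu + torch.exp(log_sigma) * torch.randn_like(mu)       # Eq. (4)
        kl = 0.5 * (mu.pow(2) + torch.exp(2 * log_sigma) - 1 - 2 * log_sigma).sum()

        # Decode the local context word by word, starting from "[SOS]".
        hx, cx = z.unsqueeze(0), torch.zeros_like(z).unsqueeze(0)
        inp, ce = torch.tensor([sos_id]), 0.0
        for target in context_ids:
            hx, cx = self.decoder(self.emb(inp), (hx, cx))
            logits = self.out(hx)                                  # Eq. (10)
            ce = ce + F.cross_entropy(logits, torch.tensor([target]))
            inp = torch.tensor([target])                           # teacher forcing
        return kl, ce                                              # terms of Eq. (11)
```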

Delexicalized Entity Identification
For low-frequency entities, delexicalized entity identification aims at obtaining frequency-agnostic and entity-type-specific representations.

Delexicalization
We first randomly substitute entity words in the input sequence $X$ with their corresponding labels:

$$x^d_i = \begin{cases} y_i & \text{if } \epsilon_i \le p \text{ and } x_i \text{ is an entity word} \\ x_i & \text{otherwise} \end{cases} \quad (14)$$

where $p$ is a threshold and $\epsilon_i$ is uniformly sampled from $[0, 1]$. We refer to this as delexicalization (Wen et al., 2015), but insert randomness into it.
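A minimal sketch of this randomized delexicalization follows; the function name and the threshold value are illustrative:

```python
# Randomised delexicalization (Eq. 14): each entity word is replaced by its
# label with probability p; the default threshold is an illustrative choice.
import random

def delexicalize(tokens, labels, p=0.5):
    return [lab if lab != "O" and random.random() < p else tok
            for tok, lab in zip(tokens, labels)]

# Example: delexicalize(["to", "indianapolis"], ["O", "B-toloc.city_name"])
# may return ["to", "B-toloc.city_name"] or leave the sentence unchanged.
```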

Representation for Identification
To obtain a representation for identifying whether an entity has been delexicalized to its label, we first apply the forward and backward RNN encoders in Eq. (1) to the delexicalized sequence $X^d$ and concatenate their hidden states into $h^d_i$. For an entity with a span from position $j$ to $k$, its representation $e^d_{jk}$ is obtained by the following average pooling:

$$e^d_{jk} = \frac{1}{k - j + 1} \sum_{i=j}^{k} h^d_i \quad (15)$$

Average pooling is also applied to the hidden states $h_i$ of the original sequence to obtain $e_{jk}$ for the original entity with that span.
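The average pooling of Eq. (15) amounts to a one-line operation over the encoder states; the snippet below is an illustrative sketch assuming h is the (N, 2*hidden) matrix of concatenated hidden states:

```python
# Span representation by average pooling (Eq. 15) over positions j..k inclusive.
import torch

def span_representation(h: torch.Tensor, j: int, k: int) -> torch.Tensor:
    return h[j:k + 1].mean(dim=0)
```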
Discriminator A multi-layer perceptron (MLP) based discriminator with parameters $\theta_D$ is employed to output a confidence score in $[0, 1]$, indicating the probability that an entity has been delexicalized; i.e.,

$$D(e) = \sigma\big(v_d^{\top} \tanh(\mathbf{W}_d\, e)\big) \quad (16)$$

where the parameters $v_d$ and $\mathbf{W}_d$ are learnable and $\sigma(x)$ is the sigmoid function $\frac{1}{1+\exp(-x)}$.
Following the principle of adversarial training, we develop the following minimax objective to train the RNN model $\theta_{rnn}$ and the discriminator $\theta_D$:

$$J_{at}(\theta_{rnn}, \theta_D) = \min_{\theta_{rnn}} \max_{\theta_D} \sum_{(j,k)} \Big[ \log D\big(e^d_{jk}\big) + \log\big(1 - D(e_{jk})\big) \Big] \quad (17)$$

which aims at fooling a strong discriminator $\theta_D$ by optimizing the parameters $\theta_{rnn}$, leading to frequency-agnostic representations.
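The sketch below illustrates the discriminator of Eq. (16) and the adversarial objective of Eq. (17). Class and function names are our own, and the representation dimension assumes concatenated forward/backward states of 500 units each:

```python
# MLP discriminator (Eq. 16) and adversarial objective (Eq. 17): the
# discriminator scores how likely a span representation comes from a
# delexicalized entity; the encoder is trained to make original and
# delexicalized representations indistinguishable.
import torch
import torch.nn as nn

class DelexDiscriminator(nn.Module):
    def __init__(self, rep_dim=1000, hidden_dim=500):
        super().__init__()
        self.W = nn.Linear(rep_dim, hidden_dim)   # W_d in Eq. (16)
        self.v = nn.Linear(hidden_dim, 1)         # v_d in Eq. (16)

    def forward(self, e):
        # sigma(v_d^T tanh(W_d e))
        return torch.sigmoid(self.v(torch.tanh(self.W(e)))).squeeze(-1)

def adversarial_loss(disc, e_orig, e_delex):
    """Quantity maximised by the discriminator and minimised by the encoder (Eq. 17)."""
    eps = 1e-8
    return (torch.log(disc(e_delex) + eps) +
            torch.log(1.0 - disc(e_orig) + eps)).mean()
```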

Training Algorithm
Notice that the model has three modules, each with its own objective. We update their parameters jointly using Algorithm 1. The algorithm first improves the discriminator $\theta_D$ to identify delexicalized items. It then updates $\theta_{lcr}$ and $\theta_{rnn}$ with joint optimization of $J_{vae}$ and $J_{at}$ so that $\theta_{rnn}$ learns to fool the discriminator. As VAE optimization of $J_{vae}$ suffers from the posterior collapse problem, we adopt the KL cost annealing strategy and word dropout techniques (Bowman et al., 2015). Finally, the algorithm updates both $\theta_{rnn}$ and $\theta_{emb}$ in Bi-LSTM+CRF by gradient ascent according to Eq. (2). Note that $\theta_{lcr}$ shares parameters with $\theta_{rnn}$ and $\theta_{emb}$. During experiments, we also find it beneficial to pretrain the parameters $\theta_{rnn}$ and $\theta_{emb}$ for a few epochs by optimizing Eq. (2).
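The following sketch outlines one pass of the alternating updates described above. The model interface (entity_representations, local_context_reconstruction, crf_negative_log_likelihood) and the KL-annealing weight are assumptions for illustration, not the authors' released code; adversarial_loss refers to the sketch above.

```python
# Illustrative sketch of the alternating updates in Algorithm 1; the model
# interface and the KL-annealing schedule are assumed, not taken from the paper.
import torch

def train_epoch(batches, model, discriminator, opt_disc, opt_model, kl_weight):
    for batch in batches:
        # 1) Improve the discriminator on original vs. delexicalized spans.
        e_orig, e_delex = model.entity_representations(batch)       # hypothetical interface
        d_loss = -adversarial_loss(discriminator, e_orig.detach(), e_delex.detach())
        opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

        # 2) Update encoder/LCR parameters: maximise the ELBO (kl_weight is
        #    annealed against posterior collapse) and fool the discriminator.
        kl, ce = model.local_context_reconstruction(batch)          # hypothetical interface
        g_loss = kl_weight * kl + ce + adversarial_loss(
            discriminator, *model.entity_representations(batch))

        # 3) Add the Bi-LSTM + CRF tagging objective of Eq. (2) and step.
        g_loss = g_loss + model.crf_negative_log_likelihood(batch)  # hypothetical interface
        opt_model.zero_grad(); g_loss.backward(); opt_model.step()
```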

Experiments
This section compares the proposed model against state-of-the-art models on benchmark datasets.
Baselines We compare the proposed model with five types of methods: 1) a strong baseline (Lample et al., 2016) that uses character embeddings to improve the sequence tagger; 2) recent state-of-the-art models for slot filling (Qin et al., 2019; Liu et al., 2019b) that utilize multi-task learning to incorporate additional information from intent detection; 3) recent state-of-the-art models for NER, including Liu et al. (2019a); 4) the Bi-LSTM + CRF model augmented with external resources (i.e., POS tagging using the Stanford Parser); and 5) the Bi-LSTM + CRF model with word embeddings from fine-tuned BERT-LARGE (Devlin et al., 2018). Results are reported in F1 scores. We follow most of the baseline performances reported in (Lample et al., 2016; Liu et al., 2019b; Qin et al., 2019; Liu et al., 2019a) and rerun the open-source toolkits NCRFpp, LM-LSTM-CRF, and GCDT on the slot filling tasks.

Implementation Details We use the same configuration for all datasets. The hidden dimension is set to 500. We apply dropout to hidden states with a rate of 0.3. L2 regularization is set to $1 \times 10^{-6}$ to avoid overfitting. Following (Liu et al., 2019a), we adopt the cased, 300d GloVe embeddings (Pennington et al., 2014) to initialize word embeddings. We utilize the Adam algorithm (Kingma and Ba, 2015) to optimize the models and adopt the suggested hyper-parameters.
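For reference, the stated configuration can be collected into a single illustrative dictionary; the key names are ours, and only the values come from the text above:

```python
# Training configuration summarised from the implementation details above;
# key names are illustrative, values are those stated in the text.
CONFIG = {
    "hidden_dim": 500,
    "dropout": 0.3,
    "l2_regularization": 1e-6,
    "word_embeddings": "GloVe, cased, 300d",
    "optimizer": "Adam with the suggested hyper-parameters",
}
```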

Main Results
The main results of the proposed model on ATIS, SNIPS and CoNLL-03 are presented in Table 2. The proposed model outperforms all other models on all tasks by a substantial margin. On the slot filling tasks, the model obtains average improvements of 0.15 points on ATIS and 1.53 points on SNIPS over CM-Net and Stack-Propagation, without using the extra information from joint modeling of slots and intents in those models. In comparison to the prior state-of-the-art model GCDT, the improvements are 0.03 points on ATIS, 2.17 points on SNIPS and 0.71 points on CoNLL-03.
Compared with the strong baseline (Lample et al., 2016) that utilizes character embeddings to improve Bi-LSTM + CRF, the gains are even larger: the model obtains improvements of 0.84 points on ATIS, 3.49 points on SNIPS and 1.73 points on CoNLL-03 over Bi-LSTM + CRF and LM-LSTM-CRF.
Finally, we have tried improving the baseline Bi-LSTM+CRF in our model with external resources of lexical information, including part-of-speech tags, chunk tags and character embeddings. However, their F1 scores are consistently below the proposed model by an average of 1.47 points. We also replace the word embeddings in Bi-LSTM+CRF with those from fine-tuned BERT-LARGE, but its results are still worse than the proposed model by 0.07 points, 1.05 points and 0.14 points on ATIS, SNIPS and CoNLL-03, respectively.

Analysis
It is noteworthy that the substantial improvements by the model are obtained without using external resources or large pre-trained models. The keys to its success are local context reconstruction and delexicalized entity identification. This section reports our analysis of these modules.

Ablation Study
Local Context Reconstruction (LCR) We first examine the impact brought by the LCR process.
In Table 3, we show that removing LCR (w/o LCR) hurts performance significantly on SNIPS. We then study whether reconstructing the local context in LCR with a traditional deterministic encoder-decoder can be as effective as using VAE. We make a good-faith attempt, using an LSTM-based language model (Sundermeyer et al., 2012) to generate the local context directly from the local context representation (w/o VAE, w/ LSTM-LM). This does improve results over having no LCR at all, indicating that the information from reconstructing the local context is indeed useful. However, its F1 score is still far worse than that of using VAE. This confirms that VAE is superior to a deterministic model in dealing with the inherently one-to-many generation of local context from entities. Lastly, we examine the impact of OOV masking and observe that the F1 score without it (w/o OOV masking) drops about 1.6 points below the full model. We attribute this improvement from OOV masking to mitigating the entity distribution discrepancy between training and test.

These results show that both local context reconstruction and delexicalized entity identification contribute greatly to the improved performance of the proposed model. Because both LCR and DEI share the same RNN encoder as the baseline Bi-LSTM, the information gained from reconstructing the local context and from fooling the delexicalization discriminator helps the Bi-LSTM better predict sequence labels.

Rare Entity Handling
In this section, we compare models specifically by the numbers of OOV and LF entities they can recall correctly. Such a comparison reveals the capability of each model in handling rare entities.
Results are presented in Table 4. Without using any external resources or pre-trained models, the proposed model recalls 3.66% more OOV entities and 3.96% more LF entities than LM-LSTM-CRF. The gain is similar when comparing against Bi-LSTM+CRF. Furthermore, the proposed model also recalls more rare entities than GCDT, a recent state-of-the-art model for NER. Separately using LCR or DEI improves performance over the baseline Bi-LSTM+CRF, and their gains are complementary: jointly applying LCR and DEI obtains the best performance. These results demonstrate convincingly the capability of local context reconstruction and delexicalized entity identification in handling rare entities.
Importantly, results in the last two rows reveal that large improvements can still be achieved, since 15.34% of OOV entities and 13.35% of LF entities remain unrecalled.

Figure 3: Visualization of learned representations on the CoNLL-03 test dataset. Entity types are represented by different shapes, with red for PER, blue for ORG, green for LOC and orange for MISC. Rare entities are shown as larger points. Points marked with "X" are delexicalized entities.

Representation for Delexicalized Entity Identification
We visualize the representations learned by Eq. (15) using t-SNE (Maaten and Hinton, 2008) in Figure 3. It shows 2-dimensional projections of 800 randomly sampled entities from the CoNLL-03 dataset. Figure 3 clearly shows that entities are separable by entity type, but there is no separation between low-frequency and frequent entities. This observation is consistent with the minimax objective in Eq. (17), which learns entity-type-specific and frequency-agnostic representations.
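A minimal sketch of how such a projection can be produced with scikit-learn's t-SNE is shown below; entity_reps and entity_types are placeholders for the learned representations of Eq. (15) and their gold entity types:

```python
# Sketch of a t-SNE projection of entity representations, as used for Figure 3;
# the inputs are placeholders and the plotting choices are illustrative.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_entity_space(entity_reps, entity_types):
    xy = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(entity_reps))
    for t in sorted(set(entity_types)):
        idx = [i for i, et in enumerate(entity_types) if et == t]
        plt.scatter(xy[idx, 0], xy[idx, 1], label=t, s=10)  # one colour per entity type
    plt.legend()
    plt.savefig("entity_tsne.png")
```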

Handling Data Scarcity
This section investigates the proposed model under data scarcity. On ATIS, the percentage of training samples is reduced in steps of 20%, down to 20% of the original size. This setting is challenging, and few previous works have experimented with it. Results in Figure 4 show that the proposed model consistently outperforms other models, especially in low-resource conditions. Furthermore, the performance reductions of the proposed model are much smaller than those of other models. For instance, at 40% of the training data, the proposed model only loses 1.17% of its best F1 score whereas GCDT loses 3.62% of its F1 score. This suggests that the proposed model is more robust to low-resource conditions than other models.

Related Work
Neural sequence labeling has been an active field in NLP, and we briefly review recently proposed approaches related to our work.
Slot Filling and NER Neural sequence labeling has been applied to slot filling (Mesnil et al., 2014; Zhang and Wang, 2016; Liu and Lane, 2016; Qin et al., 2019) and NER (Huang et al., 2015; Strubell et al., 2017; Devlin et al., 2018; Liu et al., 2019a). For slot filling, multi-task learning of joint slot filling and intent detection has dominated the recent literature, for example (Liu and Lane, 2016). The recent work in (Liu et al., 2019b) employs a collaborative memory network to further model the semantic correlations among words, slots and intents jointly. For NER, recent works use explicit architectures to incorporate information such as global context (Liu et al., 2019a) or conduct optimal architecture searches (Jiang et al., 2019). The best-performing models pre-train on large corpora (Baevski et al., 2019) or fine-tune existing pre-trained models (Liu et al., 2019a) such as BERT (Devlin et al., 2018).
External Resource This approach to handling rare entities includes feature engineering methods such as incorporating extra knowledge from part-of-speech tags (Huang et al., 2015) or character embeddings. Extra knowledge also includes tags from public taggers. Multi-task learning has been effective in incorporating additional label information through multiple objectives: joint slot filling and intent detection have been used in (Zhang and Wang, 2016; Qin et al., 2019), and joint part-of-speech tagging and NER have been used in (Lin et al., 2018).
Transfer Learning This approach refers to methods that transfer knowledge from high-resource to low-resource settings (Zhou et al., 2019) or use models pre-trained on large corpora to benefit downstream tasks (Devlin et al., 2018; Liu et al., 2019a). The most recent work (Zhou et al., 2019) applies adversarial training, using a resource-adversarial discriminator to improve performance on low-resource data.

Conclusion
We have presented local context reconstruction for OOV entities and delexicalized entity identification for low-frequency entities to address the rare entity problem. We adopt a variational autoencoder to learn a stochastic reconstructor for the local context and use adversarial training to extract frequency-agnostic and entity-type-specific features. Extensive experiments on both slot filling and NER tasks across three benchmark datasets show that sequence labeling with the proposed methods achieves new state-of-the-art performance. Importantly, without using external knowledge or fine-tuning large pre-trained models, our methods enable a sequence labeling model to outperform models fine-tuned on BERT. Our analysis also indicates large potential for further performance improvements by better exploiting OOV and LF entities.