Similarity Based Auxiliary Classifier for Named Entity Recognition

The segmentation problem is one of the fundamental challenges in named entity recognition (NER): reducing boundary errors when detecting a sequence of entity words. A considerable number of advanced approaches have been proposed, and most of them exhibit performance deterioration as entities become longer. Inspired by previous work in which a multi-task strategy is used to solve segmentation problems, we design a similarity-based auxiliary classifier (SAC) that distinguishes entity words from non-entity words. Unlike conventional classifiers, SAC uses vectors to represent tags. SAC can therefore calculate the similarities between words and tags and then compute a weighted sum of the tag vectors, which serves as a useful feature for the NER task. Empirical results verify the rationality of the SAC structure and demonstrate the SAC model's potential for performance improvement over our baseline approaches.


Introduction
Named entity recognition (NER) focuses on extracting a specific sequence and classifying it into a predefined category. As a fundamental and important task in natural language understanding and information extraction, NER can be considered a sequence labelling task, alongside part-of-speech (POS) tagging, chunking, and semantic role labelling (SRL) (Collobert et al., 2011). Owing to the growing popularity of deep learning techniques, a considerable number of neural network-based approaches, often combined with traditional conditional random fields (CRF), have been proposed and are widely used in many sequence labelling tasks, obtaining impressive results (Ye and Ling, 2018; Wu et al., 2018; Ghaddar and Langlais, 2018).

Figure 1: The segmentation problem in NER. "State Street Bank and Trust Company" is a six-word organization entity; it would be unsatisfactory to consider "State Street Bank" and "Trust Company" to be two entities.

To enhance NER's performance, significant effort has been devoted to this area, and one direction is to employ useful information at the input end. This can be realized by including hand-crafted features (Collobert et al., 2011; Passos et al., 2014; Huang et al., 2015). To reduce heavy feature engineering, some researchers have argued for the exclusion of task-specific features and for the automatic mining of character-level and word-level features (Ma and Hovy, 2016). Such end-to-end models can potentially improve performance without hand-crafted features. While character- and word-level features demonstrate promising results, Ye and Ling (2018) determined that such methods (i.e., LM-BiLSTM-CRF) are relatively less effective for longer entity recognition. As indicated by Zhuo et al. (2016), NER is a segment-level task, and NER models should be able to extract segment information, while word-level labels tend to capture more word-level information than segment-level information. Such limitations can manifest as entity boundary errors, referred to as the segmentation problem. This problem is illustrated in Figure 1.
There have been many attempts to address the segmentation problem. Some researchers employed segment-level information, e.g., Semi-CRF (SCRF) (Zhuo et al., 2016). Though SCRF exhibits promising results for longer entity recognition, it requires additional feature extractors and exhibits some limitations for shorter entity recognition. Ye and Ling (2018) proposed using CRF jointly with SCRF to address this problem. Researchers have also adopted multi-task strategies to predict boundaries. For example, Stratos (2016) divides NER into a two-step task in which the boundary is first predicted and then the entity type of the boundary is determined. Though this method is easy to implement, it relies heavily on the accuracy of the first step. Another alternative is to employ auxiliary tasks. For example, Aguilar et al. (2017) created two auxiliary tasks to help determine entity boundaries and types. However, the authors treated the auxiliary tasks purely as supervisors, which means that the outputs of the auxiliary tasks are not fully utilized.
In this study, inspired by the research indicated previously, we design an auxiliary classification task that aims to distinguish entity words from non-entity words before executing the main NER task. Unlike prior studies, the outputs of the auxiliary task do not directly determine the entity boundaries but rather serve as features for assisting the main task. Therefore, the auxiliary task acts more like an "advisor" than a supervisor to the entire model. Our auxiliary classifier differs from conventional classifiers, which compute the probability of an entity word using the softmax function. Instead, we initialize two trainable vectors for the classifier to represent the "True" and "False" tags, and then determine the similarity scores between each word and the two tag vectors to indicate the probability that the word is an entity word. The hypothesis is that tag vectors can serve as helpful features that are shared by a majority of entity words and non-entity words in NER tasks. Finally, we obtain the weighted sum of the tag vectors using the similarity scores. In this study, this process is referred to as a similarity-based auxiliary classifier (SAC); code is available at https://github.com/XiaoShiyuan/NCRF-SAC. The empirical results verify the rationality of SAC's structure and demonstrate that the SAC model can reduce segmentation problems and perform well in recognizing entities of different lengths.
The contributions of this paper are as follows: 1) We introduce a similarity-based auxiliary classifier (SAC), which allows the model to determine whether a word is an entity word before predicting its entity label, to overcome the segmentation problem. 2) By analyzing the experimental results, we verify the rationality of SAC's structure and demonstrate the advantages of using the weighted sum of tag vectors over a list of scores denoting classification results. 3) Without using any hand-crafted features, NeuralCRF+SAC outperforms existing state-of-the-art end-to-end models on CoNLL 2003 and Ontonotes 5.0. When using external resources like ELMo, NeuralCRF+SAC still exhibits comparable results.
Related Work

NER is a fundamental task, and traditional methods usually adopt statistical machine learning models, e.g., Hidden Markov Models (HMM), Maximum-Entropy Markov Models (MEMM), and Conditional Random Fields (CRF) (Bikel et al., 1998; McCallum et al., 2000; Lafferty et al., 2001). To improve overall performance, many hand-crafted features such as POS tags, chunking tags, and gazetteers are required and integrated into the task (Chieu and Ng, 2002; Florian et al., 2003; Ando and Zhang, 2005).
With the growing popularity of deep learning techniques, neural networks have been widely adopted for NER tasks. Collobert et al. (2011) initiated the trend of using neural networks for sequence labelling tasks, and Huang et al. (2015) combined a BiLSTM with CRF for NER. Because of the powerful feature mining ability of neural networks, some end-to-end models employ neural structures to extract character features and obtain state-of-the-art results without using hand-crafted features (Lample et al., 2016; Ma and Hovy, 2016). Meanwhile, hand-crafted features such as POS tags, word capitalization, and lexicons remain commonly used (Chiu and Nichols, 2016; Ghaddar and Langlais, 2018; Wu et al., 2018). It was recently shown that using contextualized representations from pretrained language models can significantly improve performance (Peters et al., 2017, 2018; Devlin et al., 2018).
To address the segmentation problem, several methods have been proposed, among which Semi-CRF (SCRF) is a successful solution for segment-level modelling (Zhuo et al., 2016). Ye and Ling (2018) proposed an improved version of SCRF, namely HSCRF. Using a multi-task strategy to combine HSCRF and CRF, the HSCRF model exhibits state-of-the-art results without using external resources. In particular, HSCRF resolves the segmentation problem by matching entity lengths.
In addition, other approaches consider multi-task learning to predict entity boundaries. Stratos (2016) divides NER into two tasks: a CRF first predicts the location and boundary of entities, and the entities are then classified. Aguilar et al. (2017) used the outputs of their feature extractor to perform segmentation, categorization, and then sequential classification. Our method exploits the strengths of multi-task learning and proposes a similarity-based auxiliary classifier (SAC). Unlike existing studies, which usually consider auxiliary tasks only as supervisors, we also use the output of SAC to support the NER task. Meanwhile, we employ vectors to denote the "True" and "False" tags, which contain more useful information than simply using "1" and "0" to denote tags.

Overall Architecture
We implement our framework based on BiLSTM-CNN-CRF (NeuralCRF) (Ma and Hovy, 2016), a robust and widely used neural network for English NER tasks, and we build SAC parallel to the BiLSTM. The overall architecture of NeuralCRF+SAC is presented in Figure 2.

Embedding layer. Denoting the input sequence as X = {x_1, x_2, ..., x_T}, we first obtain the word-level vectors W = {w_1, w_2, ..., w_T}. Next, a char-CNN is employed to obtain character-level features C = {c_{w_1}, c_{w_2}, ..., c_{w_T}} (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016). Finally, we use [w_i; c_{w_i}] to denote x_i, where [;] indicates concatenation.

Feature layer. In this layer, we employ a BiLSTM to extract contextual features and SAC to generate weighted vectors. Here, we use SAC to denote all operations of the classifier, which we introduce in detail in Section 3.2. The three outputs are then concatenated and passed through a linear transformation.

CRF layer. Because of its ability to utilize information from neighboring labels, CRF is widely employed in sequence tagging tasks. Denoting the label sequence as Y = {y_1, y_2, ..., y_T}, the collection of all possible labels as Y, and the semantic feature sequence (the outputs of the feature layer) as Z = {z_1, z_2, ..., z_T}, the CRF defines a family of conditional probabilities p(y|Z) via a potential function over adjacent labels and features.
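As a concrete illustration of the potential function, the score of a single label sequence in a linear-chain CRF can be sketched as follows. This is a minimal numpy sketch; the array names and toy shapes are illustrative and not the paper's implementation.

```python
import numpy as np

def crf_sequence_score(emissions, transitions, labels):
    """Score of one label sequence under a linear-chain CRF.

    emissions:   (T, L) array, emission score of each label at each step.
    transitions: (L, L) array, transitions[i, j] = score of moving i -> j.
    labels:      length-T list of label indices.
    """
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    return score

# p(y|Z) is then exp(score) normalized over the scores of all
# possible label sequences (the partition function).
```

The conditional probability p(y|Z) follows by exponentiating this score and normalizing over all label sequences.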
For the training process, we minimize the negative log-likelihood L_CRF = -log p(y|Z), while the Viterbi algorithm (Forney, 1973) is used to decode the best label sequence.
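Viterbi decoding for a linear-chain CRF can be sketched in a few lines; this is a generic numpy sketch of the standard algorithm, not the paper's TensorFlow implementation.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions:   (T, L) per-step label scores.
    transitions: (L, L) label-to-label transition scores.
    """
    T, L = emissions.shape
    score = emissions[0].copy()            # best score ending in each label
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in label i at t-1, then j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final label
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

Training maximizes the log-likelihood over all sequences, while decoding only needs this max-sum recursion.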

Similarity based Auxiliary Classifier
We believe that SAC should concentrate on a word and its nearest neighbors rather than on long-term dependencies, because information from the word itself and its nearest neighbors contributes the most to determining whether the word is an entity word. Compared to an RNN, a CNN has the distinct advantage of accurately controlling the length of the context, which satisfies our requirement. Therefore, SAC is based on a CNN. The structure of this module is presented in Figure 3.

To prevent the loss of positional information while using a CNN, we implement a position encoder (Vaswani et al., 2017). The new embeddings are then sent to the convolutional block to extract uni-gram, bi-gram, and tri-gram information. We denote the output of this convolutional block as H^c = {h^c_1, h^c_2, ..., h^c_T}. Intuitively, we adopt a multiplicative attention mechanism (Luong et al., 2015) for computing similarity. For the purpose of classifying entity words and non-entity words, we use t_i (i ∈ {0, 1}) to denote non-entity words and entity words, respectively. We then randomly initialize two tag vectors v_i ∈ R^{d_h} (i ∈ {0, 1}) as semantic representations of t_i, where W_n ∈ R^{d_h×n}, W_a ∈ R^{n×(n+d_h)}, and b_a ∈ R^n are the parameters of the attention computation. Innovatively, we use a_{ij}, the similarity score between the i-th tag and h^c_j, as the classification result. For instance, if the j-th word is an entity word, its feature representation h^c_j should have a higher similarity score with v_1. Therefore, SAC operates like a "supervised cluster," separating entity words from non-entity words and denoting the two "clustering centers" with the two tag vectors. According to Equations 8 and 9, the classification information related to the j-th word can be expressed as a weighted sum s_j, which is then concatenated with h^c_j. Finally, we obtain the output H^a = {h^a_1, h^a_2, ..., h^a_T}.
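The similarity scoring and weighted tag sum can be sketched as follows. This is a simplified numpy sketch that uses plain dot-product (multiplicative) attention; the paper's projection matrices W_n and W_a are omitted, and all function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sac_weighted_tags(h_conv, tag_vectors):
    """Similarity-based classification in the spirit of SAC (simplified).

    h_conv:      (T, d) convolutional features, one row per word.
    tag_vectors: (2, d) trainable vectors for the "False"/"True" tags.

    Returns (T, 2d): each word's features concatenated with the
    similarity-weighted sum of the two tag vectors.
    """
    # multiplicative attention: similarity of each word to each tag vector
    scores = h_conv @ tag_vectors.T            # (T, 2)
    weights = softmax(scores, axis=-1)         # a_{ij}, rows sum to 1
    weighted = weights @ tag_vectors           # (T, d) weighted tag sum s_j
    return np.concatenate([h_conv, weighted], axis=-1)
```

A word whose features align with the "True" tag vector receives a weighted sum close to that vector, so downstream layers can read the classification result directly from the feature representation.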
To train this auxiliary classifier, we define an attention loss L_A over the similarity scores. By combining L_CRF and L_A, we obtain the loss function L = L_CRF + λ·L_A. We obtain the best and most stable results with λ = 0.05 when using the IOBES tag scheme.
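The exact formula for the attention loss is not reproduced here; one plausible form, assumed purely for illustration, is a cross-entropy over the softmaxed similarity scores against binary entity/non-entity labels, combined with the CRF loss as described above.

```python
import numpy as np

def attention_loss(weights, is_entity):
    """An assumed cross-entropy form of the attention loss L_A.

    weights:   (T, 2) softmaxed similarity scores; column 1 = "True" tag.
    is_entity: length-T array of 0/1 gold entity-word labels.
    """
    probs = weights[np.arange(len(is_entity)), is_entity]
    return -np.log(probs + 1e-12).mean()

def total_loss(l_crf, l_a, lam=0.05):
    # L = L_CRF + lambda * L_A, with lambda = 0.05 per the paper
    return l_crf + lam * l_a
```

Only the combination L = L_CRF + λ·L_A and the value λ = 0.05 come from the paper; the cross-entropy form of L_A is an assumption of this sketch.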

Experimental Setup
Dataset. We performed experiments on CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), Ontonotes 5.0 (Hovy et al., 2006), and the WNUT 2017 shared task (Derczynski et al., 2017). CoNLL 2003 is a widely used dataset comprising data extracted from Reuters news articles. Ontonotes 5.0 is a large-scale dataset comprising data collected from news, weblogs, etc. WNUT 2017 is a small dataset consisting primarily of emerging entities. All datasets are separated into training/dev/test sets, and they include 4/18/6 entity types, respectively. Table 1 presents statistics for the three datasets.
Configuration. All models were implemented with TensorFlow. We used GloVe word embeddings (http://nlp.stanford.edu/data/glove.840B.300d.zip). The hyperparameters are listed in Table 2, and an early stopping strategy was adopted. The main experiments were run 5 times to obtain the mean and standard deviation of the F1 score.

Evaluation. After training, the latest 50 checkpoints were preserved, and the checkpoint with the highest word-level F1 score was used to make predictions on the test set. We used the official CoNLL evaluation script on the CoNLL 2003 and Ontonotes 5.0 datasets and the WNUT evaluation script on the WNUT 2017 dataset.

Development experiments
To verify the rationality of the structure of SAC and the role of the tag vectors, we designed two development experiments on the CoNLL 2003 dataset. One experiment compared the CNN structure in SAC with the BiLSTM structure. The other one compared SAC with a modified TextCNN.
Comparing with an RNN structure. We believe that if a word is an entity word, its nearest neighbors are very likely to be entity words as well. Although there are many one-word entities, these entities usually share an obvious feature such as capital letters, which can be easily captured by a char-CNN. On the contrary, if a word is a non-entity word, under most circumstances there would be at most one entity word among its adjacent words. However, when considering long-term dependencies, a word can be affected by several preceding or following words, which could weaken the influence of adjacent words and introduce unnecessary noise. Therefore, we chose a CNN instead of an RNN in SAC, and we set the maximum kernel size to 3.
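The context-length control described above can be illustrated with a toy convolution. This numpy sketch uses a mean filter as a stand-in for learned kernels, so it is not the paper's implementation; it only shows how kernel sizes 1, 2, and 3 bound each word's receptive field.

```python
import numpy as np

def ngram_conv(embeddings, kernel_size):
    """1-D convolution over a word sequence with 'same' padding: each
    output position sees exactly `kernel_size` neighboring words.

    embeddings: (T, d) word vectors; returns (T, d) averaged n-gram
    windows (a stand-in for a learned convolution filter).
    """
    T, d = embeddings.shape
    left = (kernel_size - 1) // 2
    right = kernel_size - 1 - left
    padded = np.pad(embeddings, ((left, right), (0, 0)))
    return np.stack([padded[t:t + kernel_size].mean(axis=0) for t in range(T)])

def sac_conv_block(embeddings):
    # uni-, bi- and tri-gram views, as in SAC's convolutional block
    return np.concatenate([ngram_conv(embeddings, k) for k in (1, 2, 3)], axis=-1)
```

Unlike an RNN state, which can carry information from arbitrarily distant words, each output here depends on at most three adjacent words.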
To verify whether a CNN structure is more suitable for SAC, we replace the convolutional block with a BiLSTM. In Table 3, we can observe that the maximum F1 value of the entire model is 91.65 when employing a CNN structure compared to 91.20 when using a BiLSTM. To determine the reason for this performance drop, we compare the performance of the two structures on entities of different lengths. From Figure 4, we can observe that if SAC uses a BiLSTM structure, the overall performance still deteriorates as entities become longer, indicating that SAC becomes redundant for the entire model. A possible explanation is that a BiLSTM already exists in NeuralCRF, so using a BiLSTM in SAC may cause NeuralCRF and SAC to extract duplicate features.

Comparing with TextCNN. In addition, we employed a conventional classifier, TextCNN (Kim, 2014), to replace SAC. TextCNN predicts a set of scores for each sentence according to the number of categories; the position of the largest score indicates the category of the sentence. As shown in Figure 5, to enable TextCNN to predict scores for each word, and to make a fair comparison with SAC, we remove the pooling layer of TextCNN and adopt the same convolutional structure as SAC. Meanwhile, the outputs of both TextCNN and SAC include features mined by their convolutional layers. The difference lies in the other part of their outputs: TextCNN produces a series of scores, while SAC provides the weighted sum of the tag vectors. Table 4 demonstrates that NeuralCRF+SAC obtains a maximum F1 score of 91.65, while NeuralCRF+TextCNN obtains 91.47. However, we can observe that TextCNN exhibits higher accuracy on its own classification task. We must note that the accuracy of SAC has a different meaning from the accuracy of TextCNN. For TextCNN, an accuracy score of 98.29% only means that TextCNN assigns 98.29% of the words the right tags.
However, for SAC, an accuracy score of 97.87% also means that 97.87% of the words share common traits: the majority of entity words are similar to the "True" tag vector, while the majority of non-entity words are similar to the "False" tag vector. Therefore, given the higher F1 score obtained by NeuralCRF+SAC, the two tag vectors can be viewed as features that are helpful for the NER task. Just as our baseline approach, NeuralCRF, uses a char-CNN to mine character-level features, SAC automatically mines features that are shared by entity words or non-entity words.

Results on CoNLL 2003 dataset
End-to-end. Because of the popularity of contextualized representations, we introduce ELMo (https://allennlp.org/elmo) in the embedding layer as a form of external resource; the results are presented in Table 6. Observe that our model still exhibits a 0.37 F1 improvement over the base model, which verifies the effect of SAC. BERT(LARGE) still obtains state-of-the-art results, but it requires a considerable amount of resources. In contrast, our method is flexible and resource-friendly.

Results on Ontonotes 5.0 dataset
We present the experimental results on Ontonotes 5.0 in Table 7. Without using external resources, our model exhibits a mean F1 score of 87.24, which still exceeds the previous end-to-end models. When employing ELMo as an external resource, we obtain a mean F1 score of 89.05. Compared to NeuralCRF, SAC brings 0.59 and 0.64 F1 improvements, respectively.

Results on WNUT 2017 shared task
Experimental results on the WNUT 2017 dataset are presented in Table 8. Owing to the fact that this task focuses on unusual and previously unseen entities, all existing models incorporate some hand-crafted features. We include ELMo representations in this experiment. SAC exhibits a 0.84 improvement in F1(entity) and a 1.49 improvement in F1(surface).

Analysis
Performance against entity length. We believe that, before using CRF to predict the final label of a word, using a classifier to determine whether the word is an entity word helps in addressing segmentation problems, particularly when entities become long. Following Ye and Ling (2018)'s study, we compared the performance of NeuralCRF and NeuralCRF+SAC in identifying entities of different lengths on the three datasets. The experimental results on the CoNLL 2003, Ontonotes 5.0, and WNUT 2017 datasets are presented in Figure 6. To eliminate the impact of external resources, we use the end-to-end approach as far as possible; however, because the WNUT 2017 dataset is relatively small and consists mostly of emerging entities, both models include ELMo representations on it. Observe that Figure 6(a) indicates a large improvement when entity lengths are equal to or greater than 4, and Figures 6(b) and 6(c) illustrate consistent improvements across different lengths.
In addition, we compared NeuralCRF+SAC with Ye and Ling (2018)'s work in Table 9. The two models perform comparably when the entity length is less than 4, but NeuralCRF+SAC exhibits consistent improvements when the entity length is equal to or greater than 4, which demonstrates that SAC can improve overall performance in long entity recognition.
Case study. We examined the manner in which SAC handles long entities. Table 10 presents one case from each of the three datasets. We observe that these three cases are quite similar, and NeuralCRF makes almost the same mistake on each of them. Generally speaking, character-level features are quite powerful for the NER task: words written in uppercase letters are easily recognized as entity words, while words written entirely in lowercase letters are more likely to be classified as non-entity words. Therefore, when encountering words like "of", "is", and "and", which are too common to be considered entity words, it is very easy for NeuralCRF to predict an "O" label. In fact, many long entities exhibit a pattern in which almost all words are written in uppercase letters except one, which is why the segmentation problem commonly grows in severity as entities become longer. However, these three examples indicate that SAC can address this problem to some extent. We obtain the similarity scores for these examples and present them in Figure 7. It is obvious that SAC assigns a high similarity score to "of", "is", and "and" even though they hardly include any features of entity words.
A possible reason is that the CNN structure of SAC plays a key role. As discussed in Section 4.2, a CNN can control the length of the context. When the kernel sizes are set to 1, 2, and 3, the CNN layers extract uni-gram, bi-gram, and tri-gram features that enable SAC to learn from a considerable number of samples that exhibit the same structure as "Time of London." Therefore, SAC reduces truncation errors for long entities caused by lowercase words.

Conclusion
In this work, we extended the BiLSTM-CNN-CRF architecture with SAC to address the segmentation problem. By computing the similarity scores between words and the two tag vectors, SAC provides the weighted sum of the tag vectors, together with the features it mines, to support the main NER task. Because of the high accuracy of SAC, the two tag vectors can be considered features that are shared among entity words and among non-entity words. By reducing boundary errors, SAC addresses the segmentation problem and improves overall performance, particularly when entities become long. Experimental results demonstrate that NeuralCRF+SAC surpasses previous state-of-the-art end-to-end models on the CoNLL 2003 and Ontonotes 5.0 datasets without using external resources. Moreover, when including ELMo, SAC still improves overall performance and is comparable to current state-of-the-art models.