Named Entity Recognition Only from Word Embeddings

Deep neural network models have helped named entity (NE) recognition achieve remarkable performance without handcrafted features. However, existing systems require large amounts of human-annotated training data. Efforts have been made to replace human annotations with external knowledge (e.g., NE dictionaries, part-of-speech tags), yet obtaining such effective resources is a challenge in itself. In this work, we propose a fully unsupervised NE recognition model that takes informative clues only from pre-trained word embeddings. We first apply a Gaussian Hidden Markov Model and a Deep Autoencoding Gaussian Mixture Model to word embeddings for entity span detection and type prediction, and then design an instance selector based on reinforcement learning to distinguish positive sentences from noisy ones and refine these coarse-grained annotations through neural networks. Extensive experiments on CoNLL benchmark datasets demonstrate that our proposed lightweight NE recognition model achieves remarkable performance without using any annotated lexicon or corpus.


Introduction
Named Entity (NE) recognition is a major natural language processing task that aims to identify words or phrases containing the names of PER (Person), ORG (Organization), LOC (Location), etc. Recent advances in deep neural models allow us to build reliable NE recognition systems (Lample et al., 2016; Ma and Hovy, 2016; Yang and Zhang, 2018). However, these existing methods require large amounts of manually annotated data for training supervised models. There have been efforts to deal with the lack of annotated data in NE recognition: (Talukdar and Pereira, 2010) train a weakly supervised model and use label propagation methods to identify more entities of each type; (Shen et al., 2017) employ deep active learning to efficiently select the set of samples for labeling, thus greatly reducing the annotation budget; (Ren et al., 2015; Fries et al., 2017; Yang et al., 2018b; Jie et al., 2019) use partially annotated data or external resources such as NE dictionaries, knowledge bases, and POS tags as a replacement for hand-labeled data to train distantly supervised systems. However, these methods still have certain requirements for annotation resources. Unsupervised models have achieved excellent results in the fields of part-of-speech induction (Lin et al., 2015; Stratos et al., 2016) and dependency parsing (He et al., 2018; Pate and Johnson, 2016), whereas the development of unsupervised NE recognition remains unsatisfactory. (Liu et al., 2019) design a Knowledge-Augmented Language Model for unsupervised NE recognition, performing NE recognition by controlling whether a particular word is modeled as a general word or as a reference to an entity during language model training. However, their model still requires type-specific entity vocabularies for computing the type probabilities and the probability of a word under a given type.
Early unsupervised NE systems relied on labeled seeds and discrete features (Collins and Singer, 1999), open web text (Etzioni et al., 2005; Nadeau et al., 2006), shallow syntactic knowledge (Zhang and Elhadad, 2013), etc. Word embeddings learned from unlabeled text provide representations rich in syntax and semantics and have been shown to be valuable as features in unsupervised learning problems (Lin et al., 2015; He et al., 2018). In this work, we propose an NE recognition model with word embeddings as the unique feature source. We separate entity span detection and entity type prediction into two steps.
We first use a Gaussian-HMM to learn the latent Markov process among NE labels with the IOB tagging scheme, and then feed the candidate entity mentions to a Deep Autoencoding Gaussian Mixture Model (DAGMM) (Zong et al., 2018) to obtain their entity types. We further apply a BiLSTM and an instance selector based on reinforcement learning (Yang et al., 2018b; Feng et al., 2018) to refine the annotated data. Different from existing distant supervision systems (Ren et al., 2015; Fries et al., 2017; Feng et al., 2018), which generate labeled data from NE lexicons or knowledge bases that are themselves products of human annotation, our model is enhanced only by data automatically labeled by the Gaussian-HMM and DAGMM. The contribution of this paper is a fully unsupervised NE recognition model that depends on no external resources or annotation data other than word embeddings. Empirical results show that our model achieves remarkable results on two benchmark datasets.
The rest of this paper is organized as follows. The next section introduces our proposed basic model in detail. Section 3 further gives a refinement model. Experimental results are reported in Section 4, followed by related work in Section 5. The last section concludes this paper.

Model
As shown in Figure 1, the first layer of the model is a two-class clustering layer, which initializes all the words in the sentences with 0 and 1 tags, where 0 and 1 represent non-NE and NE, respectively. The second layer is a Gaussian-HMM used to generate the boundaries of an entity mention with IOB tagging (Inside, Outside, and Beginning). The representation of each candidate entity span is further fed into a Deep Autoencoding Gaussian Mixture Model (DAGMM) to identify the entity types.

Clustering
The objective of training word embeddings is to let words with similar contexts occupy close spatial positions. (Seok et al., 2016) conduct experiments on the nearest neighbors of NEs and discover that similar NEs are likely to be one another's neighbors, since NEs occur in similar positions in the corpus and are syntactically and semantically related. Based on these discoveries, we perform the K-Means clustering algorithm on the word embeddings of the whole vocabulary. According to the clusters, we assign tag 1 to the words in the smaller cluster and tag 0 to the words in the other cluster (according to the statistics of (Jie et al., 2019), the proportion of NEs on the CoNLL datasets is very small), and generate a coarse NE dictionary from the words with tag 1.
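As a concrete illustration, the clustering step can be sketched in a few lines of numpy. The function below is our own simplification of standard K-Means with two clusters and a deterministic initialization; it is not the authors' implementation.

```python
import numpy as np

def two_class_cluster(embeddings, n_iter=20):
    """Minimal 2-means over word embeddings: the smaller cluster gets tag 1 (NE).

    `embeddings` is an (n_words, dim) array. Deterministic init: the first
    word, plus the word farthest from it.
    """
    far = np.linalg.norm(embeddings - embeddings[0], axis=1).argmax()
    centers = np.stack([embeddings[0], embeddings[far]]).astype(float)
    for _ in range(n_iter):
        # assign each word to its nearest center, then recompute the centers
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(2):
            if (assign == k).any():
                centers[k] = embeddings[assign == k].mean(axis=0)
    # NEs are rare, so the minority cluster is tagged 1
    minority = np.bincount(assign, minlength=2).argmin()
    return (assign == minority).astype(int)
```

In practice the paper's setup would run this (or library K-Means) over the whole vocabulary and read the 0/1 tags off per word.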

Gaussian HMM
Hidden Markov model is a classic model for NE recognition (Zhou and Su, 2002; Zhao, 2004), since a hidden transition structure exists among the NE labels in IOB format (Sarkar, 2015). We follow the Gaussian hidden Markov model introduced by (Lin et al., 2015; He et al., 2018). Given a sentence of length $l$, we denote the latent NE labels as $z = \{z_i\}_{i=1}^{l}$, the word embeddings as $x = \{x_i\}_{i=1}^{l}$, the cluster embeddings as $v$, and the transition parameters as $\theta$. The joint distribution of observations and latent labels is given as:
$$p(x, z) = \prod_{i=1}^{l} p(z_i \mid z_{i-1}) \, p(x_i \mid z_i),$$
where $p(z_i \mid z_{i-1})$ is the multinomial transition probability and $p(x_i \mid z_i)$ is the multivariate emission probability, which represents the probability of a particular label generating the embedding at position $i$. Cluster features (the 0/1 tags) carry much word-level categorization information and can indicate the distributional representation; we map them to 3-dimensional cluster embeddings $v \in \mathbb{R}^{2 \times 3}$ through a lookup table.
Gaussian Emissions Given a label $t \in \{B, I, O\}$, we adopt a multivariate Gaussian distribution with mean $\mu_t$ and covariance matrix $\Sigma_t$ as the emission probability. The conditional probability density has the form:
$$p(x_i \mid z_i = t) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_t|}} \exp\!\Big(-\frac{1}{2}(x_i - \mu_t)^\top \Sigma_t^{-1} (x_i - \mu_t)\Big), \quad (2)$$
where $d$ is the dimension of the embeddings and $|\cdot|$ denotes the determinant of a matrix. The equation assumes that embeddings of words labeled as $t$ are concentrated around the point $\mu_t$, with the concentration attenuated according to the covariance matrix $\Sigma_t$.
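The emission density in Eq. (2) translates directly to code. The numpy sketch below computes the log-density of one embedding under a tag's Gaussian; the function name and interface are ours, not the authors'.

```python
import numpy as np

def gaussian_log_emission(x, mu, sigma):
    """Log-density of the multivariate Gaussian emission p(x | z = t), Eq. (2).

    x: (d,) word embedding; mu: (d,) mean for tag t; sigma: (d, d) covariance.
    """
    d = x.shape[0]
    diff = x - mu
    # slogdet is numerically safer than log(det(sigma))
    _, logdet = np.linalg.slogdet(sigma)
    # Mahalanobis term via a linear solve instead of an explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
```

Inside the HMM, these per-tag log-densities play the role of emission scores in the forward-backward or Viterbi recursions.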
During training, we maximize the joint distribution over the sequence of observations $x$, the cluster sequence $v$, and the latent label sequence $z$:
$$p(x, v, z; \theta) = \prod_{i=1}^{l} p(z_i \mid z_{i-1}) \, p(x_i \mid z_i) \, p(v_i \mid z_i).$$
We present two techniques to refine the output of the Gaussian-HMM.
Single-word NEs We check the experimental results of the Gaussian-HMM and discover that it performs well on the recognition of multi-word NEs but inferiorly on single-word NEs, to which many false-positive labels are assigned, so further word-level discrimination is needed. For a single-word NE identified by the above model, if its probability of being marked as an NE across the corpus is less than one half and it does not appear in the coarse NE dictionary generated in Section 2.1, we modify it to non-NE. Through this modification, the precision is greatly improved without significantly reducing the recall.
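The single-word filter above amounts to a simple rule over corpus-level counts. A sketch follows; the counters and the coarse-dictionary set are assumptions about the bookkeeping, not the authors' data structures.

```python
from collections import Counter

def filter_single_word_ne(token, ne_count, total_count, coarse_dict):
    """Post-hoc filter for a single-word NE prediction.

    Keep the NE label only if the token was tagged as an NE at least half
    the time across the corpus, or appears in the coarse NE dictionary
    produced by the clustering step. `ne_count` / `total_count` are Counters
    over how often each token was tagged NE and how often it occurred.
    """
    ne_ratio = ne_count[token] / max(total_count[token], 1)
    return ne_ratio >= 0.5 or token in coarse_dict
```

Tokens for which this returns `False` would be relabeled as non-NE.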
High-Quality Phrases Another issue of the above models is false-negative labels: a long NE may be divided into several short NEs, in which case we need to merge them with phrase matching. We adopt a filter to determine high-quality phrases according to word co-occurrence information in the corpus:
$$\mathrm{score}(w_i, w_{i+1}) = \frac{p(w_i w_{i+1}) \times n}{p(w_i) \times p(w_{i+1})} > T,$$
where $p(\cdot)$ represents the frequency of a word (or word pair) appearing in the corpus, $n$ is the total number of words, and $T$ is the threshold, which is set to the default value used in word2vec for training phrase embeddings. After obtaining candidate entity span mentions, we represent them by separating the words in them into two parts, the boundary and the internal (Sohrab and Miwa, 2018). The boundary part is important for capturing the contexts surrounding the region, so we directly take the word embeddings as its representation. For the internal part, we simply average the embedding of each word to treat them equally. In summary, given the word embeddings $x_i$, we obtain the representation $x(i, j)$ of $NE(i, j)$ as follows:
$$x(i, j) = \Big[\, x_i \,;\; \frac{1}{j - i + 1} \sum_{k=i}^{j} x_k \,;\; x_j \,\Big].$$
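The boundary/internal span representation can be sketched in a couple of lines. The concatenation order (left boundary, internal average, right boundary) is our assumption; the paper only specifies the boundary/internal split.

```python
import numpy as np

def span_representation(embeddings, i, j):
    """Representation of candidate span NE(i, j) from its word embeddings.

    Concatenates the two boundary word embeddings with the average of all
    words inside the span, following the boundary/internal split of
    Sohrab and Miwa (2018). `embeddings` is an (l, dim) array.
    """
    internal = embeddings[i:j + 1].mean(axis=0)
    return np.concatenate([embeddings[i], internal, embeddings[j]])
```

The resulting vector (three times the embedding dimension) is what gets fed to the DAGMM type classifier.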

DAGMM
After obtaining candidate entity mentions, we need to further identify their entity types. The Gaussian Mixture Model (GMM) is one method for learning the distribution of each entity type. The experimental results of (Zong et al., 2018) suggest that it is more efficient to perform density estimation in a low-dimensional space, where the distribution of words is denser and more suitable for a GMM. Therefore, we adopt the Deep Autoencoding Gaussian Mixture Model (DAGMM) (Zong et al., 2018) to identify NE types. DAGMM consists of two major components: the compression network utilizes a deep autoencoder to perform dimension reduction and concatenates the reduced low-dimensional representation with reconstruction-error features as the representation for the estimation network; the estimation network takes the low-dimensional representation as input and uses a GMM to perform density estimation.
Compression network contains an encoder function for dimension reduction and a decoder function for reconstruction, both of which are multi-layer perceptrons (MLPs) with the tanh activation function. Given an NE representation $x$, the compression network generates its low-dimensional representation $t$ as follows:
$$t_c = h(x; \theta_e), \quad x' = g(t_c; \theta_d), \quad t_r = f(x, x'), \quad t = [t_c \,;\; t_r],$$
where $\theta_e$ and $\theta_d$ are respectively the parameters of the encoder and decoder, $x'$ is the reconstruction counterpart of $x$, and $f(\cdot)$ denotes the reconstruction-error features; we take the concatenation of the relative Euclidean distance and the cosine similarity as $t_r$ in our experiments. $t$ is then fed into the input layer of the estimation network. Intuitively, we need to make the reconstruction error low to ensure that the low-dimensional representations preserve the key information of the NE representations. Thus the reconstruction error is taken as part of the loss function and is designed as the $L_2$-norm.
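A minimal numpy sketch of the compression network's forward pass is given below. The real network is trained end-to-end by backpropagation; the layer shapes, names, and single-layer MLPs here are purely illustrative.

```python
import numpy as np

def compress(x, enc, dec):
    """Forward pass of a DAGMM-style compression network (sketch).

    `enc` and `dec` are lists of (W, b) pairs for tanh MLP layers.
    Returns the estimation-network input
    t = [t_c ; relative Euclidean distance ; cosine similarity].
    """
    def mlp(h, layers):
        for W, b in layers:
            h = np.tanh(W @ h + b)
        return h

    t_c = mlp(x, enc)       # low-dimensional code
    x_rec = mlp(t_c, dec)   # reconstruction x'
    rel_dist = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
    cos = x @ x_rec / (np.linalg.norm(x) * np.linalg.norm(x_rec))
    return np.concatenate([t_c, [rel_dist, cos]])
```

Appending the two reconstruction-error features to the code is what lets the estimation network see how well each input was reconstructed, not just where it landed in the low-dimensional space.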
Estimation network contains an MLP to predict the mixture membership for each instance and a GMM with unknown mixture-component distribution $\phi$, mixture means $\mu$, and covariance matrices $\Sigma$ for density prediction. During the training phase, the estimation network estimates the parameters of the GMM and evaluates the likelihood of the instances. Given the low-dimensional representations $t$ and the number of entity types $K$ as the number of mixture components, the MLP first maps each representation to the $K$-dimensional space:
$$\hat{\gamma} = \mathrm{softmax}(\mathrm{MLP}(t; \theta_m)),$$
where $\theta_m$ is the parameter of the MLP and $\hat{\gamma}$ is a $K$-dimensional vector for the soft mixture-component membership prediction. Given a batch of $N$ instances, the estimation network estimates the parameters of the GMM as follows ($\forall\, 1 \le k \le K$):
$$\hat{\phi}_k = \frac{1}{N} \sum_{i=1}^{N} \hat{\gamma}_{ik}, \quad \hat{\mu}_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik}\, t_i}{\sum_{i=1}^{N} \hat{\gamma}_{ik}}, \quad \hat{\Sigma}_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik}\, (t_i - \hat{\mu}_k)(t_i - \hat{\mu}_k)^\top}{\sum_{i=1}^{N} \hat{\gamma}_{ik}},$$
where $\hat{\gamma}_i$ is the membership prediction for $t_i$, and $\hat{\phi}_k$, $\hat{\mu}_k$, $\hat{\Sigma}_k$ are the mixture probability, mean, and covariance of component $k$ in the GMM, respectively. The likelihood (sample energy) of an instance is inferred by:
$$E(t) = -\log \Bigg( \sum_{k=1}^{K} \hat{\phi}_k \, \frac{\exp\!\big(-\frac{1}{2}(t - \hat{\mu}_k)^\top \hat{\Sigma}_k^{-1} (t - \hat{\mu}_k)\big)}{\sqrt{|2\pi \hat{\Sigma}_k|}} \Bigg).$$
To avoid the diagonal entries of the covariance matrices degenerating to 0, we penalize small values on the diagonal entries by:
$$P(\hat{\Sigma}) = \sum_{k=1}^{K} \sum_{j=1}^{d} \frac{1}{\hat{\sigma}_{kjj}},$$
where $d$ is the dimension of the low-dimensional representation $t$. During training, we minimize the joint objective function:
$$J = \frac{1}{N} \sum_{i=1}^{N} \|x_i - x'_i\|_2^2 + \frac{\lambda_1}{N} \sum_{i=1}^{N} E(t_i) + \lambda_2 P(\hat{\Sigma}),$$
where $\lambda_1$ and $\lambda_2$ are two user-tunable parameters. The final output is the result of a $K$-way classification ($K$ being the number of entity types). We can only identify whether a word is an NE and whether several NEs are of the same category, since entity type names, like any other user-defined class/cluster/type names, are just a group of pre-defined symbols given by subjective naming. Therefore, following most work on unsupervised part-of-speech induction such as (Lin et al., 2015), we use matching to determine the corresponding entity category of each class, just for evaluation.
Figure 2: The framework of the reinforcement learning model, which consists of two parts. The left instance selector filters sentences according to a policy function, and then the selected sentences are used to train a better NE tagger. The instance selector updates its parameters based on the reward computed from the NE tagger.
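The GMM estimation step of the estimation network can be sketched in numpy. This is a toy version that takes soft memberships as given; in the real model they come from the MLP and are learned jointly with the autoencoder.

```python
import numpy as np

def gmm_params(gamma, t):
    """Estimate GMM parameters from soft memberships (DAGMM estimation step).

    gamma: (N, K) softmax memberships; t: (N, d) low-dim representations.
    Returns mixture weights phi (K,), means mu (K, d), covariances (K, d, d).
    """
    N, K = gamma.shape
    phi = gamma.sum(0) / N
    mu = (gamma.T @ t) / gamma.sum(0)[:, None]
    diff = t[None, :, :] - mu[:, None, :]  # (K, N, d)
    sigma = np.einsum('kn,knd,kne->kde', gamma.T, diff, diff) \
        / gamma.sum(0)[:, None, None]
    return phi, mu, sigma

def sample_energy(t_i, phi, mu, sigma):
    """Negative log-likelihood E(t) of one sample under the estimated GMM."""
    K, d = mu.shape
    dens = 0.0
    for k in range(K):
        diff = t_i - mu[k]
        _, logdet = np.linalg.slogdet(2 * np.pi * sigma[k])
        maha = diff @ np.linalg.solve(sigma[k], diff)
        dens += phi[k] * np.exp(-0.5 * (maha + logdet))
    return -np.log(dens)
```

Samples far from every component get high energy; for NE typing, each mixture component plays the role of one entity type.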

Refinement
The annotations obtained from the above procedure are noisy, so we apply Reinforcement Learning (RL) (Feng et al., 2018; Yang et al., 2018b) to refine the labels. The RL model has two modules: an NE tagger and an instance selector.

NE Tagger
Given the annotations generated by the above model, we take them as noisy labels to train the NE tagger. Following (Lample et al., 2016; Yang et al., 2018a; Yang and Zhang, 2018), we employ a bi-directional Long Short-Term Memory network (BiLSTM) for sequence labeling. In the input layer, we concatenate the word-level and character-level embeddings as the joint word representation. We employ the BiLSTM as the encoder; the concatenation of the forward and backward hidden states is fed into an MLP, and the output of the MLP is then fed into a CRF layer.
CRF (Lafferty et al., 2001) has been included in most state-of-the-art models; it captures label dependencies by adding transition scores between adjacent labels. During decoding, the Viterbi algorithm is used to search for the label sequence with the highest probability. Given a sentence of length $l$, we denote the input sequence as $x = \{x_1, \ldots, x_l\}$, where $x_i$ stands for the $i$-th word in sequence $x$, and let $y = \{y_1, \ldots, y_l\}$ be a predicted sequence of labels for $x$. We define its score as:
$$s(x, y) = \sum_{i=0}^{l} T_{y_i, y_{i+1}} + \sum_{i=1}^{l} P_{i, y_i},$$
where $T_{y_i, y_{i+1}}$ represents the transition score from $y_i$ to $y_{i+1}$, and $P_{i, y_i}$ is the score of the $y_i$-th tag of the $i$-th word from the BiLSTM. A softmax over all possible tag sequences of the sentence generates a probability for the sequence $y$:
$$p(y \mid x) = \frac{e^{s(x, y)}}{\sum_{\tilde{y} \in Y_x} e^{s(x, \tilde{y})}}.$$
During training, we maximize the log-likelihood of the correct NE tag sequence. While decoding, we predict the optimal sequence that achieves the maximum score:
$$y^* = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y}).$$
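Viterbi decoding over the CRF scores can be sketched as follows; this is a minimal numpy version with our own variable names, ignoring the start/stop transitions that full implementations usually add.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Highest-scoring tag sequence under CRF-style scores.

    emissions: (l, n_tags) per-token scores P from the BiLSTM;
    transitions: (n_tags, n_tags) scores T[y, y'] for moving from tag y to y'.
    """
    l, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((l, n), dtype=int)
    for i in range(1, l):
        # total[y, y'] = best score ending in y at i-1, then y -> y' at i
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow backpointers from the best final tag
    path = [int(score.argmax())]
    for i in range(l - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

The dynamic program runs in O(l · n²) time, versus the exponential cost of scoring every tag sequence.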

Instance Selector
The instance selection is a reinforcement learning process, where the instance selector acts as the agent and interacts with the environment (the instances) and the NE tagger. Given all the instances, the agent takes an action to decide which instance to select according to a policy network at each state, and receives a reward from the NE tagger when a batch of $N$ instances has been selected. State representation. We follow the work of (Yang et al., 2018b) and represent the state $s_t$ as the concatenation of the serialized vector representation of the current instance from the BiLSTM and the label scores from the MLP layer.
Policy network. The agent takes an action $a_t \in \{0, 1\}$ to indicate whether the instance selector will select the $t$-th instance. We adopt a logistic function as the policy function:
$$\pi(a_t \mid s_t) = a_t \, \sigma(W s_t + b) + (1 - a_t)\big(1 - \sigma(W s_t + b)\big), \quad (16)$$
where $W$ and $b$ are the model parameters, and $\sigma(\cdot)$ stands for the logistic function.
Reward. The reward function indicates the ability of the NE tagger to predict the labels of the selected instances, and only generates a reward when all the actions for the given $N$ instances have been completed:
$$r = \frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i),$$
where $N$ is the number of instances in one batch and $y_i$ is the label sequence of the $i$-th selected instance.
Training During the training phase, we optimize the policy network to maximize the reward of the selected instances. The parameters are updated by policy gradient:
$$\Theta \leftarrow \Theta + \alpha \, r \sum_{t=1}^{N} \nabla_\Theta \log \pi(a_t \mid s_t; \Theta),$$
where $\alpha$ is the learning rate. We train the NE tagger and the instance selector iteratively. In each round, the instance selector first selects instances from the training data; the positive instances are then used to train the NE tagger, and the tagger returns a reward to the selector to optimize the policy function. Different from the work of (Yang et al., 2018b), we re-label the negative instances with the NE tagger after each round and merge them with the positive instances for the next selection.
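The logistic policy and its REINFORCE-style update can be sketched together. This is a toy numpy version under the assumption of a single scalar batch reward; the gradient of $\log \pi$ for a logistic policy reduces to $(a - \pi(1 \mid s)) \, s$.

```python
import numpy as np

def select_prob(state, W, b):
    """pi(a = 1 | s): probability the selector keeps an instance (Eq. 16)."""
    return 1.0 / (1.0 + np.exp(-(W @ state + b)))

def policy_gradient_step(states, actions, reward, W, b, alpha=0.1):
    """One REINFORCE update on the logistic policy after a batch reward.

    states: list of (d,) state vectors; actions: list of 0/1 decisions;
    reward: scalar batch reward from the NE tagger.
    """
    for s, a in zip(states, actions):
        p = select_prob(s, W, b)
        grad_w = (a - p) * s  # d log pi / dW for a Bernoulli(logistic) policy
        grad_b = (a - p)      # d log pi / db
        W = W + alpha * reward * grad_w
        b = b + alpha * reward * grad_b
    return W, b
```

With a positive reward, the update pushes the policy toward repeating the actions it took; a negative reward pushes it away, which is the usual REINFORCE behavior.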

Experiments
We conduct experiments on two standard NER datasets consisting of news articles: the CoNLL 2003 English dataset and the CoNLL 2002 Spanish dataset. These datasets contain four entity types: LOC (location), MISC (miscellaneous), ORG (organization), and PER (person). We adopt the standard data splits and use the micro-averaged $F_1$ score as the evaluation metric.

Setup
Pre-trained Word Embeddings. For the CoNLL 2003 dataset, we use the pre-trained 50D SENNA embeddings released by (Collobert et al., 2011) and 100D GloVe (Pennington et al., 2014) embeddings for clustering and training, respectively. For the CoNLL 2002 Spanish dataset, we train 64D GloVe embeddings with a minimum frequency of occurrence of 5 and a window size of 5.
Parameters and Model Training. For the Gaussian-HMM, we initialize the cluster embedding $v \in \mathbb{R}^{2 \times 3}$ as $[[1, 0, 0], [0, 0.5, 0.5]]$ (we fix the 'O', 'B', 'I' tags as the first, second, and last class), which means that if the cluster tag of a word is 0, we initialize the word with all the probability mass on the 'O' tag; otherwise the probability is split evenly between the 'B' and 'I' tags. We fine-tune this embedding during training. For DAGMM, the hidden dimensions of the compression network and the estimation network are [75, 15] and 10, respectively. For the NE tagger, we follow the work of (Yang and Zhang, 2018) and use the default experimental settings. We conduct optimization with stochastic gradient descent; the learning rate is initially set to 0.015 and shrinks by 5% after each epoch. The batch size and dropout are set to 10 and 0.5, respectively.
Baselines. (Jie et al., 2019) propose an approach to tackle the incomplete annotation problem. This work introduces a q distribution to model missing labels instead of the traditional uniform distribution over all possible complete label sequences, and uses k-fold cross-validation for estimating q. They report the result of keeping 50% of all the training data and removing the annotations of the remaining entities together with the O labels for non-NEs. (Liu et al., 2019) propose a Knowledge-Augmented Language Model (KALM), which recognizes NEs while training language models. Given type-specific entity vocabularies and the general vocabulary, KALM computes the entity probability of the next word according to its context.
This work extracts 11,123 vocabulary entries from WikiText-2 as the knowledge base. WikiText-2 is a standard language modeling dataset and covers 92.80% of the entities in the CoNLL 2003 dataset. AutoNER is a distant supervision NE recognition model using a domain-specific dictionary. This work designs a Tie-or-Break tagging scheme that focuses on the ties between adjacent tokens. Accordingly, AutoNER is designed to distinguish Break from Tie while skipping Unknown positions. The authors report their evaluation results on datasets from a specific domain, and their method needs the necessary support of an NE lexicon. For better comparison, we use the lexicon from the training data, the SENNA lexicon presented by (Collobert et al., 2011), and our handcrafted lexicon as the domain-specific dictionary to re-implement their work on the CoNLL 2003 English dataset; the size of each category in each lexicon is shown in Table 3. Due to resource constraints, we only extract the lexicon from the training data without labeling a larger dictionary for Spanish. Supervised benchmarks are presented to show the gap between supervised models and our unsupervised model, which uses no annotation data or external resources; these include LSTM-CRF (Lample et al., 2016).

Results and Comparisons
We present $F_1$, precision, and recall scores in Table 1 and Table 2. All the models compared in Table 1 besides ours need extra resources to some extent, such as partially annotated training data or an NE dictionary, while our model achieves comparable results without using any of the resources mentioned above. We compare the prediction results for each entity type with (Liu et al., 2019) in Table 2, which shows that our model performs well on the LOC, ORG, and PER types. These NEs have specific meanings and are more similar in position and length in the corpus, so their word embeddings can better capture semantic and syntactic regularities and thus better represent the words, while MISC includes various entity types, which may bring significant confusion to learning type patterns. (Liu et al., 2019) can better regularize the type information from NE dictionaries and pre-trained type information.
Though AutoNER achieves better results when using the gold NE dictionary for English, it performs poorly with SENNA and our manually annotated dictionary. Especially when the gold NE dictionary is used for training on the Spanish dataset, the result is particularly unsatisfactory. According to our statistics, over half of the MISC NEs in the CoNLL 2002 Spanish training data are labeled as other types in the same dataset, while the ratio is 28% in the CoNLL 2003 English dataset; thus the results differ a lot between the two datasets. Our models achieve much better performance than those of AutoNER, more than doubling its $F_1$ scores with general NE dictionaries (SENNA and the human-labeled Wikipedia dictionary). Besides, our unsupervised NE recognition method is shown to be more general and gives more stable performance than the distant supervision model AutoNER, which relies heavily on the quality of the supporting dictionary and the domain relevance of the dictionary to the corpus.
We acknowledge that there still exists a gap between our unsupervised NE recognition model and the state-of-the-art supervised models (Lample et al., 2016; Jie et al., 2019), but the applicability of unsupervised models and their robustness to resource dependence are unreachable by supervised models. Table 3 lists the results of entity span detection. Our Gaussian-HMM absorbs informative clues from clustering and greatly improves the results of entity span detection. For the English dataset, we apply SENNA embeddings, which are trained on English Wikipedia and the Reuters RCV1 corpus; the clustering result thus becomes better, leading to a better result for the Gaussian-HMM. For the Spanish dataset, the embeddings are trained on the Wikipedia corpus only, which has little connection with the CoNLL 2002 dataset, so the result is slightly lower. We have also tried language models such as ELMo and BERT as encoders for the Gaussian-HMM, but their sparse characteristics in high-dimensional space are not conducive to Gaussian modeling. Unsupervised models have fewer parameters and a simpler training phase, so there is no guarantee that the language model will retain its key properties when reduced to low dimensions. Overall, unsupervised modeling based on word embeddings may be more general and robust than dictionary-based and corpus-based modeling.

Related work
Deep neural network models have freed people from handcrafting features. LSTM-CRF (Lample et al., 2016; Ma and Hovy, 2016) is the dominant state-of-the-art model for NE recognition. In order to reduce the requirements on the training corpus, distantly supervised models (Yang et al., 2018b; Ren et al., 2015; Fries et al., 2017) have been applied to NE recognition. Recently, (Liu et al., 2019) proposed a Knowledge-Augmented Language Model, which trains language models and at the same time computes the probability of the next word being of different entity types according to the context, given type-specific entity and general vocabularies. Unlike these existing approaches, our study focuses on unsupervised NE recognition without any extra resources.
Noisy data is another important factor affecting neural network models, and reinforcement learning has been applied to many related tasks: (Feng et al., 2018) use reinforcement learning for relation classification from noisy data, and (Yang et al., 2018b) show how to apply reinforcement learning in NE recognition systems by using instance selectors, which can tell high-quality training sentences from noisy data. Their work inspires us to use reinforcement learning after obtaining coarse annotated data from the Gaussian-HMM.

Conclusion
This paper presents an NE recognition model that uses only pre-trained word embeddings and achieves remarkable results on the CoNLL 2003 English and CoNLL 2002 Spanish benchmark datasets. The proposed approach yields, to the best of our knowledge, the first fully unsupervised NE recognition work on these two benchmark datasets that uses no annotation data or extra knowledge base.