Leveraging Linguistic Structures for Named Entity Recognition with Bidirectional Recursive Neural Networks

In this paper, we utilize the linguistic structures of texts to improve named entity recognition with BRNN-CNN, a special bidirectional recursive network attached with a convolutional network. Motivated by the observation that named entities are highly related to linguistic constituents, we propose a constituent-based BRNN-CNN for named entity recognition. In contrast to classical sequential labeling methods, the system first identifies which text chunks are possible named entities by whether they are linguistic constituents. Then it classifies these chunks with a constituency tree structure by recursively propagating syntactic and semantic information to each constituent node. This method surpasses the current state-of-the-art on OntoNotes 5.0 with automatically generated parses.

Introduction
Named Entity Recognition (NER) can be seen as a combined task of locating named entity chunks in texts and classifying which named entity category a chunk falls into. Traditional approaches label each token in a text as part of a named entity chunk, e.g. "person begin", and achieve high performance on several benchmark datasets (Ratinov and Roth, 2009; Passos et al., 2014; Chiu and Nichols, 2016).
Being formulated as a sequential labeling problem, NER systems can be naturally implemented by recurrent neural networks. These networks process one token at a time, taking, for each token, the hidden features of its previous token as well as its raw features to compute its own hidden features. Then they classify each token by these hidden features. With both forward and backward directions, networks learn how to propagate the information of a token sequence to each token. Chiu and Nichols (2016) utilize a variation of recurrent networks, a bidirectional LSTM, attached with a CNN, which learns character-level features instead of handcrafting them. They accomplish state-of-the-art results on both the CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 (Hovy et al., 2006; Pradhan et al., 2013) datasets.
Classical sequential labeling approaches take little information about the phrase structures of sentences. However, according to our analysis, most named entity chunks are actually linguistic constituents, e.g. noun phrases. This motivates us to focus on a constituent-based approach for NER where the NER problem is transformed into a named entity classification task on every node of a constituency structure.
To classify constituents and take into account their structures, we propose BRNN-CNN, a special bidirectional recursive neural network attached with a convolutional network. For each sentence, a constituency parse where every node represents a meaningful chunk of the sentence, i.e. a constituent, is first generated. Then BRNN-CNN recursively computes hidden state features of every node and classifies each node by these hidden features. To capture structural linguistic information, bidirectional passes are applied so that each constituent sees what it is composed of as well as what is containing it, both in a near-to-far fashion.
Our main contribution is the introduction of a novel constituent-based BRNN-CNN for named entity recognition, which successfully utilizes the linguistic structures of texts by recursive neural networks. We show that it achieves better scores than the current state-of-the-art on OntoNotes 5.0, where good parses can be automatically generated. Additionally, we analyze the effects of only considering constituents and the effects of constituency parses.

Related Work
Collobert et al. (2011) achieved near state-of-the-art performance on CoNLL-2003 NER with an end-to-end neural network which had minimal feature engineering and external data. Chiu and Nichols (2016) achieved the current state-of-the-art on both CoNLL-2003 and OntoNotes 5.0 NER with a sequential bidirectional LSTM-CNN. They also did extensive studies of additional features such as character type, capitalization, and SENNA and DBpedia lexicons.
Finkel and Manning (2009) explored training a parser for an NER-suffixed grammar, jointly tackling parsing and NER. They achieved competitive results on OntoNotes with a CRF-CFG parser.
Recursive neural networks have been successfully applied to parsing and to sentiment analysis on the Stanford sentiment treebank (Socher et al., 2010, 2013a; Tai et al., 2015). Their recursive networks, such as RNTN and Tree-LSTM, do sentiment combinations on phrase structures in a bottom-up fashion, showing the potential of such models in computing semantic compositions.

Method
For each input sentence, our constituent-based BRNN-CNN first extracts features from its constituency parse, then recursively classifies each constituent, and finally resolves conflicting predictions.

Preparing Constituency Structures
For a sentence and its associated constituency parse, our system first sets three features for each node: a POS tag, a word, and a head. While constituency tags and words are readily available, semantic head words are determined by a rule-based head finder (Collins, 1999). Additionally, a fourth feature vector is added to each node to utilize lexicon knowledge. This 3-bit vector records whether the constituent of a node matches some phrase in each of the three SENNA (Collobert et al., 2011) lexicons of persons, organizations, and locations.
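As an illustration, the lexicon feature amounts to a simple membership test per lexicon. The lists below are toy placeholders, not the actual SENNA lexicons, and exact-match lookup is an assumption about how matching is done:

```python
# Toy placeholder lexicons; the real system uses the SENNA lists of
# persons, organizations, and locations.
PER = {"edward kennedy", "john smith"}
ORG = {"acme corp", "united nations"}
LOC = {"new york", "taipei"}

def lexicon_vector(constituent_text):
    """Return the 3-bit [person, organization, location] match vector."""
    text = constituent_text.lower()
    return [int(text in lexicon) for lexicon in (PER, ORG, LOC)]

print(lexicon_vector("Edward Kennedy"))  # [1, 0, 0]
```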
The system then tries to generate more plausible constituents while preserving linguistic structures by applying a binarization process which groups excessive child nodes around the head children. The heuristic is that a head constituent is usually modified by its siblings in a near-to-far fashion. Algorithm 1 shows the recursive procedure called on the root node of a parse. Figure 1 shows the application of the algorithm to the parse of senator Edward Kennedy: following the heuristic, Edward modifies the head node Kennedy before senator does, so the binarization process successfully adds a new node Edward Kennedy that corresponds to a person name.
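Algorithm 1 is not reproduced in this copy; the following is a minimal sketch of such a head-centered binarization, assuming each node records the index of its head child and that, when siblings exist on both sides of the head, the right one is merged first (an assumption):

```python
class Node:
    def __init__(self, label, children=None, head=0):
        self.label = label
        self.children = children or []
        self.head = head  # index of the head child, from the head finder

def binarize(node):
    """Group excessive children around the head child, nearest first.

    Hypothetical reconstruction of the paper's heuristic: siblings are
    merged with the head one at a time, closer ones first, so that the
    intermediate nodes correspond to plausible constituents.
    """
    for child in node.children:
        binarize(child)
    while len(node.children) > 2:
        i = node.head
        # merge the head child with its nearest sibling; when siblings
        # exist on both sides, this sketch takes the right one first
        j = i + 1 if i + 1 < len(node.children) else i - 1
        lo, hi = min(i, j), max(i, j)
        merged = Node(node.label + "'", node.children[lo:hi + 1], head=i - lo)
        node.children[lo:hi + 1] = [merged]
        node.head = lo
    return node

def leaves(node):
    return [node.label] if not node.children else \
        [w for c in node.children for w in leaves(c)]

tree = Node("NP", [Node("senator"), Node("Edward"), Node("Kennedy")], head=2)
binarize(tree)
print(leaves(tree.children[1]))  # the new node spans ['Edward', 'Kennedy']
```

On the running example, the head Kennedy is first merged with its nearest sibling Edward, yielding the new Edward Kennedy node, and only then grouped with senator.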

Computing Word Embeddings
For each word, our network retrieves one embedding from a trainable lookup table initialized by GloVe (Pennington et al., 2014). However, to capture the morphological information of a word and help deal with unseen words, the network computes another character-level embedding. Inspired by Kim et al. (2016), the network passes one-hot character vectors through a series of convolutional and highway layers to generate this embedding. The two embeddings are concatenated as the final embedding of a word.
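A toy sketch of such a character-level embedding, with randomly initialized parameters and far smaller dimensions than a real model (the actual filter widths, filter counts, and highway depth are not reproduced here):

```python
import math
import random

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
D, F, W = len(ALPHABET), 8, 3  # char dim, filter count, window width (toy sizes)

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

conv = [rand_mat(W, D) for _ in range(F)]   # one W-by-D kernel per filter
hw_h, hw_hb = rand_mat(F, F), [0.0] * F     # highway transform weights/bias
hw_t, hw_tb = rand_mat(F, F), [-1.0] * F    # highway gate (negative bias favors carry)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def char_embedding(word):
    """Convolve over one-hot characters, max-pool over time, apply a highway layer."""
    onehots = [[1.0 if ALPHABET[d] == ch else 0.0 for d in range(D)]
               for ch in word.lower() if ch in ALPHABET]
    pooled = []
    for f in range(F):
        scores = [sum(conv[f][k][d] * onehots[s + k][d]
                      for k in range(W) for d in range(D))
                  for s in range(len(onehots) - W + 1)]
        pooled.append(math.tanh(max(scores)) if scores else 0.0)
    # highway layer: y = t * ReLU(Wh x + bh) + (1 - t) * x
    out = []
    for i in range(F):
        h = max(0.0, sum(hw_h[i][j] * pooled[j] for j in range(F)) + hw_hb[i])
        t = sigmoid(sum(hw_t[i][j] * pooled[j] for j in range(F)) + hw_tb[i])
        out.append(t * h + (1.0 - t) * pooled[i])
    return out

emb = char_embedding("kennedy")  # an 8-dimensional character-level embedding
```

In the real model, the parameters would be trained jointly with the rest of the network, and the result would be concatenated with the GloVe-initialized word embedding.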

Computing Hidden Features
Given a constituency parse tree, where every node represents a constituent, our network recursively computes two hidden state features for every node.
First, for each node i with left sibling l and right sibling r, the raw feature vector I_i is formed by concatenating the one-hot POS vectors of i, l, and r, the head embeddings of i, l, and r, the word embedding of i, the lexicon vector of i, and the mean of the word embeddings in the sentence. Then, with the set of child nodes C and the parent node p, the hidden feature vectors H_bot,i and H_top,i are computed by two hidden layers:

H_bot,i = ReLU(W_bot [I_i ; Σ_{c∈C} H_bot,c] + b_bot)    (1)
H_top,i = ReLU(W_top [I_i ; H_top,p] + b_top)            (2)

where the Ws are weight matrices, the bs are bias vectors, [x ; y] denotes concatenation, and ReLU(x) = max(0, x). In cases when some needed neighboring nodes do not exist, or when i is a nonterminal and does not have a word, zero vectors are used as the missing parts of the raw or hidden features. Figure 2 shows the application of the equations to the binarized tree in Figure 1. The computations are done recursively in two directions. The bottom-up direction computes the semantic composition of the subtree of each node, and the top-down counterpart propagates to that node the linguistic structures which contain the subtree. Together, the hidden features of a constituent capture its structural linguistic information.
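The two recursive passes can be sketched as follows. The raw feature vectors are toy stand-ins for I_i, the dimensions are arbitrary, and summing the children's bottom-up states is an assumption about how they are aggregated:

```python
import random

random.seed(1)
RAW, HID = 5, 4  # toy raw-feature and hidden sizes

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_bot, b_bot = rand_mat(HID, RAW + HID), [0.0] * HID
W_top, b_top = rand_mat(HID, RAW + HID), [0.0] * HID

def relu_layer(W, b, x):
    return [max(0.0, sum(W[i][j] * x[j] for j in range(len(x))) + b[i])
            for i in range(len(b))]

class Node:
    def __init__(self, x, children=()):
        self.x = x                    # raw feature vector I_i (toy stand-in)
        self.children = list(children)
        self.h_bot = self.h_top = None

def bottom_up(node):
    # children first; their hidden states are summed (an aggregation assumption)
    summed = [0.0] * HID
    for c in node.children:
        bottom_up(c)
        summed = [a + b for a, b in zip(summed, c.h_bot)]
    node.h_bot = relu_layer(W_bot, b_bot, node.x + summed)

def top_down(node, parent_h=None):
    # the root sees zero vectors in place of a parent state
    node.h_top = relu_layer(W_top, b_top, node.x + (parent_h or [0.0] * HID))
    for c in node.children:
        top_down(c, node.h_top)

root = Node([0.1] * RAW, [Node([0.2] * RAW), Node([0.3] * RAW)])
bottom_up(root)
top_down(root)
```

After both passes, every node carries a bottom-up state summarizing its subtree and a top-down state summarizing its context.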
In addition, each hidden layer can be extended to a deep hidden network. For example, a 2-layer top-down hidden network is given by

H_tα,i = ReLU(W_tα [I_i ; H_tβ,p] + b_tα)
H_tβ,i = ReLU(W_tβ H_tα,i + b_tβ)

where tα represents the first top-down hidden layer and tβ represents the second. Our best model is tuned to have 3 layers for both directions.

Forming Consistent Predictions
Given hidden features for every node, our network computes a probability distribution over named entity classes plus a special non-entity class by an output layer. For each node i with left sibling l and right sibling r, the probability distribution O_i is computed by

O_i = softmax(W_o [σ(H_i) ; σ(H_l) ; σ(H_r)] + b_o)    (3)

where H_x = H_bot,x + H_top,x for x ∈ {i, l, r}, and σ(x) = (1 + e^{-x})^{-1}. If a sibling does not exist, zero vectors are used as its hidden states. Should deep hidden layers be deployed, the last hidden layer is used. Finally, the system makes predictions for a sentence by collecting the constituents whose most probable classes are named entity classes. However, nodes whose ancestors are already predicted as named entities are ignored to prevent predicting overlapping named entities.
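The overlap-free collection step can be sketched as a top-down traversal that stops adding predictions below any node already predicted as an entity. The node fields and class labels here are hypothetical:

```python
class Node:
    def __init__(self, span, pred, children=()):
        self.span = span              # (start, end) token offsets
        self.pred = pred              # most probable class; "O" = non-entity
        self.children = list(children)

def collect_predictions(node, inside_entity=False, out=None):
    """Collect entity constituents top-down; once a node is predicted as an
    entity, its descendants are ignored, so predictions never overlap."""
    if out is None:
        out = []
    if not inside_entity and node.pred != "O":
        out.append((node.span, node.pred))
        inside_entity = True
    for child in node.children:
        collect_predictions(child, inside_entity, out)
    return out

tree = Node((0, 3), "O", [
    Node((0, 1), "O"),
    Node((1, 3), "PERSON", [Node((1, 2), "PERSON"), Node((2, 3), "O")]),
])
print(collect_predictions(tree))  # [((1, 3), 'PERSON')]
```

Here the inner (1, 2) PERSON prediction is suppressed because its ancestor spanning (1, 3) is already predicted as an entity.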

Training and Tuning
To train the model, we minimize the cross entropy loss of the softmax class probabilities in Equation 3 by the Adam optimizer (Kingma and Ba, 2014).
Other details such as hyperparameters are documented in the supplemental materials as well as the public repository.

OntoNotes 5.0 NER
OntoNotes 5.0 annotates 18 types of named entities for diverse sources of texts. Like other previous work (Durrett and Klein, 2014; Chiu and Nichols, 2016), we use the format and the train-validate-test split provided by CoNLL-2012. In addition, both gold and auto parses are available. Table 3 shows the dataset statistics. The last column shows the percentages of named entities that correspond to constituents of auto parses before and after binarization. Table 1 and Table 2 compare our results with others on the whole dataset and on different sources respectively. The sample mean, standard deviation, and sample count of BRNN-auto and Chiu and Nichols' model are 87.10, 0.14, 3 and 86.41, 0.22, 10 respectively. By a one-tailed Welch's t-test, the former significantly surpasses the latter at the 99% confidence level (p-value 0.000489).
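The significance figures can be checked from the summary statistics alone. The sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom; obtaining the p-value additionally requires a t-distribution CDF, omitted here:

```python
import math

def welch(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# summary statistics of BRNN-auto vs. Chiu and Nichols' model
t, df = welch(87.10, 0.14, 3, 86.41, 0.22, 10)
print(round(t, 2), round(df, 1))  # 6.47 5.4
```

A one-tailed t value of about 6.47 at roughly 5.4 degrees of freedom is consistent with the reported p-value of 0.000489.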

Analysis of the Approach
The training and validation sets contain 1,236,227 tokens and 92,894 named entities, of which 90,371 correspond to some constituents of binarized auto parses. This supports our motivation: more than 97% of named entities are linguistic constituents, and 52,729 of them are noun phrases.
Essentially, the constituent-based approach filters out the other 3% of named entities that cross constituent boundaries (Figure 3), i.e. a 3% loss of recall. We dig into this problem by analyzing a sequential labeling recurrent network (the sixth model in Table 1). The simple model performs reasonably well, but its non-constituent predictions are mostly false positives. In fact, it slightly improves if all non-constituent predictions are removed in post-processing, i.e., the precision gain of focusing on constituents is more significant than the recall loss. This is one advantage of our system over other sequential models, which try to learn and predict non-constituent named entities but do not perform well.

Figure 3: Two sample named entities that cross different branches of syntax parses.
In addition, to analyze the effects of constituency structures, we test our models with different qualities of parses (gold vs. auto in Table 1). The significant F1 differences suggest that structural linguistic information is crucial and can be learned by our model.

Conclusion
We have demonstrated a novel constituent-based BRNN-CNN for named entity recognition which successfully utilizes constituency structures and surpasses the current state-of-the-art on OntoNotes 5.0 NER. Instead of propagating information along word order as normal recurrent networks do, the model recursively propagates structural linguistic information to every constituent. Experiments show that when a good parser is available, the approach is a good alternative to traditional sequential labeling token-based NER. We analyze named entities that cross constituent boundaries and find that a naïve sequential labeling model has difficulty predicting them without too many false positives. While avoiding them is one of the strengths of our model, generating more consistent parses to reduce this kind of named entity would be one possible direction for future research.