Leveraging Knowledge Bases in LSTMs for Improving Machine Reading

This paper focuses on how to take advantage of external knowledge bases (KBs) to improve recurrent neural networks for machine reading. Traditional methods that exploit knowledge from KBs encode knowledge as discrete indicator features. Not only do these features generalize poorly, but they require task-specific feature engineering to achieve good performance. We propose KBLSTM, a novel neural model that leverages continuous representations of KBs to enhance the learning of recurrent neural networks for machine reading. To effectively integrate background knowledge with information from the currently processed text, our model employs an attention mechanism with a sentinel to adaptively decide whether to attend to background knowledge and which information from KBs is useful. Experimental results show that our model achieves accuracies that surpass the previous state-of-the-art results for both entity extraction and event extraction on the widely used ACE2005 dataset.


Introduction
Recurrent neural networks (RNNs), a neural architecture that can operate over text sequentially, have shown great success in addressing a wide range of natural language processing problems, such as parsing (Dyer et al., 2015), named entity recognition (Lample et al., 2016), and semantic role labeling (Zhou and Xu, 2015). These neural networks are typically trained end-to-end with only text (a sequence of words) as input, so a great deal of background knowledge is disregarded.
The importance of background knowledge in natural language understanding has long been recognized (Minsky, 1988; Fillmore, 1976). Earlier NLP systems mostly exploited restricted linguistic knowledge such as manually encoded morphological and syntactic patterns. With advances in knowledge base construction, large amounts of semantic knowledge have become available, ranging from manually annotated semantic networks like WordNet to semi-automatically or automatically constructed knowledge graphs like DBPedia and NELL. While traditional approaches have exploited these knowledge bases (KBs) in NLP tasks (Ratinov and Roth, 2009; Rahman and Ng, 2011), they require a lot of task-specific engineering to achieve good performance.
One way to leverage KBs in recurrent neural networks is by augmenting the dense representations of the networks with the symbolic features derived from KBs. This is not ideal as the symbolic features have poor generalization ability. In addition, they can be highly sparse, e.g., using WordNet synsets can easily produce millions of indicator features, leading to high computational cost. Furthermore, the usefulness of knowledge features varies across contexts, as general KBs involve polysemy, e.g., "Clinton" can refer to a person or a town. Incorporating KBs irrespective of the textual context could mislead the machine reading process.
Can we train a recurrent neural network that learns to adaptively leverage knowledge from KBs to improve machine reading? In this paper, we propose KBLSTM, an extension to bidirectional Long Short-Term Memory neural networks (BiLSTMs) (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) that is capable of leveraging symbolic knowledge from KBs as it processes each word in the text. At each time step, the model retrieves KB concepts that are potentially related to the current word. Then, an attention mechanism is employed to dynamically model their semantic relevance to the reading context. Furthermore, we introduce a sentinel component in BiLSTMs that allows flexibility in deciding whether to attend to background knowledge or not. This is crucial because in some cases the text context should override the context-independent background knowledge available in general KBs.
In this work, we leverage two general, readily available knowledge bases: WordNet (WordNet, 2010) and NELL. WordNet is a manually created lexical database that organizes a large number of English words into sets of synonyms (i.e., synsets) and records conceptual relations (e.g., hypernym, part of) among them. NELL is an automatically constructed web-based knowledge base that stores beliefs about entities. It is organized based on an ontology of hundreds of semantic categories (e.g., person, fruit, sport) and relations (e.g., personPlaysInstrument). We learn distributed representations (i.e., embeddings) of WordNet and NELL concepts using knowledge graph embedding methods. We then integrate these learned embeddings with the state vectors of the BiLSTM network to enable knowledge-aware predictions.
We evaluate the proposed model on two core information extraction tasks: entity extraction and event extraction. For entity extraction, the model needs to recognize all mentions of entities such as person, organization, location, and other things from text. For event extraction, the model is required to identify event mentions or event triggers that express certain types of events, e.g., elections, attacks, and travels. Both tasks are challenging and often require the combination of background knowledge and the text context for accurate prediction. For example, in the sentence "Maigret left viewers in tears.", knowing that "Maigret" can refer to a TV show can greatly help disambiguate its meaning. However, knowledge bases may hurt performance if used blindly. For example, in the sentence "Santiago is charged with murder.", methods that rely heavily on KBs are likely to interpret "Santiago" as a location due to the popular use of Santiago as a city. Similarly for events, the same word can trigger different types of events; for example, "release" can be used to describe different events ranging from book publishing to parole. It is important for machine learning models to learn to decide which knowledge from KBs is relevant given the context.
Extensive experiments demonstrate that our KBLSTM models effectively leverage background knowledge from KBs in training BiLSTM networks for machine reading. They achieve significant improvement on both entity and event extraction compared to traditional feature-based methods and LSTM networks that disregard knowledge in KBs, resulting in new state-of-the-art results for entity extraction and event extraction on the widely used ACE2005 dataset.

Related Work
Essential to RNNs' success on natural language processing is the use of Long Short-Term Memory neural networks (LSTMs) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRUs) (Cho et al., 2014), as they are able to handle long-term dependencies by adaptively memorizing values for either long or short durations. Their bidirectional variants, BiLSTM (Graves et al., 2005) and BiGRU, further allow the incorporation of both past and future information. This ability has been shown to be generally helpful in various NLP tasks such as named entity recognition (Huang et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016), semantic role labeling (Zhou and Xu, 2015), and reading comprehension (Hermann et al., 2015; Chen et al., 2016). In this work, we also employ the BiLSTM architecture.
In parallel to the development for text processing, neural networks have been successfully used to learn distributed representations of structured knowledge from large KBs (Bordes et al., 2011; Socher et al., 2013; Yang et al., 2015; Guu et al., 2015). Embedding the symbolic representations into continuous space not only makes KBs easier to use in statistical learning approaches, but also offers strong generalization ability. Many attempts have been made to connect distributed representations of KBs with text in the context of knowledge base completion (Lao et al., 2011; Gardner et al., 2014; Toutanova et al., 2015), relation extraction (Chang et al., 2014; Riedel et al., 2013), and question answering (Miller et al., 2016). However, these approaches model text using shallow representations such as subject/relation/object triples or bag of words. More recently, Ahn et al. (2016) proposed a neural knowledge language model that leverages knowledge bases in RNN language models, which allows for better representations of words for language modeling. Unlike their work, we leverage knowledge bases in LSTMs and apply them to information extraction.
The architecture of our KBLSTM model draws on the development of attention mechanisms that are widely employed in tasks such as machine translation (Bahdanau et al., 2015) and image captioning. Attention allows the neural networks to dynamically attend to salient features of the input. With a similar motivation, we employ attention in KBLSTMs to allow for dynamic attention to the relevant knowledge given the text context. Our model is also closely related to a recent model of caption generation introduced by Lu et al. (2017), where a visual sentinel is introduced to allow the decoder to decide whether to attend to image information when generating the next word. In our model, we introduce a sentinel to control the tradeoff between background knowledge and information from the text.

Method
In this section, we present our KBLSTM model. We first describe several basic recurrent neural network frameworks for machine reading, including basic RNNs, LSTMs, and bidirectional LSTMs (Section 3.1). We then introduce our extension to bidirectional LSTMs that allows for the incorporation of KB information at each time step of reading (Section 3.2). The KB information is encoded using continuous representations (i.e., embeddings) which are learned using knowledge embedding methods (Section 3.3).

RNNs, LSTMs, and Bidirectional LSTMs
RNNs are a class of neural networks that take a sequence of inputs and compute a hidden state vector at each time step based on the current input and the entire history of inputs. The hidden state vector can be computed recursively using the following equation (Elman, 1990):
$$h_t = F(W h_{t-1} + U x_t)$$
where $x_t$ is the input at time step $t$, $h_t$ is the hidden state at time step $t$, $U$ and $W$ are weight matrices, and $F$ is a nonlinear function such as tanh or ReLU. Depending on the application, RNNs can produce outputs based on the hidden state of each time step or just the last time step.
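To make the recurrence concrete, the following is a minimal sketch in plain Python, assuming $F = \tanh$ and matrices stored as lists of rows. The helper names `matvec`, `rnn_step`, and `run_rnn` are illustrative and do not come from our implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(W, U, h_prev, x_t):
    """One recurrence step with F = tanh: h_t = tanh(W h_{t-1} + U x_t)."""
    recurrent = matvec(W, h_prev)
    projected = matvec(U, x_t)
    return [math.tanh(r + p) for r, p in zip(recurrent, projected)]


def run_rnn(W, U, inputs, h0):
    """Return the hidden state at every time step of the input sequence."""
    states, h = [], h0
    for x_t in inputs:
        h = rnn_step(W, U, h, x_t)
        states.append(h)
    return states
```

A task that labels every token would read all returned states, while a sentence-level classifier would read only the last one.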
A Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) is a variant of RNNs which was designed to better handle cases where the output at time $t$ depends on much earlier inputs. It has a memory cell and three gating units: an input gate that controls what information to add to the current memory, a forget gate that controls what information to remove from the previous memory, and an output gate that controls what information to output from the current memory. Each gate is implemented as a logistic function $\sigma$ that takes as input the previous hidden state and the current input, and outputs values between 0 and 1. Multiplication with these values controls the flow of information into or out of the memory. In equations, the updates at each time step $t$ are:
$$i_t = \sigma(W_i h_{t-1} + U_i x_t)$$
$$f_t = \sigma(W_f h_{t-1} + U_f x_t)$$
$$o_t = \sigma(W_o h_{t-1} + U_o x_t)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c h_{t-1} + U_c x_t)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate, $c_t$ is the memory cell, and $h_t$ is the hidden state. $\odot$ denotes element-wise multiplication.
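The gate updates can be sketched as follows. This is an illustrative plain-Python version, assuming the standard LSTM formulation without bias terms; the parameter names (`W_i`, `U_i`, and so on) and the dictionary layout are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function used by all three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, h_prev, x_t):
    """Compute W h_{t-1} + U x_t for matrices stored as lists of rows."""
    return [
        sum(w * h for w, h in zip(w_row, h_prev))
        + sum(u * x for u, x in zip(u_row, x_t))
        for w_row, u_row in zip(W, U)
    ]


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM update; returns the new hidden state and memory cell."""
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["U_i"], h_prev, x_t)]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["U_f"], h_prev, x_t)]
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["U_o"], h_prev, x_t)]
    # Candidate memory content proposed from the previous state and the input.
    g_t = [math.tanh(z) for z in affine(params["W_c"], params["U_c"], h_prev, x_t)]
    # Forget part of the old memory, then add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights set to zero, every gate evaluates to 0.5, so the cell simply halves its previous memory; this is a convenient sanity check for an implementation.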
Bidirectional LSTMs (BiLSTMs) (Graves et al., 2005) are essentially a combination of two LSTMs in two directions: one operates in the forward direction and the other operates in the backward direction. This leads to two hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ at time step $t$, which can be viewed as a summary of the past and the future respectively. Their concatenation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ provides a whole summary of the information about the input around time step $t$. This property is attractive in NLP tasks, since information from both previous and future words can be helpful for interpreting the meaning of the current word.
Figure 1: Architecture of the KBLSTM model. At each time step $t$, the knowledge module retrieves a set of candidate KB concepts $V(x_t)$ that are related to the current input $x_t$, and then computes a knowledge state vector $m_t$ that integrates the embeddings of the candidate KB concepts $v_1, v_2, \ldots, v_L$ and the current context vector $s_t$. See Section 3.2 for details.
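The bidirectional combination described in this subsection can be sketched independently of the cell type. In the example below, `forward_step` and `backward_step` are hypothetical callables that map a previous state and an input to a new state (for instance, wrappers around an LSTM update); combining the two directions by concatenation is an assumption of this sketch.

```python
def bidirectional_states(inputs, forward_step, backward_step, h0_fwd, h0_bwd):
    """Run two recurrent passes and concatenate their states per time step."""
    forward, h = [], h0_fwd
    for x_t in inputs:
        h = forward_step(h, x_t)
        forward.append(h)

    backward, h = [], h0_bwd
    for x_t in reversed(inputs):
        h = backward_step(h, x_t)
        backward.append(h)
    backward.reverse()  # realign so backward[t] summarizes inputs t..end

    return [f + b for f, b in zip(forward, backward)]
```

Each returned vector has the forward summary of the words up to position t followed by the backward summary of the words from position t onward.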

Knowledge-aware Bidirectional LSTMs
Our model (referred to as KBLSTM) extends BiLSTMs to allow flexibility in incorporating symbolic knowledge from KBs. Instead of encoding knowledge as discrete features, we encode it using continuous representations. Concretely, we learn embeddings of concepts in KBs using a knowledge graph embedding method (described in Section 3.3). The KBLSTM model then retrieves the embeddings of candidate concepts that are related to the current word being processed and integrates them into its state vector to make knowledge-aware predictions. Figure 1 depicts the architecture of our model.
The core of our model is the knowledge module, which is responsible for transferring background knowledge into the BiLSTMs. The knowledge at time step $t$ consists of candidate KB concepts $V(x_t)$ for input $x_t$. (We describe how to obtain the candidate KB concepts from NELL and WordNet in Section 3.3.) Each candidate KB concept $i \in V(x_t)$ is associated with a vector embedding $v_i$. We compute an attention weight $\alpha_{ti}$ for concept vector $v_i$ via a bilinear operator, which reflects how relevant or important concept $i$ is to the current reading context $h_t$:
$$\alpha_{ti} \propto \exp(v_i^{\top} W_v h_t)$$
where $W_v$ is a parameter matrix to be learned.
Note that the candidate concepts can be misleading in some cases. For example, a KB may store the fact that "Santiago" is a city but miss the fact that it can also refer to a person. Incorporating such knowledge in the sentence "Santiago is charged with murder." could be misleading. To address this issue, we introduce a knowledge sentinel that records the information of the current context, and we use a mixture model to allow for a better tradeoff between the impact of background knowledge and information from the context. Specifically, we compute a sentinel vector $s_t$ as:
$$b_t = \sigma(W_b h_{t-1} + U_b x_t), \qquad s_t = b_t \odot \tanh(c_t)$$
where $W_b$ and $U_b$ are weight parameters to be learned. The weight on the local context is computed as:
$$\beta_t \propto \exp(s_t^{\top} W_s h_t)$$
where $W_s$ is a parameter matrix to be learned. The mixture model is defined as:
$$m_t = \sum_{i \in V(x_t)} \alpha_{ti} v_i + \beta_t s_t$$
where $\sum_{i \in V(x_t)} \alpha_{ti} + \beta_t = 1$. $m_t$ can be viewed as a knowledge state vector that encodes external KB information with respect to the input at time $t$. We combine it with the state vector $h_t$ of BiLSTMs to obtain a knowledge-aware state vector $\hat{h}_t$:
$$\hat{h}_t = h_t + m_t \quad (6)$$
If $V(x_t) = \emptyset$, we set $m_t = 0$. $\hat{h}_t$ can be used for predictions in the same way as the original state vector $h_t$ (see Section 4 for details).
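The knowledge module can be sketched as a single softmax over the candidate concepts plus the sentinel. The code below is a minimal illustration under several assumptions: the weights are obtained by normalizing bilinear scores $v_i^{\top} W_v h_t$ and $s_t^{\top} W_s h_t$ jointly so that they sum to one, concept embeddings, sentinel, and state vector share one dimension, and the knowledge-aware state is formed by adding $m_t$ to $h_t$. Function names are illustrative.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def knowledge_state(concept_vecs, s_t, h_t, W_v, W_s):
    """Return (alphas, beta, m_t) for one time step.

    alphas are the attention weights on the candidate concepts and beta is
    the weight on the sentinel; together they sum to 1 when candidates exist.
    """
    if not concept_vecs:
        # No candidate concepts were retrieved: the knowledge state is zero.
        return [], 0.0, [0.0] * len(s_t)

    scores = [dot(v, matvec(W_v, h_t)) for v in concept_vecs]
    scores.append(dot(s_t, matvec(W_s, h_t)))  # sentinel score goes last

    # Numerically stable softmax over concepts and sentinel together.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    alphas, beta = weights[:-1], weights[-1]

    m_t = [
        beta * s_t[k] + sum(a * v[k] for a, v in zip(alphas, concept_vecs))
        for k in range(len(s_t))
    ]
    return alphas, beta, m_t


def knowledge_aware_state(h_t, m_t):
    """Combine the BiLSTM state with the knowledge state (assumed additive)."""
    return [h + m for h, m in zip(h_t, m_t)]
```

A large value of `beta` means the model relies mostly on the local context, while large `alphas` indicate that specific KB concepts dominate the knowledge state.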

Embedding Knowledge Base Concepts
Now we describe how to learn embeddings of concepts in KBs. We consider two KBs: WordNet and NELL, which are both knowledge graphs that can be stored in the form of RDF triples. Each triple consists of a subject entity, a relation, and an object entity. Examples of triples in WordNet are (location, hypernym of, city) and (door, has part, lock), where both the subject and object entities are synsets in WordNet. Examples of triples in NELL are (New York, located in, United States) and (New York, is a, city), where the subject entity is a noun phrase that can refer to a real-world entity and the object entity can be either a noun phrase entity or a concept category.
In this work, we refer to the synsets in WordNet and the concept categories in NELL as KB concepts. They are the key components of the ontologies and provide generally useful information for language understanding. As our KBLSTM model reads through each word in a sentence, it retrieves knowledge from NELL by searching for entities with the current word and collecting the related concept categories as candidate concepts; and it retrieves knowledge from WordNet by treating the synsets of the current word as candidate concepts.
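The retrieval step can be illustrated with simple dictionary lookups. This is only a sketch: the two index dictionaries, the lowercasing of the query word, and the concept identifiers used in the usage note are hypothetical stand-ins for real NELL and WordNet lookups.

```python
def candidate_concepts(word, nell_categories, wordnet_synsets):
    """Collect candidate KB concepts for a word from two toy indexes.

    nell_categories maps a surface form to NELL concept categories;
    wordnet_synsets maps a surface form to WordNet synset identifiers.
    """
    key = word.lower()  # assumption: case-insensitive matching
    candidates = []
    for concept in nell_categories.get(key, []) + wordnet_synsets.get(key, []):
        if concept not in candidates:  # keep order, drop duplicates
            candidates.append(concept)
    return candidates
```

For example, with a toy index `{"clinton": ["nell:person", "nell:city"]}`, the call `candidate_concepts("Clinton", index, {})` returns both categories, and a word missing from both indexes yields an empty candidate set.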
We employ a knowledge graph embedding approach to learn the representations of the candidate concepts. Denoting a KB triple as $(e_1, r, e_2)$, we want to learn embeddings of the subject entity $e_1$, the object entity $e_2$, and the relation $r$, so that the relevance of the triple can be measured by a scoring function based on the embeddings. We employ the BILINEAR model described in Yang et al. (2015). It computes the score of a triple $(e_1, r, e_2)$ via a bilinear function: $S_{(e_1, r, e_2)} = v_{e_1}^{\top} M_r v_{e_2}$, where $v_{e_1}$ and $v_{e_2}$ are vector embeddings for $e_1$ and $e_2$ respectively, and $M_r$ is a relation-specific embedding matrix. We train the embeddings using the max-margin ranking objective:
$$\sum_{q = (e_1, r, e_2) \in T} \; \sum_{q' = (e_1, r, e_2') \in T'} \max\{0,\, 1 - S_q + S_{q'}\} \quad (7)$$
where $T$ denotes the set of triples in the KB and $T'$ denotes the "negative" triples that are not observed in the KB.
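The scoring function and the ranking loss for one positive triple can be sketched in a few lines. The function names are illustrative; the margin of 1 follows the objective above, and the negative scores are assumed to come from triples with a corrupted object entity.

```python
def bilinear_score(v_e1, M_r, v_e2):
    """Score a triple as v_e1^T M_r v_e2 (M_r is a list of rows)."""
    projected = [sum(m * x for m, x in zip(row, v_e2)) for row in M_r]
    return sum(a * b for a, b in zip(v_e1, projected))


def margin_ranking_loss(pos_score, neg_scores, margin=1.0):
    """Hinge loss summed over the negatives of one positive triple."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)
```

A negative triple contributes to the loss only when its score comes within the margin of the positive score, so well-separated negatives produce no gradient.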
For WordNet, we train the concept embeddings using previously released preprocessed data, which contains 151,442 triples with 40,943 synsets and 18 relations. We use the same data splits for training, development, and testing. During training, we use AdaGrad (Duchi et al., 2011) to optimize objective 7 with an initial learning rate of 0.05 and a mini-batch size of 100. At each gradient step, we sample 10 negative object entities with respect to the positive triple. Our implementation of the BILINEAR model achieves top-10 accuracy of 91% for predicting missing object entities on the WordNet test set, which is comparable with previous work (Yang et al., 2015). We also experimented with TransE and NTN (Socher et al., 2013), and found that they perform significantly worse than the BILINEAR method, especially on predicting the "is a" facts in NELL.
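Top-k accuracy for missing-object prediction can be computed as sketched below, assuming that each test triple comes with a score for every candidate object entity. The helper names and the dictionary-based score format are assumptions made for this illustration.

```python
def rank_candidates(scores):
    """Sort candidate entities by descending score (scores: entity -> float)."""
    return sorted(scores, key=scores.get, reverse=True)


def top_k_accuracy(ranked_lists, gold_entities, k=10):
    """Fraction of test triples whose gold object is ranked in the top k."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_entities) if gold in ranked[:k]
    )
    return hits / len(gold_entities)
```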
For NELL, we train the concept embeddings using a subset of the NELL data. We filter out noun phrases with annotation confidence less than 0.9 in order to remove erroneous labels introduced during the automatic construction of NELL (Wijaya, 2016). This results in 180,107 noun phrases and 258 concept categories in total. We randomly split the data into 80% for training, 10% for development, and 10% for testing. For each training example, we enumerate all the unobserved concept categories as negative labels. Instead of treating each entity as a unit, we represent it as the average of its constituent word vectors concatenated with its head word vector. The word vectors are initialized with pre-trained paraphrastic embeddings (Wieting et al., 2015) and fine-tuned during training. Using the same optimization parameters as before, the BILINEAR model achieves 88% top-1 accuracy for predicting concept categories of given noun phrases on the test set.
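The noun phrase representation described above (the average of the word vectors concatenated with the head word vector) can be sketched as follows. How the head word is identified is not specified here, so the sketch takes its position as an argument; `phrase_vector` is an illustrative name.

```python
def phrase_vector(word_vecs, head_index):
    """Average the word vectors of a phrase and append the head word vector."""
    n, dim = len(word_vecs), len(word_vecs[0])
    average = [sum(vec[k] for vec in word_vecs) / n for k in range(dim)]
    return average + list(word_vecs[head_index])
```

The result has twice the dimension of a single word vector, so the entity embedding in the bilinear score must be sized accordingly.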

Entity Extraction
We first apply our model to entity extraction, a task that is typically addressed by assigning each word/token BIO labels (Begin, Inside, and Outside) (Ratinov and Roth, 2009) indicating the token's position within an entity mention as well as its entity type.
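As an illustration of the BIO encoding, the sketch below converts a tag sequence into typed spans. It assumes tags written as `B-TYPE`, `I-TYPE`, and `O`, and it leniently starts a new chunk when an `I-` tag appears without a matching `B-` tag; both conventions are assumptions of this example.

```python
def bio_to_chunks(tags):
    """Convert BIO tags into (start, end_exclusive, type) spans."""
    chunks = []
    start, chunk_type = None, None
    # A trailing "O" flushes a chunk that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == chunk_type
        if not continues:
            if start is not None:
                chunks.append((start, i, chunk_type))
                start, chunk_type = None, None
            if tag != "O":
                start, chunk_type = i, tag[2:]
    return chunks
```

For the tag sequence `["B-PER", "I-PER", "O", "B-LOC"]`, the function returns one PER span covering the first two tokens and one LOC span covering the last token.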
To allow tagging over phrases instead of words, we address entity extraction in two steps. The first step detects mention chunks, and the second step assigns entity type labels to mention chunks (including an O type indicating other types). In the first step, we train a BiLSTM network with a conditional random field (CRF) objective (Huang et al., 2015) using gold-standard BIO labels regardless of entity types, and only predict each token's position within an entity mention. This produces a sequence of chunks for each sentence. In the second step, we train another supervised BiLSTM model to predict type labels for the previously extracted chunks. Each chunk is treated as a unit input to the BiLSTMs, and the input vector is computed by averaging the input vectors of individual words within the chunk and concatenating the result with the head word vector. The BiLSTMs can be trained with a softmax objective that minimizes the cross-entropy loss for each individual chunk. It computes the probability of the correct type for each chunk:
$$p(y_t \mid h_t) = \frac{\exp(w_{y_t}^{\top} h_t)}{\sum_{y'} \exp(w_{y'}^{\top} h_t)} \quad (8)$$
The BiLSTMs can also be trained with a CRF objective (referred to as BiLSTM-CRF) that minimizes the negative log-likelihood for the entire sequence. It computes the probability of the correct types for a sequence of chunks:
$$p(y \mid x) = \frac{\exp(g(x, y))}{\sum_{y'} \exp(g(x, y'))} \quad (9)$$
where $g(x, y) = \sum_{t=1}^{l} P_{t, y_t} + \sum_{t=0}^{l} A_{y_t, y_{t+1}}$, $P_{t, y_t} = w_{y_t}^{\top} h_t$ is the score of assigning the $t$-th chunk with tag $y_t$, and $A_{y_t, y_{t+1}}$ is the score of transitioning from tag $y_t$ to $y_{t+1}$. By replacing $h_t$ in Eq. 8 and Eq. 9 with the knowledge-aware state vector $\hat{h}_t$ (Eq. 6), we can compute the objective for KBLSTM and KBLSTM-CRF respectively.
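The CRF sequence probability can be sketched with the forward algorithm. In this illustration the emission scores `P[t][k]` are assumed to be precomputed from the state vectors, and the transitions from the start tag and to the end tag are stored in separate vectors `start` and `end` rather than as extra rows of the transition matrix `A`; this layout is a choice made for the example.

```python
import math


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(P, A, start, end, y):
    """g(x, y): emission scores plus transition scores along the tag path y."""
    score = start[y[0]] + end[y[-1]]
    score += sum(P[t][tag] for t, tag in enumerate(y))
    score += sum(A[y[t]][y[t + 1]] for t in range(len(y) - 1))
    return score


def log_partition(P, A, start, end):
    """Log of the sum of exp(g(x, y')) over all tag paths (forward algorithm)."""
    num_tags = len(A)
    alpha = [start[k] + P[0][k] for k in range(num_tags)]
    for t in range(1, len(P)):
        alpha = [
            logsumexp([alpha[j] + A[j][k] for j in range(num_tags)]) + P[t][k]
            for k in range(num_tags)
        ]
    return logsumexp([alpha[k] + end[k] for k in range(num_tags)])


def crf_log_prob(P, A, start, end, y):
    """Log-probability of tag path y; its negation is the training loss."""
    return sequence_score(P, A, start, end, y) - log_partition(P, A, start, end)
```

The forward recursion avoids enumerating all tag paths, which would grow exponentially with the number of chunks in a sentence.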

Implementation Details
We evaluate our models on the ACE2005 corpus (LDC, 2005) and the OntoNotes 5.0 corpus (Hovy et al., 2006) for entity extraction. Both datasets consist of text from a variety of sources such as newswire, broadcast conversations, and web text. For ACE2005, we use the same data splits and task settings as prior work; for OntoNotes 5.0, we follow Durrett and Klein (2014).
At each time step, our models take as input a word vector and a capitalization feature (Chiu and Nichols, 2016). We initialize the word vectors using pretrained paraphrastic embeddings (Wieting et al., 2015), as we find that they significantly outperform randomly initialized embeddings. The word embeddings are fine-tuned during training. For the KBLSTM models, we obtain the embeddings of KB concepts from NELL and WordNet as described in Section 3.3. These embeddings are kept fixed during training.
We implement all the models using Theano on a single GPU. We update the model parameters on every training example using Adam with default settings (Kingma and Ba, 2014) and apply dropout to the input layer of the BiLSTM with a rate of 0.5. The word embedding dimension is set to 300 and the hidden vector dimension is set to 100. We train models on ACE2005 for about 5 epochs and on OntoNotes 5.0 for about 10 epochs with early stopping based on development results. For each experiment, we report the average results over 10 random runs. We also apply the Wilcoxon rank sum test to compare our models with the baseline models.

Results
We compare our KBLSTM and KBLSTM-CRF models with the following baselines: BiLSTM, a vanilla BiLSTM network trained using the same input, and BiLSTM-Fea, a BiLSTM network that combines its hidden state vector with discrete KB features (i.e., indicators of candidate KB concepts) to produce the final state vector. We also include their variants BiLSTM-CRF and BiLSTM-Fea-CRF, which are trained using the CRF objective instead of the standard softmax objective.
We first report results on entity extraction with gold-standard boundaries for multi-word mentions. This allows us to focus on the performance of entity type prediction without considering mention boundary errors and the noise they introduce in retrieving candidate KB concepts. Table 1 shows the results. We find that the CRF objective generally outperforms the softmax objective. Our KBLSTM-CRF model significantly improves over its counterpart BiLSTM-Fea-CRF. This demonstrates the effectiveness of the KBLSTM architecture in leveraging KB information. Table 2 breaks down the results of the KBLSTM-CRF and the BiLSTM-Fea-CRF using different KB settings. We find that the KBLSTM-CRF outperforms the BiLSTM-Fea-CRF in all the settings and that incorporating both KBs leads to the best performance.
Next, we evaluate our models on entity extraction with predicted mention boundaries. We first train a BiLSTM-CRF to perform mention chunking using the same setting as described in Section 4.1.1. We then treat the predicted chunks as units for entity type labeling. Table 3 reports the full entity extraction results on the ACE2005 test set. We compare our models with state-of-the-art feature-based linear models, including Yang and Mitchell (2016), and the recently proposed sequence- and tree-structured LSTMs (Miwa and Bansal, 2016). Interestingly, we find that using BiLSTM-CRF without any KB information already gives strong performance compared to previous work. The KBLSTM-CRF model demonstrates the best performance among all the models and achieves new state-of-the-art performance on the ACE2005 dataset.
We also report the entity extraction results on the OntoNotes 5.0 test set in Table 4. We compare our models with the existing feature-based models of Ratinov and Roth (2009) and Durrett and Klein (2014), which both employ heavy feature engineering to bring in external knowledge. BiLSTM-CNN (Chiu and Nichols, 2016) employs a hybrid BiLSTM and convolutional neural network (CNN) architecture and incorporates rich lexicon features derived from SENNA and DBPedia. Our KBLSTM-CRF model shows competitive results compared to theirs. It also presents significant improvements compared to the BiLSTM and BiLSTM-Fea models. Note that the benefit of adding KB information is smaller on OntoNotes than on ACE2005. We believe that there are two main reasons. First, NELL has a lower coverage of entity mentions in OntoNotes than in ACE2005 (57% vs. 65%). Second, OntoNotes has a significantly larger amount of training data, which allows for more accurate models without much help from external resources.

Event Extraction
We now apply our model to the task of event extraction. Event extraction is concerned with detecting event triggers, i.e., words that express the occurrence of a pre-defined type of event. Event triggers are mostly verbs and eventive nouns but can occasionally be adjectives and other content words. Therefore, the task is typically addressed as a classification problem where the goal is to label each word in a sentence with an event type or an O type if it does not express any of the defined events. It is straightforward to apply the BiLSTM architecture to event extraction. Similarly to the models for entity extraction, we can train the BiLSTM network with both the softmax objective and the CRF objective.
We evaluate our models on the portion of the ACE2005 corpus that has event annotations. We use the same data split and experimental setting as in Li et al. (2013). The training procedure is the same as in Section 4.1.1, and we train all the models for about 5 epochs. For the KBLSTM models, we integrate the learned embeddings of WordNet synsets during training.
Figure 2 (a): The X-axis represents relevant NELL concepts for the entity mention "clinton". The Y-axis represents the concept weights and the knowledge sentinel weight.
Figure 2 (b): The X-axis represents relevant WordNet concepts for the event trigger "head". The Y-axis represents the concept weights and the knowledge sentinel weight.

Results
We compare our models with the prior state-of-the-art approaches for event extraction, including neural and non-neural ones: JOINTBEAM refers to the joint beam search approach with local and global features (Li et al., 2013); JOINTENTITYEVENT refers to the graphical model for joint entity and event extraction (Yang and Mitchell, 2016); DMCNN is the dynamic multi-pooling CNN model; and JRNN is an RNN model with memory introduced by Nguyen et al. (2016). The first block in Table 5 shows the results of the feature-based linear models (taken from Yang and Mitchell (2016)). The second block shows the previously reported results for the neural models. Note that both neural models make use of gold-standard entity annotations. The third block shows the results of our models. We can see that our KBLSTM models significantly outperform the BiLSTM and BiLSTM-Fea models, which again confirms their effectiveness in leveraging KB information. The KBLSTM-CRF model achieves the best performance and outperforms the previous state-of-the-art methods without having access to any gold-standard entities.

Model Analysis
In order to better understand our model, we visualize the learned attention weights α for KB concepts and the sentinel weight β that measures the tradeoff between knowledge and context. Figure 2a visualizes these weights for the entity mention "clinton". In the first sentence, "clinton" refers to a LOCATION, while in the second sentence, "clinton" refers to a PERSON. Our model learns to attend to different word senses for "clinton" (KB concepts associated with "clinton") in different sentences. Note that the weight on the knowledge sentinel is higher in the first sentence. This is because the local text alone is indicative of the entity type for "clinton": the word "in" is more likely to be followed by a location than a person. We find that BiLSTM-Fea-CRF models often make wrong predictions on examples like this due to their inflexibility in modeling knowledge relevance with respect to context. Figure 2b shows the learned weights for the event trigger word "head" in two sentences: one expresses a TRAVEL event while the other expresses a START-POSITION event. Again, we find that our model is capable of attending to relevant WordNet synsets and disambiguating event types more accurately.

Conclusion
In this paper, we introduce the KBLSTM network architecture as one approach to incorporating background KBs for improving recurrent neural networks for machine reading. This architecture employs an adaptive attention mechanism with a sentinel that allows for learning an appropriate tradeoff for blending knowledge from the KBs with information from the currently processed text, as well as selecting among relevant KB concepts for each word (e.g., to select the correct semantic categories for "clinton" as a town or person in Figure 2a). Experimental results show that our model achieves state-of-the-art performance on standard benchmarks for both entity extraction and event extraction. We see many additional opportunities to integrate background knowledge with training of neural network models for language processing. Though our model is evaluated on entity extraction and event extraction, it can be useful for other machine reading tasks. Our model can also be extended to integrate knowledge from a richer set of KBs in order to capture the diverse variety and depth of background knowledge required for accurate and deep language understanding.