On the Complementary Nature of Knowledge Graph Embedding, Fine-Grain Entity Types, and Language Modeling

We demonstrate the complementary nature of neural knowledge graph embedding, fine-grain entity type prediction, and neural language modeling. We show that a language model-inspired knowledge graph embedding approach yields both improved knowledge graph embeddings and fine-grain entity type representations. Our work also shows that jointly modeling structured knowledge tuples and language improves both.


Introduction
The surge in large knowledge graphs, e.g., Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and YAGO (Suchanek et al., 2007), has spurred knowledge graph-based applications. Properly making use of this structured knowledge is a prime challenge. Knowledge graph embedding (KGE; e.g., Socher et al., 2013) addresses this problem by representing the nodes (entities) and their edges (relations) in a continuous vector space. Learning these representations makes it possible to deduce new facts from, and identify dubious entries in, the knowledge base. It also improves relation extraction, knowledge base completion, and entity resolution (Nickel et al., 2011).
Entity typing can provide crucial constraints and information on the knowledge contained in a KG. While historically this has been modeled as explicitly structured knowledge, and recent work has modeled the contextual language in order to make in-context entity type classifications, we argue that language modeling techniques provide an effective approach for modeling both the explicit and implicit constraints found in both structured resources and free-form contextual language.
Meanwhile, while language modeling (LM) has historically been a core problem within natural language processing (Rosenfeld, 1994), recent deep learning advances have been very successful in convincing the community of the power and flexibility of language modeling (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019, i.a.).

Figure 1: Our joint learning framework learns the representation for the entity "Barack Obama" in the same embedding space as that of the given input contextual description, "Barack Obama gave a speech to Congress." Further, by learning the entity type '/person/politician', the model provides a better contextual understanding of the underlying entity.
Building off of insights and advances in knowledge graph embedding, entity typing, and language modeling, we identify and advocate for leveraging the complementary nature of knowledge graphs, entity typing, and language modeling. We introduce a comparatively simple framework that uses powerful, yet well-known, neural building blocks to (jointly) learn representations that simultaneously capture (1) explicit facts and information stored in a knowledge base, (2) explicit constraints on facts (exemplified by entity typing), and (3) implicit knowledge and constraints communicated via natural language and discourse. Figure 1 provides an overview of the joint learning framework proposed in this work: an entity ("Barack Obama") along with its relations is represented in a continuous vector space. The framework also learns the underlying type ("/person/politician") for the given entity by learning the entity representation with contextual understanding ("Barack Obama gave a speech to Congress"). By using the type and the factual information, the framework enhances the comprehension of the focus entity in downstream applications like language modeling.[1] We note that others have explored which KG facts have already been learned by specific, advanced/contemporary LMs (Petroni et al., 2019); that work queried a pre-trained BERT model for the types of KG facts it contains. In addition, our primary goal is not broad, state-of-the-art performance, though we demonstrate that very strong performance is achievable. Rather, our goal is to examine the complementary strengths, and evident limitations, of language modeling techniques for knowledge and entity type representation. In doing so, we show that our joint framework yields empirical benefits for the individual tasks. Our models leverage context-independent word embeddings, and we specifically eschew language models pre-trained on web-scale data.[2]
Our results further suggest that schema-free approaches to knowledge graph construction/embedding and fine-grain entity typing should be studied in greater detail, and that competitive, if not state-of-the-art, performance can be obtained with comparatively simpler, resource-starved language models. This has promising implications for low-resource, few-shot, and/or domain-specific information extraction needs.
Using publicly available data, our work has four main contributions. (1) It advocates for a language-modeling based knowledge graph embedding architecture that achieves state-of-the-art performance on knowledge graph completion/fact prediction against comparable methods. (2) It introduces a neural technique based on both knowledge graph embedding and language modeling to predict fine-grain entity types, which yields competitive through state-of-the-art performance against comparable methods. (3) It proposes the joint learning of factual information with the underlying entity types in a shared embedding space. (4) It demonstrates that learning a knowledge graph embedding model and a language model in a shared embedding space is symbiotic, yielding strong KGE performance and drastic perplexity improvements.

[1] Though the entity typing examples here could be interpreted as being hierarchical, our method neither assumes nor requires any type hierarchy.
[2] We do not deny that current pre-trained language models can be effective for other language-based tasks beyond language modeling. However, the reason we do not use transformer LMs like BERT or GPT-2 is that the amount of data they are pre-trained with can make it difficult to (a) fairly compare to previous work (is it the modeling approach, or the underlying, large-scale data at work?), and (b) identify and track the benefits of learning our tasks jointly.

Background
The underlying information in knowledge bases is difficult to comprehend and manipulate (Wang et al., 2014). A vast number of knowledge graph embedding techniques have been proposed over the years to represent the entities and relations in knowledge graphs. RESCAL (Nickel et al., 2011) is one of the first semantic-based embedding techniques that captures the latent interaction between the entities and the relation. A model such as RESCAL can use graph properties to improve the underlying entity and relation representations (Padia et al., 2019; Balazevic et al., 2019; Minervini et al., 2017; Krompaß et al., 2013). A more simplified approach is taken in DistMult (Yang et al., 2014), which restricts the relation matrix to a diagonal matrix.
Neural Tensor Network (NTN) (Socher et al., 2013) is one such technique; it combines relation-specific tensors with the head and tail vector representations through a non-linear hidden layer. Translational methods like TransE use distance-based models to represent entities and relationships in the same vector space R^d. TransH (Wang et al., 2014) overcomes the shortcomings of TransE by modeling the vector representation with a relation-specific hyperplane. TransR (Lin et al., 2015) and TransD (Ji et al., 2015) model the representation similarly to TransH, by using relation-specific spaces and by decomposing the relation-specific projection matrix into a product of two vector representations, respectively.
Recognizing entities as belonging to coarse-grain types has been explored by researchers over the past two decades. Neural approaches have brought advances in extending the prediction problem from coarse-grain entity types to fine-grain entity types. Work by Ling and Weld (2012) was one of the first attempts at predicting fine-grain entity types; it framed the problem as multi-class, multi-label classification. This work also contributed the labeled dataset FIGER, widely used as a benchmark for measuring the performance of fine-grain entity type prediction architectures. Ren et al. (2016a) introduced a method for automatic fine-grain entity typing using hierarchical partial-label embedding. Shimaoka et al. (2016) introduced a neural fine-grain entity type prediction architecture that uses self-attention and handcrafted features to capture the semantic context needed for fine-grain type prediction. Xin et al. (2018b) showed that analyzing sentences with a pre-trained language model enhanced prediction performance. Zhang et al. (2018a) introduced document-level context and showed the importance of a mention-level attention mechanism, alongside sentence-level context, in enhancing the performance of fine-grain entity prediction. Xu and Barbosa (2018) enhanced neural fine-grain entity typing by penalizing the cross-entropy loss with a hierarchical context loss for fine-grain type prediction.
Language modeling has seen great progress in recent times. Bengio et al. (2000) pioneered the renewed use of distributed representations, using a neural language model to deal with the curse of dimensionality imposed by statistical methods. Mikolov et al. (2010) extended this idea by building recurrent neural network-based language models trained with backpropagation through time, better handling long sequences of text.
We are not the first to examine the intersection of knowledge graph embedding and language modeling. Ristoski and Paulheim (2016) and Cochez et al. (2017) directly embed RDF graphs using language-modeling based techniques. Ahn et al. (2017) and Logan IV et al. (2019) have more recently leveraged information from a knowledge base to improve language modeling. However, in addition to knowledge graphs and language modeling, we also consider fine-grain entity typing.
With the success of contextualized vector representations and the availability of large-scale, pre-trained language models, there have been a number of efforts aimed at improving the knowledge implicitly contained in word and sentence representations. For example, Bosselut et al. (2019) introduce COMET, a framework to learn and generate rich and diverse common-sense descriptions via language models (e.g., the autoregressive GPT-2). Similarly, Zhang et al. (2019), among others, provide insights into aspects of LMs on downstream NLP tasks. While we share the overall goal of improving knowledge representation within language modeling, the short-term goals are different: we focus on individual facts, rather than traditional background/commonsense knowledge, and on demonstrating the complementary nature of KGE, entity typing, and LM.

Methodology
This section introduces the framework for jointly learning knowledge graph embedding (KGE), fine-grain entity types (ET), and language models (LM). It uses a multi-task learning architecture built over baseline architectures for all three tasks. We begin by introducing the LM-inspired knowledge graph embedding and fine-grain entity typing architectures; we describe the joint learning architectures in §5. Fundamentally, our approach relies on appropriate and selective parameter sharing across the KGE, ET, and LM tasks in order to learn these models jointly. While joint learning and multi-task learning through shared parameters have been examined before for a number of tasks, we argue that this parameter sharing is a very effective way to improve KGE, ET, and/or LM (for a particular baseline). Its simplicity is a core benefit.

Knowledge Graph Embedding as a Language Model
The architecture in Figure 2 embeds the factual entities and relations. Let G be a knowledge graph (KG) with nodes V and edges E, where V is a set of entities e_1, ..., e_|V| connected to each other by the edges in E, and E is a set of K relations r_1, ..., r_K. The architecture learns to embed the entities and relations into a (traditionally dense) vector space. Given the head entity e_i, relation r_k, and tail entity e_j, we predict whether the given triple (e_i, r_k, e_j) holds. The model is a combination of a bi-LSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) and a feed-forward architecture. In the spirit of language modeling, we represent each input triple x_i as a sequence of n tokens (x_{i1}, x_{i2}, ..., x_{in}). These tokens are represented in a continuous vector space. The bi-LSTM layer produces a learned representation of each token by maintaining two hidden states for each word: the forward state learns a representation from left to right,

  h→_{it} = LSTM(x_{it}, h→_{i,t-1}),    (1)

and the backward state learns a representation from right to left,

  h←_{it} = LSTM(x_{it}, h←_{i,t+1}).    (2)

The forward and backward states of the bi-LSTM layer are concatenated to produce a sequentially encoded representation h_{it} = [h→_{it}; h←_{it}] for each time step t of the input sequence x_i. The bi-LSTM weight matrices W and V and the bias b are learned during training.
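The triple-as-token-sequence view above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual preprocessing: the tiny vocabulary, embedding dimension, and names (`triple_to_sequence`, `EMB_DIM`) are all assumptions for demonstration.

```python
import numpy as np

EMB_DIM = 8
rng = np.random.default_rng(0)

# Toy token vocabulary and randomly initialized (trainable) embeddings.
vocab = {"barack": 0, "obama": 1, "profession": 2, "politician": 3}
embeddings = rng.normal(size=(len(vocab), EMB_DIM))

def triple_to_sequence(head, relation, tail):
    """Flatten a (head, relation, tail) triple into one token-id sequence."""
    tokens = head.split() + relation.split() + tail.split()
    return [vocab[t] for t in tokens]

def embed_sequence(token_ids):
    """Look up the continuous vector for each token in the sequence."""
    return embeddings[token_ids]  # shape: (seq_len, EMB_DIM)

seq = triple_to_sequence("barack obama", "profession", "politician")
X = embed_sequence(seq)  # this matrix would feed the bi-LSTM encoder
```

In the full model, `X` would be consumed one row per time step by the forward and backward LSTM passes of Eqs. (1) and (2).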
In principle the bi-LSTMs can be stacked, though we found not stacking to be empirically effective. Though the bi-LSTM produces a sequence of hidden states, we summarize the information it captures in a single, "final" state C_final, which then represents the information encoded by the whole sequence for the subsequent classification task. We let the rightmost state serve as this "final" state, i.e., C_final = C_{in}. The feed-forward architecture is a multi-layer perceptron with L = 3 rectified linear (ReLU) hidden layers. The input to the feed-forward layers is the learned final cell state representation C_final from the bi-LSTM sequence encoder. The feed-forward process captures the information from the learned sequence encoder and outputs a transformed representation z_l from each layer: z_l = ReLU(W_l z_{l-1} + b_l), with z_0 = C_final, and layer-specific weights W_l and biases b_l.
The output representation z_L is then used to calculate a semantic matching score for the factual input x_i. This score is calculated by combining the learned representation z_L with the final sequentially encoded step representation h_{in}; the product is passed through a sigmoid activation,

  score(x_i; θ) = σ(z_L · h_{in}),    (3)

where θ is the collection of network parameters used for training the language model-inspired knowledge graph embedding architecture. These parameters are jointly learned by minimizing a weighted cross-entropy loss with ℓ2 regularization,

  L(θ) = -Σ_i [ k y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ] + λ ||θ||²,    (4)

where k is the weight assigned to the positive samples during training, y_i represents the original label (with ŷ_i = score(x_i; θ)), and λ is the regularization parameter. As a result of our KGE method, we do not produce or store single, canonical representations of entities and relations. We argue that the lack of a canonical entity embedding is a large benefit of our model. First, it is consistent with the push for contextualized embeddings. Second, we believe that, even in a KG, an entity's precise meaning or representation should depend on the fact/tuple being considered.
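The scoring and loss of Eqs. (3) and (4) can be sketched with numpy as follows. This is a hedged sketch under assumed shapes (z_L and h_final as plain vectors, labels as a {0,1} array); function names and default values for k and λ are illustrative, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triple_score(z_L, h_final):
    # Eq. (3)-style semantic matching score: sigmoid of the interaction
    # between the MLP output z_L and the final bi-LSTM state h_final.
    return sigmoid(np.dot(z_L, h_final))

def weighted_bce_with_l2(y_true, y_pred, params, k=2.0, lam=1e-4):
    # Eq. (4)-style weighted cross-entropy: positive examples are
    # up-weighted by k, plus an L2 penalty over all parameter arrays.
    eps = 1e-12
    ce = -(k * y_true * np.log(y_pred + eps)
           + (1 - y_true) * np.log(1 - y_pred + eps)).mean()
    l2 = lam * sum((p ** 2).sum() for p in params)
    return ce + l2

# Tiny usage example with made-up predictions and one parameter matrix.
params = [np.ones((2, 2))]
y_true = np.array([1.0, 0.0])
y_pred = np.array([0.9, 0.1])
loss = weighted_bce_with_l2(y_true, y_pred, params)
```

Raising `k` increases the penalty on mis-scored positive triples, which matters when true facts are heavily outnumbered by sampled negatives.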

Neural-Fine Grain Entity Type Prediction
Recognizing the type of a given entity has been an integral part of tasks like knowledge base completion, question answering, and coreference resolution. Ling and Weld (2012) extended the problem of entity type prediction to fine-grain entity types. Given an input vector V_x for entity x and a type embedding matrix θ, the function g scores all possible entity types t for the given entity x as g(V_x; θ) = θ^T V_x. The model learns the parameters θ by optimizing a hinge loss to classify a given entity into all of its possible types T. An entity is predicted to be of type t if the corresponding score in g(V_x; θ) is greater than a given threshold value τ (typically, τ = 0.5, though it can be set empirically).
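A minimal sketch of this multi-label type scoring, a standard hinge loss, and threshold-based prediction follows. The {0,1} indicator encoding of gold types and the exact hinge formulation are our assumptions (the loss is only given informally above), so treat this as one plausible instantiation.

```python
import numpy as np

def type_scores(v_x, theta):
    # g(V_x; theta) = theta^T V_x : one score per candidate type.
    return theta.T @ v_x

def hinge_loss(scores, gold):
    # Multi-label hinge: gold types should score >= +1, others <= -1.
    # `gold` is a {0,1} indicator vector over the type inventory.
    signs = 2 * gold - 1            # +1 for gold types, -1 otherwise
    return np.maximum(0.0, 1.0 - signs * scores).sum()

def predict_types(scores, tau=0.5):
    # Assign every type whose score exceeds the threshold tau.
    return scores > tau

# Toy example: 2-dimensional entity vector, 2 candidate types.
theta = np.array([[2.0, -2.0],
                  [0.0,  0.0]])
scores = type_scores(np.array([1.0, 0.0]), theta)
```

With these toy weights the first type scores 2.0 and the second -2.0, so only the first clears the threshold and the hinge loss is zero.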
The architecture in Figure 3 shows the different sets of embedding-based features used to predict the entity type t. Word-level features and context-level features (word spans to the left and right of the entity) are taken into consideration. The feature design used here is similar to that introduced by Shimaoka et al. (2016). We note that our method neither assumes nor requires any type hierarchy, though including a type hierarchy is an avenue for future exploration.

Mention Encoder
We encode a mention representation m as the average of the word embedding vectors u_i for the n words of the given entity mention e: m = (1/n) Σ_{i=1}^{n} u_i.
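The averaging above is straightforward; a sketch with toy two-dimensional vectors standing in for pre-trained GloVe embeddings (the tiny `emb` table is an illustrative assumption):

```python
import numpy as np

# Hypothetical toy embeddings standing in for pre-trained GloVe vectors.
emb = {"barack": np.array([1.0, 0.0]),
       "obama":  np.array([0.0, 1.0])}

def encode_mention(mention, emb):
    """m = (1/n) * sum_i u_i over the mention's word vectors."""
    vecs = [emb[w] for w in mention.split()]
    return np.mean(vecs, axis=0)

m = encode_mention("barack obama", emb)
```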

Context Encoder
The contextual representation for the given mention e is computed by dividing the context into a left context l_c and a right context r_c, where the left context contains all words to the left of the given entity e and the right context contains all words to its right. The left and right contexts are each encoded by a bi-LSTM sequence encoder (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997), similar to the one used by Zhang et al. (2018a). The outputs of the bi-LSTM sequence encoder are the sequential vector representations from both the forward (left-to-right) and backward (right-to-left) passes:

  (l_f, l_b) = BiLSTM(l_c, h, h_{t-1})  and  (r_f, r_b) = BiLSTM(r_c, h, h_{t-1}),

where (l_f, l_b) are the sequential outputs for the left context from the forward and backward passes, (r_f, r_b) are the sequential outputs for the right context, and h and h_{t-1} are the current and previous hidden states, respectively. The left outputs are concatenated to form a left-looking encoding L_c = concat[l_f, l_b], while the right outputs are concatenated to form a right-looking encoding R_c = concat[r_f, r_b]. The complete contextual representation C is the concatenation of the left- and right-context representations, C = concat[L_c, R_c].

Attention We use an attention mechanism to reweight the contextualized token embeddings. The attention layer, similar to that of Shimaoka et al. (2016), is a 2-layer feed-forward neural architecture in which the attention weight for each time step of the context representation is learned given the parameter matrices W_a and W_s: a_i = softmax(W_s tanh(C_i W_a)). The context representation is then the attention-weighted sum C_rep = Σ_{i=1}^{t} a_i C_i. The attention mechanism used here differs from Shimaoka et al. (2016) in that, in our work, the contextual embeddings share the same attention parameters. The features extracted from the mention encoder, m, and the attention-weighted context encoder, C_rep, are concatenated to form a learned representation V = concat(m, C_rep) that is passed to the feed-forward architecture for classification.
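The two-layer attention step can be sketched as below. Shapes are our assumptions for illustration: C is a (T, d) matrix of contextual states, W_a is (d, d_a), and W_s is a (d_a,) vector producing one scalar logit per time step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(C, W_a, W_s):
    # a_i = softmax(W_s tanh(C_i W_a)); C_rep = sum_i a_i * C_i
    scores = np.tanh(C @ W_a) @ W_s        # one scalar logit per time step
    a = softmax(scores)                    # attention weights, sum to 1
    C_rep = (a[:, None] * C).sum(axis=0)   # weighted sum over time steps
    return a, C_rep

# Toy usage with random contextual states and attention parameters.
rng = np.random.default_rng(1)
C = rng.normal(size=(5, 4))
W_a = rng.normal(size=(4, 3))
W_s = rng.normal(size=(3,))
a, C_rep = attend(C, W_a, W_s)
```

Because the same W_a and W_s are applied at every time step (and, in our setting, shared across contexts), the number of attention parameters is independent of sequence length.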
The feed-forward architecture is a 3-layer neural architecture with a batch normalization layer (Ioffe and Szegedy, 2015) between the first and second layers and ReLU activations (Nair and Hinton, 2010). The input to the feed-forward layers is the concatenated representation V from the context and mention encoders, i.e., we initialize q_0 = V. The feed-forward process captures the information from the learned features and outputs a transformed representation q_l = max(0, V_l q_{l-1} + d_l) at each layer, where V_l and d_l are the weights and bias of hidden layer l; the final output layer classifies the given mention into the corresponding entity types.

Experimental Settings
The inputs to the joint learning architectures are the pre-trained GloVe embedding vectors trained on 840 billion tokens (Pennington et al., 2014). The parameters of the baseline and joint learning architectures are learned with stochastic gradient descent, using Adam (Kingma and Ba, 2014) as the learning rate optimizer. The joint learning networks are trained with alternating optimization: the loss functions of the respective tasks are optimized at alternating epochs/intervals. The hyper-parameters for training these joint architectures are chosen manually for the best-performing models on the validation sets.
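The alternating optimization schedule can be sketched as a simple loop that cycles through the tasks, optimizing one task's loss per epoch. The task names and the `step` closures below are illustrative assumptions; in the real system each step would run an epoch of gradient updates on the shared parameters.

```python
def alternating_training(task_steps, n_epochs):
    """task_steps: list of (name, step_fn) pairs; step_fn() runs one
    epoch of optimization for that task's loss and returns its loss."""
    history = []
    for epoch in range(n_epochs):
        # Cycle through the tasks: epoch 0 -> task 0, epoch 1 -> task 1, ...
        name, step = task_steps[epoch % len(task_steps)]
        history.append((epoch, name, step()))
    return history

# Dummy step functions standing in for real per-task optimizers.
tasks = [("KGE", lambda: 0.5), ("ET", lambda: 0.4), ("LM", lambda: 0.3)]
hist = alternating_training(tasks, 6)
```

Because the tasks share parameters, each task's epoch also moves the representations used by the other tasks, which is the mechanism behind the joint gains reported later.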
Data For as direct a comparison to prior work as possible, we use previously studied datasets. We evaluate KG triple classification using the standard datasets of WordNet11 (WN11) and Freebase13 (FB13). WN11 (Strapparava and Valitutti, 2004) is a publicly available lexical graph of synsets (synonyms). Freebase (Bollacker et al., 2008) is a collaborative ontology consisting of factual tuples of entities related to each other through semantic relations. While recent work has advocated for examining variants and other derivatives of these datasets, such as FB15k-237 and WN18RR (Toutanova and Chen, 2015; Dettmers et al., 2018; Padia et al., 2019, i.a.), there is a relative lack of previous experimental work on these newer datasets. Given space limitations, and in order to compare to the vast majority of previous work, we chose to report on the more common WN11 and FB13. We evaluate fine-grain entity type prediction on the well-studied OntoNotes (Hovy et al., 2006) and FIGER (Ling and Weld, 2012) datasets; the OntoNotes dataset used here is the version manually curated by Gillick et al. (2014). Lastly, we evaluate the joint KGE and LM on WikiFacts (Ahn et al., 2017), built from Freebase facts and Wikipedia descriptions. The content of the dataset is limited to the Film/Actor domain of Freebase. Further, the anchor facts defined in the text of the dataset are not used for training the joint model. The descriptions of the entities in the original dataset contain both the summary and the body from Wikipedia; the current study uses only the summary section. The joint model is trained and evaluated with an 80/10/10 split for the train, validation, and test sets, respectively.
Metrics KGE triple classification is evaluated through accuracy. The entity type model's performance is evaluated with three common entity typing metrics, Strict F1, Loose Macro F1, and Loose Micro F1 (Ling and Weld, 2012), while language modeling is measured by perplexity.
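For readers unfamiliar with the entity typing metrics, a compact sketch of the standard definitions from Ling and Weld (2012) follows, assuming each mention receives at least one predicted type; the function names are ours.

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def typing_metrics(preds, golds):
    """preds, golds: parallel lists of sets of type labels, one per mention.
    Returns (strict_f1, loose_macro_f1, loose_micro_f1)."""
    n = len(preds)
    # Strict: a mention counts only if its predicted set exactly matches gold.
    exact = sum(p == g for p, g in zip(preds, golds)) / n
    strict = f1(exact, exact)
    # Loose macro: average per-mention precision and recall.
    mp = sum(len(p & g) / len(p) for p, g in zip(preds, golds) if p) / n
    mr = sum(len(p & g) / len(g) for p, g in zip(preds, golds)) / n
    macro = f1(mp, mr)
    # Loose micro: pooled over all (mention, type) decisions.
    inter = sum(len(p & g) for p, g in zip(preds, golds))
    micro = f1(inter / sum(len(p) for p in preds),
               inter / sum(len(g) for g in golds))
    return strict, macro, micro
```

For example, predicting {/person} for a mention whose gold types are {/person, /person/politician} is an exact-match miss (Strict 0) but still earns partial credit under both loose metrics.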
Previous Work as Baselines When possible, we directly compare our model's performance to that of previously published work.

Results and Discussion
This section presents the results of our basic KGE and entity typing models and of the joint learning architectures, along with comparisons to previous methods. The models were trained using either a 16GB V100 or an 11GB 2080 Ti GPU (single-GPU training only).

The Effectiveness of a LM-inspired KGE
The proposed knowledge graph embedding architecture (§3.1) is trained for the triple classification task: given an input triple x_i, predict whether the fact it represents is true. Table 1 provides an overview of the performance of our architecture in comparison to previously studied approaches, with numbers obtained from the corresponding papers.
Examining the results on WN11 and FB13, we see that in all but one case our approach improves upon the state-of-the-art performance on the triple classification task; in that one case (DistMult-HRS on WN11) our model was very competitive. These strong results support our hypothesis that language modeling principles can form an effective knowledge graph embedding technique. In examining per-relation performance on both WN11 and FB13, we observed an increase in the lower bound of accuracy results for relations on both WordNet and Freebase, compared to Socher et al. (2013): accuracy rises from Socher et al. (2013)'s 75.5% to 81% for the domain region relation from WordNet, and on Freebase, performance for the institution relation goes from 77.2% to 80.9% with the current architecture.
Recently, Yao et al. (2019) presented KG-BERT, which uses a pre-trained BERT model to encode and classify triples. While this approach is empirically powerful and surpasses ours, we note that, due to the limited training context of the current architecture, directly comparing those triple classification results with ours would mischaracterize the strengths and limitations of both approaches. Considering the training complexity and costs of transformer networks, our model presents an appealing balance between efficacy and efficiency.

The Effectiveness of Entity Typing with KGE-Inspired Models
Our novel neural fine-grain entity type prediction technique is compared with previous approaches in Table 2. The neural architecture provides an improvement on FIGER in F1. For a direct comparison, the datasets used for the experiments are the same as those used by Shimaoka et al. (2016) and Zhang et al. (2018a). Our method uses a margin-based loss function to learn entity types, and it outperforms all previous methods that learn fine-grain entity type prediction through margin-based loss functions and were evaluated on the same datasets (Abhishek et al., 2017; Ren et al., 2016a,b).

The Effectiveness of Joint KGE and Entity Typing
Building on the baseline models, the joint model (Figure 3) addresses the implicit constraints given in the knowledge graph. The architecture learns to correlate the mention entities with the entities present in the context, addressing the problems of "context-entity separation" and "text-knowledge separation" defined by Xin et al. (2018a). Our results demonstrate the effectiveness of learning fine-grain entity types and knowledge graph embedding jointly, with steady performance on either task with respect to their baselines.

The Effectiveness of Joint KGE and Language Modeling
We examine the complementary nature of LM and KGE on the WikiFacts dataset introduced by Ahn et al. (2017), which contains both sentences and KGE-style tuples. Figure 4 shows the architecture for jointly learning to embed a KG and model language. We use a single-layer LSTM (unidirectional: left-to-right) for language modeling, though the core KGE architecture relies on a bi-LSTM. We unify these by ensuring that the LM LSTM and the left-to-right portion of the KGE bi-LSTM use the same weights. We compare this joint approach to the same models trained separately and independently, without any weight sharing, evaluating the LMs on perplexity (lower is better) and KG prediction accuracy (higher is better). We use a vocabulary of the 70k most frequent words. Table 6a reports results for the configuration of Figure 4 (a bi-LSTM KGE whose forward cells are the cells of a unidirectional LSTM LM), and Table 6b reports results where we replace the bi-LSTM KGE with the LSTM LM. As Table 6a shows, while there is a very slight decrease in KG prediction accuracy, the distinct improvement of the language model over the baseline LM demonstrates that joint learning is particularly effective for language modeling. This suggests that even simple joint learning can be an effective way of using stated knowledge to improve language modeling.
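The weight tying between the LM and the forward half of the KGE bi-LSTM can be sketched as two encoders that hold references to the same parameter object, so gradients from either task would update the shared weights. Simple tanh-recurrence cells stand in for LSTM cells here, and all names are illustrative assumptions.

```python
import numpy as np

class SharedCell:
    """A toy recurrent cell; stands in for an LSTM cell."""
    def __init__(self, dim, rng):
        self.W = rng.normal(size=(dim, dim))  # input weights
        self.V = rng.normal(size=(dim, dim))  # recurrent weights

    def step(self, x, h):
        return np.tanh(self.W @ x + self.V @ h)

rng = np.random.default_rng(0)
shared = SharedCell(4, rng)    # used by the LM AND the KGE forward pass
backward = SharedCell(4, rng)  # KGE-only, right-to-left direction

def lm_encode(xs):
    # Unidirectional LM pass, left to right, using the shared cell.
    h = np.zeros(4)
    for x in xs:
        h = shared.step(x, h)
    return h

def kge_encode(xs):
    # Forward half of the bi-LSTM reuses the LM's cell object;
    # the backward half has its own, KGE-specific weights.
    hf = lm_encode(xs)
    hb = np.zeros(4)
    for x in reversed(xs):
        hb = backward.step(x, hb)
    return np.concatenate([hf, hb])

xs = [np.full(4, 0.1) for _ in range(3)]
```

Because `shared` is one object, a gradient step on the KGE loss would alter the LM's recurrence (and vice versa), which is exactly the coupling the joint results measure.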
While joint learning allowed the KG to help the LM, the reverse was not true. We speculate that this is in part because, from a language modeling perspective, the KGE model is able to consider both the forward and backward components. To test this, we replace the KGE bi-LSTM with the same unidirectional LSTM used by the LM. We show these results in Table 6b. Similar to the previous results, KGE allowed LM perplexity to decrease significantly. However, we also see that the LM yielded a 3 point absolute improvement in KG prediction, supporting our hypothesis.

Table 7: An example of the sentence predicted by the language model jointly learned with knowledge graph embedding versus the independently trained language model. Notice how some implicit constraints, learned from the KGE, are transferred to the language model.

Input sentence: stephen percy steve harris born 12 march 1956 is an english musician and songwriter known as the bassist occasional keyboardist backing vocalist primary songwriter and founder of the british heavy metal band iron maiden he is the only member of iron maiden to have remained in the band since their inception in 1975 and along with guitarist dave murray to have appeared on all of their albums

Output (joint model): joseph john james unk born 5 april 1949 is an english musician and actor known as the greatest and guitarist the vocalist guitarist songwriter and guitarist of the band heavy metal band the band he is the founding child of the team band have been by the band until its death in 2003 and toured with unk unk unk they have appeared in one of

Output (baseline): peter baron dickie unk born 11 august 1943 is an english singer and best and as the most and and and and lead songwriter and member of the heavy rock rock band unk side he is the third singer of the band band have been with the band since its breakup in 1992 while cofounded with with dave tended has have collaborated on hundreds of their films
To further demonstrate how our joint learning method improves semantic understanding of the language, we qualitatively examine the generative capacity of these LMs in Table 7. The table provides an example of how jointly training a KG and LM can improve output over a singly-trained LM on the same language data, and suggests that joint learning allows the transfer of some implicit constraints into the language model by learning the underlying relationships between the entities. While both outputs are over-reliant on conjunctive structure, notice how the singly-trained baseline LM starts off reasonably but loses coherence as the generation continues, while the jointly trained model maintains coherence for longer. This suggests the KGE training successfully transfers appropriate thematic/factive knowledge to the LM.

Conclusion
This work proposes a joint learning framework for learning real-valued representations of words, entities, and relations in a shared embedding space. Jointly learning factual representations with contextual understanding improves the learning of entity types, and learning a language model and a knowledge graph embedding model simultaneously enhances performance on both modeling tasks. Our results suggest that language modeling could accelerate the study of schema-free approaches to both KGE and fine-grain entity typing, and that strong performance can be obtained with comparatively simpler, resource-starved language models. This has promising implications for low-resource, few-shot, and/or domain-specific information extraction needs.