Put It Back: Entity Typing with Language Model Enhancement

Entity typing aims to classify the semantic types of an entity mention in a specific context. Most existing models obtain training data via distant supervision, and thus inevitably suffer from noisy labels. To address this issue, we propose entity typing with language model enhancement. It uses a language model to measure the compatibility between context sentences and labels, and thereby automatically focuses more on context-dependent labels. Experiments on benchmark datasets demonstrate that our method enhances the entity typing model with information from the language model, and significantly outperforms the state-of-the-art baseline. Code and data for this paper can be found at https://github.com/thunlp/LME.

* Corresponding author: Zhiyuan Liu.

Table 1: Replacing the entity mention Schwarzenegger with the names of its types.

Raw:  Schwarzenegger was elected to be the governor.
      Schwarzenegger acted in the film Terminator.
Good: (A) politician was elected to be the governor.
      (An) actor acted in the film Terminator.
Bad:  (An) actor was elected to be the governor.
      (A) politician acted in the film Terminator.

All current fine-grained entity typing (FET) models rely on distant supervision (DS) (Mintz et al., 2009) to obtain training data, due to the lack of large-scale human-labeled data. Such reliance on DS has been a significant problem for entity typing. In DS, an entity mention in a context sentence is first linked to a named entity in the knowledge base (KB). The entity has type labels stored in the KB, and all of these labels are assigned to the entity mention. In other words, they are noisy global labels that ignore the specific context of the mention. Entity typing, on the other hand, aims to predict context-dependent types of the entity mention, and test datasets are all human-labeled. This difference between DS and human annotation leads to a large performance gap between the training/development and test sets. To address this problem, we propose Entity Typing with Language Model Enhancement (LME), which measures the compatibility between the context sentence and each distantly supervised label in an unsupervised manner, using the meaning of the label.
In previous works, the hierarchical structure of labels has been exploited (Ren et al., 2016b; Karn et al., 2017; Xu and Barbosa, 2018). However, to the best of our knowledge, the valuable information in the names of labels has never been used: for these models, whether a label is called /person/actor or /foo/bar makes no difference. We argue that the meaning of the entity mention words can, to some extent, also be expressed by the name of its context-dependent type. Under this argument, replacing a mention with a context-dependent type makes more sense than replacing it with a global-but-context-irrelevant one. We provide an example in Table 1. The entity Schwarzenegger has types /actor and /politician, and we can see that replacements with context-dependent types produce better sentences.
The natural way to evaluate the soundness of sentences is language modeling (Bengio et al., 2003;Mikolov et al., 2010). Our method employs a language model to evaluate the soundness of each synthetic sentence generated by replacing the entity mention with its type's name. It is able to focus more on context-dependent types.
We conduct experiments to compare our model with the state-of-the-art baseline on two widely used datasets. The results demonstrate that LME is capable of improving entity typing systems by considering the meaning of labels, and alleviating the problem of noise in distantly-supervised entity typing.

Model
Our model ( Figure 1) consists of two parts: an entity typing (ET) module, and a language model enhancement (LME) module.
ET predicts a probability distribution vector y for an entity mention, where each entry y_i represents the predicted probability of the i-th type label.
In the training phase, LME optimizes a language model whose input includes y, and back-propagates gradients through y to the parameters of ET. In the testing phase, LME is not involved, and y is used directly for inference: if y_i is greater than the threshold 0.5, the i-th type is predicted; if all entries are below the threshold, the type with the greatest entry is predicted.
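As a concrete illustration, the inference rule can be sketched as follows (function and variable names are ours, not from the released code):

```python
import numpy as np

def infer_types(y, threshold=0.5):
    """Decode a predicted probability vector y into a list of type indices.

    Types whose probability exceeds the threshold are predicted; if no
    entry clears the threshold, fall back to the single highest-scoring type.
    """
    y = np.asarray(y)
    predicted = [i for i, p in enumerate(y) if p > threshold]
    if not predicted:
        predicted = [int(np.argmax(y))]
    return predicted
```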

Entity Typing Module
Entity typing is defined on an ontology T (the set of all labels). Given an entity mention e and its context sentence s = {l_1, l_2, ..., e, r_1, r_2, ...} (l_i and r_i are left and right context words), the typing model predicts a vector y indicating the probability distribution over all labels in the ontology:

y = σ(W_y [v_M; v_C; v_F]),

where σ is the sigmoid function, W_y is a parameter matrix, and [; ;] denotes concatenation. The three vectors v_M (mention), v_C (context), and v_F (feature) are built from e and s as follows.
Entity mention vector The entity mention may contain multiple words e_1, e_2, ...; v_M is the average of their word embeddings.
Context vector Two bi-directional LSTMs (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) are used for the left and right context words. The outputs of the BiLSTMs further go through a self-attention layer, and v_C is the concatenation of the attention-layer outputs.
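The attention step can be sketched as follows. This is a simplified linear scorer over the BiLSTM outputs; the exact parameterization (a two-layer feed-forward scorer) follows Shimaoka et al. (2017), and the names here are ours:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_context(H, w):
    """Toy self-attention over BiLSTM outputs.

    H: (T, d) matrix of per-step BiLSTM outputs; w: (d,) scoring vector.
    Each step gets a scalar score, the scores are normalized with a
    softmax, and the result is the attention-weighted sum of the rows of H.
    """
    a = softmax(H @ w)   # attention weights over the T time steps
    return a @ H         # weighted sum, shape (d,)
```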
Hand-crafted feature vector A sparse feature vector f is built from the entity mention e. The features are adopted from those used by Gillick et al. (2014) and Yogatama et al. (2015).
v_F = W_f f,

where W_f is a projection matrix. After y is calculated, DS provides a label vector y* ∈ {0, 1}^|T|, where |T| is the number of labels. The loss function for typing is the cross-entropy between y and y*:

J_et = − Σ_i ( y*_i log y_i + (1 − y*_i) log(1 − y_i) ).
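The computation above can be sketched as a minimal NumPy forward pass. Shapes, names, and the explicit per-label cross-entropy form are illustrative assumptions, not the released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def et_forward(v_m, v_c, f, W_f, W_y):
    """ET module forward pass: project the sparse feature vector,
    concatenate the mention, context, and feature representations,
    and map the result to per-label probabilities with a sigmoid."""
    v_f = W_f @ f
    v = np.concatenate([v_m, v_c, v_f])
    return sigmoid(W_y @ v)

def typing_loss(y, y_star, eps=1e-12):
    """Per-label cross-entropy between predictions y and DS labels y*."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(y_star * np.log(y) + (1.0 - y_star) * np.log(1.0 - y))
```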

Language Model Enhancement Module
The core part of the LME module is an LSTM language model (Sundermeyer et al., 2012). The language model takes a sentence {w_1, w_2, ..., w_n} as input and assigns a probability to it. Concretely, at step i, the LSTM reads the word sub-sequence {w_1, ..., w_i} and predicts the probability of w_{i+1} following the sub-sequence. A well-trained language model assigns high probability to a reasonable sentence.

Before applying the LME module to enhance the ET module, the language model is pre-trained on sentences from the training set. The loss function for s in the pre-training phase is

J_pre = LM({l_1, l_2, ..., e, r_1, r_2, ...}),

where boldface letters denote the word embeddings of the corresponding words, and LM(·) is the language model loss: the accumulated step-wise negative log-probability of each word in the input sequence. A well-trained language model thus yields a smaller loss for a more reasonable sentence.

After pre-training, the LME module is combined with the ET module. Concretely, we assign an embedding vector L_i to each label and take the sum of label embeddings weighted by y. This sum h replaces e in the input sequence of the language model:

h = L y,
J_lm = LM({l_1, l_2, ..., h, r_1, r_2, ...}),

where L is the matrix of all label embeddings and J_lm is the language model loss used in the training phase. To ensure that label embeddings lie in the same semantic space as word embeddings, L is initialized with the word embeddings of the labels' names.
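The soft replacement of the mention can be sketched as follows (the shape convention for L, with one column per label, is our assumption for illustration):

```python
import numpy as np

def soft_label_embedding(y, L):
    """h = L y: the sum of label embeddings weighted by the typing
    probabilities. L has shape (d, |T|), one column per label."""
    return L @ y

def lm_input_sequence(left, h, right):
    """Build the LM input: the mention embedding is replaced by h,
    flanked by the left and right context word embeddings."""
    return left + [h] + right
```

Because h is a differentiable function of y, the language model loss computed on this sequence back-propagates through y into the ET module's parameters.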
In the training phase, the parameters of the ET module are updated w.r.t. the joint loss

J = J_et + λ J_lm,

where λ is the weight that balances the two losses.
The ET module has a much smaller parameter space than the language model. To make full use of the gradients, we update only the parameters of the ET module and keep the language model fixed in the training phase. Since the language model is fixed, minimizing J_lm adjusts the probability distribution y: if a label i is compatible with the context sentence, its entry y_i is pushed toward a high value. Gradients are back-propagated through y and update the parameters of the ET module. In this way, y learns to be more context-dependent.

Dataset
We employ two well-established and widely-used datasets to evaluate our model: WIKI (Ling and Weld, 2012) and ONTONOTES (Gillick et al., 2014). The training portions of both datasets are labeled with DS, while the test portions are annotated by humans. They are therefore suitable for evaluating how well our model narrows the gap between DS and ground-truth context-dependent labels. Statistics of the two datasets are provided in Table 2.

Experiment Settings
The baseline for comparison is the hybrid model NFGEC proposed by Shimaoka et al. (2017), which is exactly the ET module of our model. Our own model is referred to as NFGEC+LME. We implement our model based on the source code of NFGEC. For a fair comparison, the ET module is left unchanged, including all hyperparameters and the parameter initialization scheme. Word embeddings are initialized with the pretrained embeddings provided by Pennington et al. (2014).
There are a few additional hyperparameters in our model. The most important one is λ, the weight between the two parts of the loss function. Others include the learning rate r for pre-training the language model and the hidden size h of the LSTM used in the language model. We perform a grid search based on development-set performance, and set r = 0.005 and h = 500. The choice of λ is discussed in Section 3.4.

Overall Results
We compare vanilla NFGEC and NFGEC+LME in Table 3. The results of NFGEC come from Shimaoka et al. (2017). For NFGEC+LME, λ is set to 0.005 on WIKI and 0.001 on ONTONOTES.
Evaluation metrics include strict accuracy, macro-F1, and micro-F1 (Ling and Weld, 2012). From the results we see that: (1) On both datasets, LME consistently helps NFGEC to better classify entity mentions into their context-dependent types, with improvements in all metrics. This is because LME can evaluate the appropriateness of each label and distinguish context-dependent labels from global-but-context-irrelevant ones, helping the system focus on more reasonable types.
(2) Among all metrics, the improvement in strict accuracy is the most significant. Strict accuracy is the proportion of entity mentions whose predicted types are completely identical to the human annotation, and is therefore the most telling metric of how robust the system is against noisy labels. LME's ability to alleviate DS noise thus contributes most to strict accuracy.

Analysis of λ
We choose the optimal λ values for the results in Table 3 according to performance on the development set. We then compare test-set results under different values of λ in Figure 2.
The conclusions from the previous subsection appear again: when λ is set to a proper value, our model consistently outperforms the baseline on all metrics, and strict accuracy shows the most significant improvement. We also notice that performance deteriorates when λ grows too large, and may even fall below the baseline. The reason is that LME acts as a kind of regularization: it plays a role only in the training phase, trading training-set performance for test-set performance. As a regularization coefficient, λ must therefore be chosen carefully.

Qualitative Analysis
To give an intuitive sense of the model, we provide an example of LME's effect.
In the following sentence (from the test set of WIKI), both models try to predict the type of Lake Placid, which in this context is a town in New York. Table 4 shows all labels that either receive a score above the threshold 0.5 from at least one model or are annotated as true by humans.

Scaringe dismissed Brian Barrett of Lake Placid as one of his defense attorneys.

NFGEC predicts a high score for /person and a low score for /location, probably because both words of the entity mention are capitalized and thus resemble a person's name. LME, however, may consider the sentence structure person of location more reasonable than person of person, and makes the correct judgment between these two labels. For /location/city, LME also shows higher confidence than NFGEC, although the score regrettably remains below the threshold. This also reveals a weakness of LME: it is limited by the performance of the ET module. Addressing this limitation is a direction for future improvement.

Conclusion
In this paper, we propose LME, a novel architecture for improving entity typing systems. It uses a language model and a set of label embeddings to judge the compatibility between labels and context sentences, thereby reducing the noise introduced by DS. Experiments demonstrate that LME helps NFGEC, a state-of-the-art entity typing model, alleviate the problem of noisy labels and reach a new state-of-the-art performance. Since the LME module does not depend on the internals of the ET module, we are confident that LME can be adapted to other entity typing systems as well.
Future Work Utilizing the meaning of labels to alleviate DS noise is an interesting direction. We make a first attempt in this paper, and we believe the direction is worth further exploration, for example: (1) how to train a language model that is sensitive to incorrect labels; (2) how to combine the meaning of labels with the hierarchical structure of types; (3) how to easily find the optimal λ for a new dataset. LME may also be extended to other tasks that suffer from the noise and incompleteness of DS, such as relation extraction (Takamatsu et al., 2012; Ritter et al., 2013; Lin et al., 2016). However, since a relation does not occupy a specific position in the sentence, this requires more effort than a simple replacement.