Evaluating Word Embeddings in Multi-label Classification Using Fine-Grained Name Typing

Embedding models typically associate each word with a single real-valued vector that represents its different properties. Evaluation methods, therefore, need to analyze the accuracy and completeness of these properties in embeddings. This requires fine-grained analysis of embedding subspaces, for which multi-label classification is an appropriate tool. We propose a new evaluation method for word embeddings based on multi-label classification given a word embedding. The task we use is fine-grained name typing: given a large corpus, find all types that a name can refer to based on the name embedding. Given the scale of entities in knowledge bases, we can build datasets for this task that are complementary to current embedding evaluation datasets in that they are very large, contain fine-grained classes, and allow the direct evaluation of embeddings without confounding factors like sentence context.


Introduction
Distributed representation of words, also known as word embedding, is an important element of many natural language processing applications. The quality of word embeddings is assessed using different methods. Baroni et al. (2014) evaluate word embeddings on different intrinsic tests: similarity, analogy, synonym detection, categorization and selectional preference. Different concept categorization datasets are introduced. These datasets are small (<500) (Baroni et al., 2014; Rubinstein et al., 2015) and therefore measure the goodness of embeddings by the quality of their clustering. Usually cosine is used as the similarity metric between embeddings, ignoring subspace similarities.
Extrinsic evaluations are also used, cf. Li and Jurafsky (2015). In these tasks, embeddings are used in context/sentence representations with composition involved.
Figure 1: Types (ellipses; green) of the entities (rectangles; red), to which the name "Washington" can refer. Ideally, the embedding for "Washington" should represent all these types.
In this paper, we propose a new evaluation method. In contrast to the prior work on intrinsic evaluation, our method is supervised, large-scale, fine-grained, automatically built, and evaluates embeddings in a classification setting where different subspaces of embeddings need to be analyzed. In contrast to the prior work on extrinsic evaluation, we evaluate embeddings in isolation, without confounding factors like sentence contexts or composition functions.
Our evaluation is based on an entity-oriented task in information extraction (IE). Different areas of IE try to predict relevant data about entities from text, either locally (i.e., at the context level) or globally (i.e., at the corpus level). Examples are local (Zeng et al., 2014) and global (Riedel et al., 2013) approaches in relation extraction, and local (Ling and Weld, 2012) and global (Yaghoobzadeh and Schütze, 2015) approaches in entity typing. In most global tasks, each entity is indexed with an identifier (ID) that usually comes from knowledge bases such as Freebase. Exceptions are tasks in lexicon generation or population like entity set expansion (ESE) (Thelen and Riloff, 2002), which are global but without entity IDs. ESE usually starts from a few seed entities per set and completes the set using pattern-based methods.
Here, we address the task of fine-grained name typing (FNT), a global prediction task, operating on the surface names of entities. FNT and ESE share applications in name lexicon population. FNT differs from ESE in that we assume sufficient training instances are available for each type to train supervised models.
The challenging goal of FNT is to find the types of all entities a name can refer to. For example, "Washington" might refer to several entities, which in turn may belong to multiple types; see Figure 1. In this example, "Washington" refers to "Washington DC (city)", "Washington (state)", or "George Washington (president)". Also, each entity can belong to several types, e.g., "George Washington" is a POLITICIAN, a PERSON and a SOLDIER, and "Washington (state)" is a STATE and a LOCATION.
Learning global representations for entities is very effective for global prediction tasks in IE (cf. Yaghoobzadeh and Schütze (2015)). For our task, FNT, we also learn a global representation for each name. By doing so, we see this task as a challenging evaluation for embedding models. We intend to use FNT to answer the following questions: (i) How well can embeddings represent distinctive information, i.e., different types or senses? (ii) Which properties are important for an embedding model to do well on this task?
We build a novel large-scale dataset of (name, types) pairs from Freebase with millions of examples. The size of the dataset enables supervised approaches to work, an important requirement for looking at different subspaces of embeddings (Yaghoobzadeh and Schütze, 2016). Also, in contrast to concept categorization datasets, names in FNT are multi-labeled, which requires looking at multiple subspaces of embeddings.
In summary, our contributions are (i) introducing a new evaluation method for word embeddings and (ii) publishing a new dataset that is a good resource for evaluating word embeddings and is complementary to prior work: it is very large, contains more classes than previous word categorization datasets, and allows the direct evaluation of embeddings without confounding factors like sentence context.

Related Work
Embedding evaluation. Baroni et al. (2014) evaluate embeddings on different intrinsic tests: similarity, analogy, synonym detection, categorization and selectional preference. Schnabel et al. (2015) introduce tasks with more fine-grained datasets. The concept categorization datasets used for embedding evaluation are mostly small (<500) (Baroni et al., 2014) and therefore measure the goodness of embeddings by the quality of their clustering. In contrast, we test embeddings in a classification setting and different subspaces of embeddings are analyzed. Extrinsic evaluations are also used (Li and Jurafsky, 2015; Köhn, 2015; Lai et al., 2015). In most tasks, embeddings are used in context/sentence representations with composition involved. In this work, we evaluate embeddings in isolation, on their ability to represent multiple senses.
Related tasks and datasets. Our proposed task is fine-grained name typing (FNT). A related task is entity set expansion (ESE): given a set of a few seed entities of a particular class, find other entities (Thelen and Riloff, 2002; Gupta and Manning, 2014). We can formulate FNT as ESE; however, the two differ in their assumptions about training data. For our task, we assume that enough instances are available for each type, so that a supervised learning approach can be used. In contrast, for ESE, usually only 3-5 seeds are given per set, which makes an evaluation like ours impossible.
Named entity recognition (NER) consists of recognizing and classifying mentions of entities locally in a particular context (Finkel et al., 2005). Recently, there has been increased interest in fine-grained typing of mentions (Ling and Weld, 2012; Yogatama et al., 2015; Ren et al., 2016; Shimaoka et al., 2016). One way of solving our task is to collect every mention of a name, use NER to predict the context-dependent types of mentions, and then take all predictions as the global types of the name. However, our focus in this paper is on how embedding models perform, and we propose this task as a good evaluation method. We leave the comparison to an NER-based approach for future work.
Corpus-level fine-grained entity typing is the task of predicting all types of entities based on their mentions in a corpus (Yaghoobzadeh and Schütze, 2015; Yaghoobzadeh and Schütze, 2017; Yaghoobzadeh et al., 2018). This is similar to our task, FNT, but in FNT the goal is to find the corpus-level types of names. Corpus-level entity typing has also been used for embedding evaluation (Yaghoobzadeh and Schütze, 2016). However, that evaluation needs a corpus annotated with entities. For FNT, pretrained word embeddings are sufficient for the evaluation.
Finally, there exists some previous work on FNT, e.g., Chesney et al. (2017). Unlike our work, they do not focus explicitly on the evaluation of embedding models, and their dataset contains only a limited number of types. We use 50 different types, which makes our dataset suitable for the type of evaluation intended.

Multi-label Classification of Word Embeddings
Word embeddings are global representations of word properties learned from the context distribution of words. Words are usually ambiguous and belong to multiple classes, e.g., multiple part-of-speech tags or multiple meanings. A good word embedding should represent all information about the word, including its multiple classes. Our evaluation methodology is based on this hypothesis and tries to test it through multi-label classification of word embeddings. Here, we focus on the semantic properties of nouns and entity names. We try to find all categories or types of a noun given its embedding.
Multi-label classification of embeddings has multiple advantages over current evaluation methods: (i) large datasets can be created without much human annotation; (ii) more fine-grained analysis of the results is possible through analyzing classification performance; (iii) it allows the direct evaluation of embeddings without confounding factors like sentence context.

Fine-grained Name Typing
We assume that the following are given: a set of names N, a set of types T, a membership function m : N × T → {0, 1} such that m(n, t) = 1 iff name n has type t, and a large corpus C. In this problem setting, we address the task of fine-grained name typing (FNT): we want to infer from the corpus, for each pair of name n and type t, whether m(n, t) = 1 holds.
For example, for the name "Hamilton", we should find all of the following: LOCATION, ORGANIZATION, PERSON, CITY, SPORTS TEAM and SOLDIER, since "Hamilton" can describe entities belonging to those types. Another example is "Falcon", which is used for ANIMAL, AIRPLANE, SOFTWARE and ART. FNT sheds light on the extent to which these fine-grained types can be inferred from a corpus using embeddings.
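The membership function can be made concrete with a minimal Python sketch. It uses only the two names and the type lists mentioned above; the helper names `m` and `label_vector`, and the choice to derive the type inventory from these two examples, are illustrative assumptions rather than part of the dataset release.

```python
from typing import Dict, List, Set

# Toy gold annotations copied from the two examples in the text.
NAME_TYPES: Dict[str, Set[str]] = {
    "Hamilton": {"LOCATION", "ORGANIZATION", "PERSON", "CITY", "SPORTS TEAM", "SOLDIER"},
    "Falcon": {"ANIMAL", "AIRPLANE", "SOFTWARE", "ART"},
}

# Type inventory T in a fixed order, so that label vectors are comparable.
# (Assumption: in the real dataset T would be the full list of 50 types.)
TYPES: List[str] = sorted(set().union(*NAME_TYPES.values()))


def m(name: str, type_: str) -> int:
    """Membership function m(n, t): 1 iff name n has type t, else 0."""
    return int(type_ in NAME_TYPES.get(name, set()))


def label_vector(name: str) -> List[int]:
    """Multi-hot target vector over TYPES for a single name."""
    return [m(name, t) for t in TYPES]
```

Each name thus becomes one training example whose target is a multi-hot vector, which is what turns FNT into a multi-label classification problem.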

Embedding-based Model
We aim to find P(t|n), i.e., the probability of name n having type t. Given sufficient training instances for each type t, we can formulate the problem as a multi-label classification task. As input, we use a representation for n, learned from the corpus C. Distributional representations have been shown to capture various types of information about a word, especially its categories or types (Yaghoobzadeh and Schütze, 2015).
After learning an embedding for n, we train two kinds of binary classifiers for each type t to estimate P(t|n): (i) linear: logistic regression (LR) with stochastic gradient descent; and (ii) nonlinear: multi-layer perceptron (MLP) with one hidden layer and ReLU as the non-linearity. We use the Scikit-learn (Pedregosa et al., 2011) toolkit for training our classifiers.

Dataset
Using Freebase (Bollacker et al., 2008), we first retrieve the set of all entities E_n for each name n. Then, we consider the types of all e ∈ E_n to be the types of n. See Figure 1 for an example: all of the shown types belong to the name "Washington".
Since some of the roughly 1,500 Freebase types have very few instances, we first map them to the FIGER (Ling and Weld, 2012) type set, which contains 113 types. We then further restrict our set to the 50 most frequent types. See Table 5 for the list of types.
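The construction steps above can be sketched as follows. All identifiers in this example (the `kb:` type names, the entity names and the mapping table) are hypothetical toy data; the real inputs are Freebase and the Freebase-to-FIGER mapping, and type frequency is counted over names here as one plausible reading of "most frequent".

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical toy knowledge base (stand-in for Freebase).
NAME_TO_ENTITIES: Dict[str, List[str]] = {
    "Washington": ["Washington_DC", "Washington_state", "George_Washington"],
    "Lincoln": ["Abraham_Lincoln"],
}
ENTITY_TO_KB_TYPES: Dict[str, Set[str]] = {
    "Washington_DC": {"kb:city", "kb:location"},
    "Washington_state": {"kb:state", "kb:location"},
    "George_Washington": {"kb:person", "kb:politician", "kb:soldier"},
    "Abraham_Lincoln": {"kb:person", "kb:politician", "kb:rare_type"},
}
# Hypothetical mapping to the coarser target type set (stand-in for FIGER).
# KB types without an entry are dropped.
KB_TO_TARGET: Dict[str, str] = {
    "kb:city": "CITY", "kb:state": "STATE", "kb:location": "LOCATION",
    "kb:person": "PERSON", "kb:politician": "POLITICIAN", "kb:soldier": "SOLDIER",
}


def name_types(name: str) -> Set[str]:
    """Union of the mapped types of all entities the name can refer to."""
    types: Set[str] = set()
    for entity in NAME_TO_ENTITIES.get(name, []):
        for kb_type in ENTITY_TO_KB_TYPES.get(entity, set()):
            if kb_type in KB_TO_TARGET:
                types.add(KB_TO_TARGET[kb_type])
    return types


def build_dataset(top_k: int) -> Dict[str, Set[str]]:
    """Restrict every name's types to the top_k most frequent types."""
    all_types = {name: name_types(name) for name in NAME_TO_ENTITIES}
    freq = Counter(t for types in all_types.values() for t in types)
    kept = {t for t, _ in freq.most_common(top_k)}
    return {name: types & kept for name, types in all_types.items()}
```

With `top_k=50` and the real resources, `build_dataset` would correspond to the restriction to the 50 most frequent types described above.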
Results and analysis. We report the results for all embedding models using LR and MLP in Table 3. We use evaluation measures that are also used in entity typing (Yaghoobzadeh and Schütze, 2015).
Analysis on the number of name types. As a separate analysis, we measure how the classification performance depends on the number of types N of a name. To do so, we group test names based on their number of types. We keep the groups that have more than 100 members. Then, we plot the F1 results of the CBOW and CWIN models with the MLP classifier in Figure 2.
As shown, both models get their best results on names with N = 2. We suppose that the poor performance for N = 1 is related to the fact that one-type names have missing types in our dataset due to the incompleteness of Freebase. The worse F1 for N >= 3 compared to N = 2 is expected, since a larger N means that the models need to predict more types from the name embeddings. From N = 4 on, somewhat surprisingly, the F1 increases as N increases. This is perhaps related to the frequency of names in the corpus and its relation to the number of name types: as N increases, the frequency of words increases and the embedding has a better quality. However, this is only a hypothesis and more investigation is required. The other observation concerns the trend of the CBOW and CWIN results. CBOW is worse for N <= 2, but it works clearly better for N > 2. This shows that the embedding models behave differently depending on the number of classes a name belongs to. This could also be related to the frequency of words. Analyzing the reasons would be interesting; we leave this for future work.
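The grouping analysis described above can be sketched in a few lines of Python. The micro-F1 definition used here (pooling true positives, false positives and false negatives over all name-type decisions in a group) is the standard one and is an assumption about how the scores in Figure 2 are computed; the function names and the dictionary-based input format are illustrative.

```python
from collections import defaultdict
from typing import Dict, Set


def micro_f1(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1 over all (name, type) decisions in gold."""
    tp = fp = fn = 0
    for name, gold_types in gold.items():
        pred_types = pred.get(name, set())
        tp += len(gold_types & pred_types)
        fp += len(pred_types - gold_types)
        fn += len(gold_types - pred_types)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_num_types(gold: Dict[str, Set[str]],
                    pred: Dict[str, Set[str]],
                    min_group_size: int = 100) -> Dict[int, float]:
    """Group names by their number of gold types N and score each group.

    Only groups with more than min_group_size members are kept.
    """
    groups: Dict[int, Dict[str, Set[str]]] = defaultdict(dict)
    for name, gold_types in gold.items():
        groups[len(gold_types)][name] = gold_types
    return {
        n: micro_f1(members, pred)
        for n, members in sorted(groups.items())
        if len(members) > min_group_size
    }
```

Here `pred` would hold, for each test name, the set of types whose predicted probability exceeds the decision threshold of the classifier being analyzed.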

Conclusion
We proposed multi-label classification of word embeddings using the task of fine-grained typing of entity names. The dataset we built is a resource that is complementary to prior work in embedding evaluation: it is very large, its examples are multi-labeled with very fine-grained classes, and it allows the direct evaluation of embeddings without the need for context. We analyzed the performance of different embedding models on this dataset, showing differences in their performance as well as some of their limits in representing types accurately and completely.
More analysis and evaluation are necessary, but we believe that with this kind of dataset we can do much more than was previously possible with the small, manually built word similarity and categorization benchmarks.

Figure 2: Micro-F1 for names with different numbers of types.

Table 1: List of the 50 types in our FNT dataset.

Table 2: Some statistics (number of names; average number of types per name) for our name typing dataset.