Multi-Multi-View Learning: Multilingual and Multi-Representation Entity Typing

Accurate and complete knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing the accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider high-resource and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity's name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with fine-grained classes demonstrate the effectiveness of multiview learning. We release MVET, a large multiview (and, in particular, multilingual) entity typing dataset we created. Mono- and multilingual fine-grained entity typing systems can be evaluated on this dataset.


Introduction
Accurate and complete knowledge bases (KBs) are paramount in NLP. Entity typing, and in particular fine-grained entity typing, is an important component of KB completion with applications in NLP and knowledge engineering. Studies so far have been mostly for English (Yaghoobzadeh and Schütze, 2015), but also for Japanese (Suzuki et al., 2016).
We employ multiview learning for increasing accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we take high- and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity's name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view.
Views are defined as kinds of information about an instance that have three properties (Blum and Mitchell, 1998; Xu et al., 2013). (i) Sufficiency. Each view is sufficient for classification on its own. (ii) Compatibility. The target functions in all views predict the same labels for co-occurring features with high probability. (iii) Conditional independence. The views are conditionally independent given the class label.
As in most cases of multiview learning, these three properties are only approximately true for our problem. (i) Not every view is sufficient for every instance. While a name like "George Washington Bridge" is sufficient for typing the entity as "bridge", the name "Washington" is not sufficient for entity typing. (ii) Cases of incompatibility exist. For example, the "Bering Land Bridge" is not a bridge. (iii) Views have some degree of conditional dependence. For example, if a bridge is a viaduct, not a bridge proper, then the description of the bridge will contain more occurrences of the word "viaduct" than for proper bridges whose name does not contain the word "viaduct".
In summary, we make three main contributions. (i) We formalize entity typing as a multiview problem by introducing two metaviews, language and representation; each combination of instances of these two metaviews defines a distinct view. (ii) We show that this formalization is effective for entity typing as a key task in KB completion: multiview and crossview learning outperform singleview learning by a large margin, especially for rare entities and low-resource languages. (iii) We release MVET (Multiview Entity Typing), a large multiview and, in particular, multilingual dataset for entity typing. This dataset can be used for mono- and multilingual fine-grained entity typing evaluations. In contrast to prior work on entity typing based on Clueweb (a commercial corpus), all our data can be released publicly because it is based on Wikipedia.
Figure 1: Attention-based multiview learning. View-specific representations $v_j$ of the entity are transformed to a shared space and summed by attention weights $\alpha_j$ into the aggregated multiview representation $p$. A one-hidden-layer perceptron computes the output vector $\hat{y}$.

Multilingual Multi-Representation Entity Typing
We address the task of entity typing (Yaghoobzadeh and Schütze, 2015), i.e., assigning to a given entity one or more types from a set of types $T$. For example, Churchill is a politician and writer.
Our key idea is that we can tap two different information sources for entity typing, which we will refer to as metaviews: language and representation. For the language metaview, we consider $N$ languages (English, German, ...). For the representation metaview, we consider three representations: based on the entity's context distribution, based on its canonical name and based on its description. Each combination of language and representation defines a separate view of the entity, i.e., we have up to $3N$ views. The views are not completely independent of each other: what is written about an entity in English and German is correlated, and information derived from the entity name is correlated with its description; see discussion in §1. Still, each view contains information complementary to each other view.
For the context view of the representation metaview, we use entity embeddings (Yaghoobzadeh and Schütze, 2015): each mention of an entity in Wikipedia, identified using Wikipedia hyperlinks, is replaced by the entity's unique identifier. We can then run standard embedding learning. For the name view, we take the sum of the embeddings of the words of the entity name. The description view is based on the entity's Wikipedia page; see §3.
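The mention replacement step described above can be illustrated with a minimal sketch. It assumes the input is raw wikitext in which hyperlinks appear as `[[Target]]` or `[[Target|anchor text]]`; the `ENTITY/` prefix and the helper names `entity_id` and `replace_mentions` are hypothetical and not taken from the paper.

```python
import re

# Matches wikitext links of the form [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def entity_id(title: str) -> str:
    """Hypothetical ID scheme: a prefix plus the title with underscores."""
    return "ENTITY/" + title.strip().replace(" ", "_")


def replace_mentions(wikitext: str) -> str:
    """Replace every hyperlinked mention by the linked entity's identifier."""
    return WIKILINK.sub(lambda m: entity_id(m.group(1)), wikitext)


sentence = "[[Winston Churchill|Churchill]] was a [[politician]] ."
print(replace_mentions(sentence))
```

After this substitution, every entity occurs as a single token, so any standard word embedding trainer run on the rewritten text yields one vector per entity.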
We represent view $j$ of an entity $e$ as the vector or embedding $v_j \in \mathbb{R}^{d_j}$. We combine these embeddings into a multiview representation $p \in \mathbb{R}^d$ of entity $e$. As discussed above, each $v_j$ contributes potentially complementary information.
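After learning $p$, a one-hidden-layer perceptron computes the type predictions $\hat{y} \in \mathbb{R}^{|T|}$ from $p$. The cost function is binary cross entropy summed over types and training examples:
$$L = -\sum_i \sum_t \left[ y_{i,t} \log \hat{y}_{i,t} + (1 - y_{i,t}) \log (1 - \hat{y}_{i,t}) \right]$$
where $y_{i,t}$ and $\hat{y}_{i,t}$ are the gold label and the prediction for type $t$ of example $i$.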
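The perceptron and the cost function can be sketched as follows. The text does not fix the exact form of the perceptron; one form consistent with it is $\hat{y} = \sigma(W_{out}\, f(W_{in}\, p))$, where the sigmoid output $\sigma$, the hidden non-linearity $f = \tanh$, the absence of bias terms and the names `W_in` and `W_out` are all assumptions made for this illustration.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(p, W_in, W_out):
    """One hidden layer (tanh, an assumption), then one sigmoid per type."""
    hidden = [math.tanh(h) for h in matvec(W_in, p)]
    return [sigmoid(o) for o in matvec(W_out, hidden)]


def bce_loss(gold, pred, eps=1e-12):
    """Binary cross entropy summed over examples i and types t."""
    total = 0.0
    for y_row, yhat_row in zip(gold, pred):
        for y, yhat in zip(y_row, yhat_row):
            total -= y * math.log(yhat + eps) + (1 - y) * math.log(1 - yhat + eps)
    return total


# Toy setup: d = 2, hidden size 2, |T| = 2, all weights zero.
W_in = [[0.0, 0.0], [0.0, 0.0]]
W_out = [[0.0, 0.0], [0.0, 0.0]]
y_hat = predict([1.0, -1.0], W_in, W_out)
loss = bce_loss([[1, 0]], [y_hat])
```

Because each type gets its own sigmoid, an entity can receive several types at once, which matches the multi-label setting of the task.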
A simple and effective way of computing the representation $p$ of an entity is what we refer to as MULTIVIEW-CON: a concatenation of the $n$ view embeddings, followed by a non-linear transformation.
Concatenation may not be effective because some entities have pages in all Wikipedias and thus have 200 × 3 = 600 views whereas others occur in only one. Also, the views might differ in quality. Therefore, we consider an attention-based weighted average, MULTIVIEW-ATT, as an alternative to MULTIVIEW-CON. The embeddings $v_j$ live in different spaces, so we first transform them into a shared space, obtaining vectors $p_j \in \mathbb{R}^d$, using language specific matrices $W_j \in \mathbb{R}^{d \times d_j}$. Then, we compute the attention weights $\alpha_j$ from the $p_j$ using a vector $a \in \mathbb{R}^d$ that is trained to weight the vectors $p_j$. The MULTIVIEW-ATT representation is then defined as the attention-weighted sum:
$$p = \sum_j \alpha_j p_j$$
A schematic architecture is shown in Figure 1.
We also experiment with two alternatives. MULTIVIEW-AVG: We set all $\alpha_j = 1/n$, i.e., the entity representation is a simple average. MULTIVIEW-MAX: We apply per-dimension max pooling, $p_i = \max_j (p_j)_i$. The idea here is to capture the most significant features across views.
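The four ways of combining views can be compared in a short sketch. Several details are assumptions rather than statements from the paper: the projection is taken to be purely linear ($p_j = W_j v_j$), the attention weights are taken to be a softmax over the scores $a^\top p_j$, and tanh is used as the non-linearity of MULTIVIEW-CON. All function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def project(views, mats):
    """Map each view v_j into the shared space: p_j = W_j v_j (assumed linear)."""
    return [matvec(W, v) for W, v in zip(mats, views)]


def combine_att(ps, a):
    """MULTIVIEW-ATT: weighted sum with alpha = softmax(a . p_j) (assumed form)."""
    alphas = softmax([sum(ai * pi for ai, pi in zip(a, p)) for p in ps])
    dim = len(ps[0])
    return [sum(al * p[i] for al, p in zip(alphas, ps)) for i in range(dim)], alphas


def combine_avg(ps):
    """MULTIVIEW-AVG: all attention weights equal 1/n."""
    return [sum(p[i] for p in ps) / len(ps) for i in range(len(ps[0]))]


def combine_max(ps):
    """MULTIVIEW-MAX: per-dimension max pooling over the views."""
    return [max(p[i] for p in ps) for i in range(len(ps[0]))]


def combine_con(views, W):
    """MULTIVIEW-CON: concatenate all views, then tanh(W [v_1; ...; v_n])."""
    concat = [x for v in views for x in v]
    return [math.tanh(h) for h in matvec(W, concat)]


# Two toy views with different dimensionalities (d_1 = 2, d_2 = 3, d = 2).
views = [[1.0, 0.0], [0.0, 2.0, 0.0]]
mats = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
ps = project(views, mats)
p_att, alphas = combine_att(ps, a=[0.0, 0.0])
p_avg = combine_avg(ps)
p_max = combine_max(ps)
p_con = combine_con(views, [[0.0] * 5, [0.0] * 5])
```

With a zero attention vector, ATT reduces to AVG. A trained `a` lets the model shift weight between views. Note that CON needs a fixed number of input slots, while ATT, AVG and MAX operate on whatever list of views is available for an entity.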

Dataset and Experiments
In this section, we first introduce our new dataset and then describe our results.
Multiview entity typing (MVET) dataset. Wikipedia and Freebase are our sources for creation of MVET. We try to map each English Wikipedia article of an entity to Freebase. Freebase types are mapped to 113 FIGER tags (Ling and Weld, 2012). We use Wikipedia interlingual links to build multilingual datasets by identifying corresponding Wikipedia articles in non-English languages. So for each entity, we have the English article name as well as the names in other languages (if they exist) and FIGER types of the entity. We use these multilingual names and Wikipedias to build our representation views as described in §3.
We experiment with ten languages: English (EN), German (DE), Farsi (FA), Spanish (ES); and Arabic (AR), French (FR), Italian (IT), Polish (PL), Portuguese (PT), Russian (RU). The procedure described above gives us around 2M entities. We divide them into train (50%), dev (20%) and test (30%) and, for efficient training, sample them stratified by type to ensure enough entities per type. The final dataset used in our experiments contains about 74k / 35k / 50k train / dev / test entities and 102 FIGER types. Dev is used to optimize model hyperparameters. Appendix A gives some more statistics for MVET.
Learning representation views. We refer to the three instances of the representation metaview (see §1) as CTXT (contexts), NAME (name) and DESC (description).
For learning CTXT embeddings, we train WANG2VEC (Ling et al., 2015) on Wikipedia after having replaced hyperlinked mentions of an entity with its ID. NAME is derived from publicly available 300-dimensional fastText (Bojanowski et al., 2017) embeddings. We use the average of the embeddings of the words in a name as its NAME embedding. If a word does not have a fastText embedding, we apply the fastText model to compute it, so there are no unknown words in our dataset. For DESC, we extract the keywords (using tf-idf) of the first paragraph of the Wikipedia article of an entity. The DESC embedding is the average fastText embedding of the keywords.
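The NAME and DESC views can be sketched as follows. This is an illustration under stated assumptions: a toy dictionary `toy_emb` stands in for the fastText vectors (so out-of-vocabulary words are not handled here), the tf-idf variant with smoothed idf is one of several common definitions, and the number of keywords `k` is a hypothetical parameter that the text does not specify.

```python
import math
from collections import Counter


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def name_embedding(name, emb):
    """NAME view: average of the embeddings of the words in the name."""
    return average([emb[w] for w in name.lower().split()])


def tfidf_keywords(doc, corpus, k):
    """Return the k tokens of `doc` with the highest tf-idf score."""
    tf = Counter(doc)
    n_docs = len(corpus)

    def idf(word):
        df = sum(1 for d in corpus if word in d)
        return math.log((1 + n_docs) / (1 + df))  # smoothed idf (assumption)

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda w: (-tf[w] * idf(w), w))
    return ranked[:k]


def desc_embedding(first_paragraph, corpus, emb, k):
    """DESC view: average embedding of the tf-idf keywords."""
    return average([emb[w] for w in tfidf_keywords(first_paragraph, corpus, k)])


toy_emb = {"washington": [1.0, 0.0], "bridge": [0.0, 1.0], "spans": [1.0, 1.0]}
corpus = [
    ["the", "bridge", "spans", "the", "river"],
    ["the", "politician", "wrote", "the", "book"],
    ["the", "river", "is", "long"],
]
name_vec = name_embedding("Washington Bridge", toy_emb)
keywords = tfidf_keywords(corpus[0], corpus, k=2)
desc_vec = desc_embedding(corpus[0], corpus, toy_emb, k=2)
```

The frequent word "the" receives an idf of zero and is never selected, which is the reason for filtering the description through tf-idf before averaging.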
To reiterate the complementarity of the three representation views: names are ambiguous, but if we use the names of an entity in different languages, we can mitigate this ambiguity. E.g., "Apple" can refer to an entity or a fruit in English, but only to an entity in French. Similarly, the description of an entity is a high quality textual source to extract information from. The simplest case of complementarity is that not all views are available. An entity can be completely missing from one of the languages; it may not have a description because only a stub is provided; etc.

Results
Evaluation metric. Following prior work in entity typing (Yaghoobzadeh and Schütze, 2017), we evaluate by micro $F_1$, a global summary score of all system predictions. Entity frequency is an important variable, so we report results for tail (frequency <10, n=35,533), head (frequency >100, n=2,638) and all entities.
Table 1 shows results for entity typing on our dataset, MVET. We start with baseline results on MVET (line 0) for FIGMENT (Yaghoobzadeh and Schütze, 2017), the state-of-the-art system in entity typing. FIGMENT is equivalent to our MULTIVIEW-CON model with only English-CTXT, -NAME and -DESC representations.
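As a reminder of how micro $F_1$ pools decisions over all entities and types, here is a minimal sketch. It assumes that predictions have already been turned into sets of types per entity (how scores are thresholded is not covered here); the bucket label "mid" for frequencies between 10 and 100 is a hypothetical name, since the text only names tail and head.

```python
def micro_f1(gold, pred):
    """Micro F1: pool true/false positives and false negatives over all entities."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # types predicted and correct
        fp += len(p - g)  # types predicted but wrong
        fn += len(g - p)  # gold types that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def frequency_bucket(freq):
    """Tail: frequency < 10; head: frequency > 100; otherwise 'mid' (our label)."""
    if freq < 10:
        return "tail"
    if freq > 100:
        return "head"
    return "mid"


gold = [{"person", "politician"}, {"bridge"}]
pred = [{"person"}, {"bridge", "location"}]
score = micro_f1(gold, pred)
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is micro $F_1$.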
Lines 1-4, 9-12, 17-20 are singleview results, e.g., $F_1$ for tail is 62.0 for the English-CTXT view. Lines 5-8, 13-16, 21-24 combine the four languages; so these are multiview results for the language metaview. All four multiview models are better than the corresponding singleview models in the same block. Lines 25-28 show results for the combination of the two metaviews; a total of twelve views are combined (four languages times three representations). The multi-multi-view models on lines 25-28 outperform all other results. Comparing line 25 and FIGMENT (line 0), adding representations from three more languages results in improvements of 0.5%, 0.4% and 0.9% for all, tail and head entities, respectively. Using ATT (line 26) improves the results further, especially for tail entities. These results confirm the effectiveness of our contributions: adding language as a metaview, and using ATT instead of CON to combine multiple views.
Lines 29-32 show results for using NAME representations in six additional languages: AR, FR, IT, PL, PT, RU. $F_1$ is up to more than one percent better than on lines 13-16. This demonstrates the benefit of using more languages, although the effect is limited since only the long tail of entities can improve.
Lines 33-36 vs. lines 25-28 make the same comparison (NAME4 vs. NAME10) for multi-multi-view. With ATT, we get a small improvement for tail, but not for head (line 34 vs. line 26). Apparently, considering more languages adds noise, and this hurts the results for head entities.
A general tendency is that ATT performs better compared to MAX, AVG and CON as the number of views increases (lines 30 and 34) and so the average number of views without information (i.e., missing views) for an entity increases. In contrast to MAX, ATT can combine different views. In contrast to CON and AVG, ATT can ignore some of them based on low attention weights.

Analysis: Crossview Learning
To analyze whether sharing parameters across views is important, Table 2 compares (i) SINGLE: twelve different singleview models with (ii) CROSS: a single crossview model that is trained on a training set that combines the twelve individual singleview training sets. For CROSS, we use Eq. 3 with view specific transformation matrices, mapping views in different spaces into a common space, and then Eq. 1 with shared parameters across views. The number of parameters in application is the same for SINGLE and CROSS. The results demonstrate that training one model with common parameters over all inputs helps classification for views that are not high-resource.
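The CROSS setup can be pictured with a small sketch: the per-view training sets are pooled, each example keeps a tag naming its view, the view-specific matrix maps it into the common space, and one shared classifier is applied to the result. The view names, the toy data, the linear projection and the helper names are assumptions for illustration; the identity function stands in for the shared perceptron.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pool_training_sets(single_view_sets):
    """Merge per-view training sets into one list of (view, vector, types)."""
    pooled = []
    for view, examples in single_view_sets.items():
        for vector, types in examples:
            pooled.append((view, vector, types))
    return pooled


def cross_forward(view, vector, view_mats, shared_classifier):
    """View-specific projection into the common space, then shared parameters."""
    shared = matvec(view_mats[view], vector)
    return shared_classifier(shared)


# Toy data: two views with different input dimensionalities.
single_view_sets = {
    "EN-NAME": [([1.0, 0.0], {"bridge"})],
    "DE-CTXT": [([0.0, 1.0, 0.0], {"bridge"}), ([1.0, 0.0, 0.0], {"person"})],
}
view_mats = {
    "EN-NAME": [[1.0, 0.0], [0.0, 1.0]],
    "DE-CTXT": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
}
pooled = pool_training_sets(single_view_sets)
outputs = [cross_forward(v, x, view_mats, lambda p: p) for v, x, _ in pooled]
```

At application time each example still passes through one projection matrix and one classifier, which is why the parameter count in application matches that of a singleview model.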
Multiview learning exploits the complementarity of views: if an entity's type cannot be inferred from one view, then other views may have the required information. Table 2 shows that using multiple views has a second beneficial effect: even if applied to a single view, a model trained on multiple views performs better. Kan et al. (2016)'s image recognition and Pappas and Popescu-Belis (2017)'s document classification findings are similar. Thus, not only does the increased amount of available information boost performance in the multiview setup, but also we can enable crossview transfer and learn a model that makes better predictions even if information is only available from a single view.

Related Work
Entity and mention typing. In this work, we assume that a predefined set of fine-grained types is given. Entity typing, i.e., predicting types of a knowledge base entity (Neelakantan and Chang, 2015; Yaghoobzadeh and Schütze, 2015), is the focus of this paper. Mention typing, i.e., predicting types of a mention in a particular context (Ling and Weld, 2012; Rabinovich and Klein, 2017; Shimaoka et al., 2017; Murty et al., 2018), is a related task. Mention typing models can be evaluated for entity typing when aggregating their predictions (Yaghoobzadeh and Schütze, 2015; Yaghoobzadeh et al., 2018). Therefore, our public and large entity typing dataset, MVET, can be used as an alternative to the small manually annotated mention typing datasets like the commonly used FIGER (Ling and Weld, 2012). We leave this to future work.
Multilingual entity typing. We build a multilingual dataset and models for entity typing. Most work on entity typing has been monolingual; e.g., Yaghoobzadeh and Schütze (2015) (English) and Suzuki et al. (2016) (Japanese). There is work on mention typing (van Erp and Vossen, 2017). Lin et al. (2017) use mono- and crosslingual attention for relation extraction. Crosslingual entity linking is an important related task, where the task is to link mentions of entities in multilingual text to a knowledge base (Tsai and Roth, 2016). Many entities are not sufficiently annotated in Wikipedia, and therefore crosslingual entity linking is necessary to learn informative context representations from multiple languages.
Multi-representation of entities. Aggregating information from multiple sources to learn entity representations has been explored for entity typing (Yaghoobzadeh et al., 2018), entity linking (Gupta et al., 2017) and relation extraction (Wang and Li, 2016). Here, we add language as a new "dimension" to multi-representations: each language contributes a different CTXT, NAME and DESC representation.
Our multilingual and multi-representation models are examples of multiview learning. Xu et al. (2013) and Zhao et al. (2017) review the literature on multiview learning. Amini et al. (2009) cast multilingual text classification, a task related to entity typing, as multiview learning. Qu et al. (2017) address node classification and link prediction by attention-based multiview representations of graph nodes. We also adopt a similar approach in our multiview representations for entity typing.

Conclusion
We formalized entity typing as a multiview problem by introducing two metaviews, language and representation; each combination of their instances defines a distinct view. Our experiments showed the effectiveness of this formalization by outperforming the state-of-the-art model. Our basic idea of metaview learning is general and is applicable to related tasks, e.g., to relation extraction. We release a large and public multiview and, in particular, multilingual entity typing dataset.