Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real-world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io/.


Introduction
Entity embeddings, i.e., vector representations of entities in a knowledge base (KB), have played a vital role in many recent models in natural language processing (NLP). These embeddings provide rich information (or knowledge) regarding the entities in a KB using fixed continuous vectors. They have been shown to be beneficial not only for tasks directly related to entities (e.g., entity linking (Yamada et al., 2016; Ganea and Hofmann, 2017)) but also for general NLP tasks (e.g., text classification (Yamada and Shindo, 2019), question answering (Poerner et al., 2019)). Notably, recent studies have also shown that these embeddings can be used to enhance the performance of state-of-the-art contextualized word embeddings (i.e., BERT (Devlin et al., 2019)) on downstream tasks (Zhang et al., 2019; Peters et al., 2019; Poerner et al., 2019).
In this work, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities easily and efficiently from Wikipedia. Due to its scale, availability in a variety of languages, and constantly evolving nature, Wikipedia is commonly used as a KB for learning entity embeddings. Our proposed tool jointly learns the embeddings of words and entities, and places semantically similar words and entities close to one another in the vector space. In particular, our tool implements the word-based skip-gram model (Mikolov et al., 2013a,b) to learn word embeddings, and its extensions proposed in Yamada et al. (2016) to learn entity embeddings. Wikipedia2Vec enables users to train embeddings by simply running a single command with a Wikipedia dump file as an input. We have highly optimized our implementation, making our implementation of the skip-gram model faster than the well-established implementations available in gensim (Řehůřek and Sojka, 2010) and fastText (Bojanowski et al., 2017).
Experimental results demonstrated that our tool achieves enhanced quality compared to existing tools on several standard benchmarks. Notably, our tool achieved a state-of-the-art result on the entity relatedness task based on the KORE dataset. Due to its effectiveness and efficiency, our tool has been successfully used in various downstream NLP tasks, including entity linking (Yamada et al., 2016; Eshel et al., 2017; Chen et al., 2019), named entity recognition (Sato et al., 2017; Lara-Clares and Garcia-Serrano, 2019), question answering (Yamada et al., 2018b; Poerner et al., 2019), knowledge graph completion (Shah et al., 2019), paraphrase detection (Duong et al., 2019), fake news detection, and text classification (Yamada and Shindo, 2019).
We also introduce a web-based demonstration of our tool that visualizes the embeddings by plotting them onto a two- or three-dimensional space using dimensionality reduction algorithms. The demonstration also allows users to explore the embeddings by querying for similar words and entities.
The source code has been tested on Linux, Windows, and macOS, and released under the Apache License 2.0. We also release the pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish).
The main contributions of this paper are summarized as follows:
• We present Wikipedia2Vec, a tool for learning the embeddings of words and entities easily and efficiently from Wikipedia.
• Our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and performed competitively on various standard benchmark datasets.
• We present a web-based demonstration that allows users to explore the learned embeddings.
• We publicize the code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.

Related Work
Many studies have recently proposed methods to learn entity embeddings from a KB (Hu et al., 2015; Li et al., 2016; Tsai and Roth, 2016; Yamada et al., 2016, 2018a; Cao et al., 2017; Ganea and Hofmann, 2017). These embeddings are typically based on conventional word embedding models (e.g., skip-gram (Mikolov et al., 2013a)) trained with data retrieved from a KB. For example, Ristoski et al. (2018) proposed RDF2Vec, which learns entity embeddings using the skip-gram model with inputs generated by random walks over large knowledge graphs such as Wikidata and DBpedia. Furthermore, a simple method that has been widely used in various studies (Yaghoobzadeh and Schutze, 2015; Yamada et al., 2016, 2018a; Al-Badrashiny et al., 2017; Suzuki et al., 2018) trains entity embeddings by replacing the entity annotations in an input corpus with the unique identifiers of their referent entities, and feeding the corpus into a word embedding model (e.g., skip-gram); Wiki2vec is a well-known implementation of this method that also uses the neighboring entities connected by internal hyperlinks of Wikipedia as additional contexts to train the model. Note that we used RDF2Vec and Wiki2vec as baselines in our experiments, and achieved enhanced empirical performance over these tools on the KORE dataset. Additionally, various relational embedding models have been proposed (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015) that aim to learn entity representations that are particularly effective for knowledge graph completion tasks.

Overview
Wikipedia2Vec is an easy-to-use, optimized tool for learning embeddings from Wikipedia. The tool can be installed using Python's pip tool (pip install wikipedia2vec). Embeddings can be learned by simply running the wikipedia2vec train command with a Wikipedia dump file as an argument. Figure 1 shows the shell commands that download the latest English Wikipedia dump file and train the embeddings on this dump using the default hyper-parameters. Furthermore, the learned embeddings are easy to use. Figure 2 shows example Python code that loads the learned embedding file and obtains the embeddings of the entity Scarlett Johansson and the word tokyo, as well as the most similar words and entities to the entity Python.
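Since Figures 1 and 2 are not reproduced here, the following minimal sketch illustrates the workflow they describe; the dump URL and model file name are placeholders, and the method names should be checked against the project documentation.

# Shell commands corresponding to Figure 1: download the latest English
# Wikipedia dump and train embeddings with the default hyper-parameters.
#   wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
#   wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

from wikipedia2vec import Wikipedia2Vec

# Load the model file produced by the `wikipedia2vec train` command.
wiki2vec = Wikipedia2Vec.load('MODEL_FILE')

# Obtain the embeddings of an entity and a word.
entity_vec = wiki2vec.get_entity_vector('Scarlett Johansson')
word_vec = wiki2vec.get_word_vector('tokyo')

# Retrieve the five items (words or entities) most similar to the entity Python.
neighbors = wiki2vec.most_similar(wiki2vec.get_entity('Python'), 5)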

[Figure 3: Contexts used by the sub-models. The neighboring words of each word are used as contexts (word-based skip-gram model), and the neighboring words of a hyperlink pointing to an entity are used as contexts (anchor context model).]

Model
Wikipedia2Vec implements the conventional skip-gram model (Mikolov et al., 2013a,b) and its extensions proposed in Yamada et al. (2016) to map words and entities into the same d-dimensional vector space. The skip-gram model is a neural network model whose training objective is to find embeddings that are useful for predicting context items (i.e., words or entities in this paper) given each item.
The loss function of the model is defined as:

$$\mathcal{L} = -\sum_{o_i \in O} \sum_{o_c \in C_{o_i}} \log P(o_c \mid o_i), \qquad (1)$$

where $O$ is the set of all items (i.e., words or entities), $C_{o_i}$ is the set of context items of item $o_i$, and the conditional probability $P(o_c \mid o_i)$ is defined using the following softmax function:

$$P(o_c \mid o_i) = \frac{\exp(\mathbf{V}_{o_i}^{\top} \mathbf{U}_{o_c})}{\sum_{o \in O} \exp(\mathbf{V}_{o_i}^{\top} \mathbf{U}_{o})}, \qquad (2)$$

where $\mathbf{V}_o \in \mathbb{R}^d$ and $\mathbf{U}_o \in \mathbb{R}^d$ denote the embeddings of item $o$ in the embedding matrices $\mathbf{V}$ and $\mathbf{U}$, respectively. Our tool learns the embeddings by jointly optimizing the three skip-gram-based sub-models described below (see also Figure 3). Note that the matrices $\mathbf{V}$ and $\mathbf{U}$ contain the embeddings of both words and entities.
Word-based Skip-gram Model Given each word in a Wikipedia page, this model learns word embeddings by predicting the neighboring words of the given word. Formally, given a sequence of words $w_1, w_2, \ldots, w_N$, the loss function of this model is defined as follows:

$$\mathcal{L}_w = -\sum_{i=1}^{N} \sum_{-c \le j \le c,\, j \ne 0} \log P(w_{i+j} \mid w_i), \qquad (3)$$

where $c$ is the size of the context window, and $P(w_{i+j} \mid w_i)$ is computed based on Eq. (2).
Anchor Context Model This model aims to place similar words and entities close to one another in the vector space using hyperlinks and their neighboring words in Wikipedia. From a given Wikipedia page, the model extracts the referent entity and the surrounding words (i.e., the previous and next $c$ words) of each hyperlink in the page, and learns embeddings by predicting the surrounding words given each entity. The loss function of this model is defined as follows:

$$\mathcal{L}_a = -\sum_{(e_i, Q) \in A} \sum_{w_c \in Q} \log P(w_c \mid e_i), \qquad (4)$$

where $A$ denotes the set of all hyperlinks in Wikipedia, each consisting of a referent entity $e_i$ paired with a set of surrounding words $Q$, and $P(w_c \mid e_i)$ is computed based on Eq. (2).
Link Graph Model This model aims to learn entity embeddings by predicting the neighboring entities of each entity in Wikipedia's link graph, an undirected graph whose nodes are entities and whose edges represent the presence of hyperlinks between entities. We create an edge between a pair of entities if the page of one entity has a hyperlink to that of the other entity, or if both pages link to each other. The loss function of this model is defined as:

$$\mathcal{L}_e = -\sum_{e_i \in E} \sum_{e_o \in C_{e_i}} \log P(e_o \mid e_i), \qquad (5)$$

where $E$ is the set of all entities in the vocabulary, $C_{e_i}$ is the set of neighboring entities of entity $e_i$ in the link graph, and $P(e_o \mid e_i)$ is computed by Eq. (2).

Finally, we define the loss function of our model by linearly combining the three loss functions described above:

$$\mathcal{L} = \mathcal{L}_w + \mathcal{L}_a + \mathcal{L}_e. \qquad (6)$$

The training is performed by minimizing this loss function using stochastic gradient descent. We use negative sampling (Mikolov et al., 2013b) to convert the softmax function (Eq. (2)) into a computationally feasible objective. The resulting matrix $\mathbf{V}$ is used as the learned embeddings.
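To make the role of negative sampling concrete, the following is a minimal Python sketch of a single stochastic gradient step under the negative-sampling objective; the function and variable names are ours, and this is a simplification of the actual optimized Cython implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_negative_sampling_step(V, U, target, context, negatives, lr=0.025):
    # V, U: embedding matrices of shape (vocab_size, dim); `target` and
    # `context` are row indices of an observed (item, context item) pair,
    # and `negatives` holds indices of randomly sampled negative items.
    grad_target = np.zeros_like(V[target])
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        score = sigmoid(np.dot(V[target], U[idx]))
        g = lr * (label - score)
        grad_target += g * U[idx]
        U[idx] += g * V[target]   # update the context embedding
    V[target] += grad_target      # update the target embedding

Updates of this form are applied to (target, context) pairs drawn from all three sub-models: word/neighboring-word pairs, entity/surrounding-word pairs from hyperlinks, and entity/neighboring-entity pairs from the link graph.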

Automatic Generation of Hyperlinks
Because Wikipedia instructs its contributors to create a hyperlink only at the first occurrence of the entity name on a page, many entity names do not appear as hyperlinks. This is problematic for our anchor context model because it uses hyperlinks as a source to learn the embeddings.
To address this problem, our tool automatically generates hyperlinks using a mention-entity dictionary that maps entity names (e.g., "apple") to their possible referent entities (e.g., Apple Inc. or Apple (food)) (see Section 4 for details). Our tool extracts all words and phrases from a Wikipedia page and converts each into a hyperlink to an entity if either the entity is referred to by a hyperlink on the same page, or there is only one referent entity associated with the name in the dictionary.
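This heuristic can be summarized by the following minimal sketch; the data structures and function name are ours, chosen for illustration, and the actual implementation operates on the optimized components described in Section 4.

def generate_hyperlinks(page_text_names, page_link_entities, mention_dict):
    # page_text_names: words and phrases extracted from a Wikipedia page
    # page_link_entities: entities already referred to by hyperlinks on the page
    # mention_dict: maps an entity name to its possible referent entities
    generated = []
    for name in page_text_names:
        candidates = mention_dict.get(name, [])
        # Convert the name into a hyperlink if its referent entity is already
        # linked on the same page...
        linked = [e for e in candidates if e in page_link_entities]
        if len(linked) == 1:
            generated.append((name, linked[0]))
        # ...or if the name is unambiguous in the dictionary.
        elif len(candidates) == 1:
            generated.append((name, candidates[0]))
    return generated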

Implementation
Our tool is implemented in Python and most of its code is compiled into C++ using Cython (Behnel et al., 2011) to optimize the run-time performance.
As described in Section 3.1, our link graph and anchor context models are based on the hyperlinks in Wikipedia. Because Wikipedia contains numerous hyperlinks, it is challenging to use them efficiently. To address this, we introduce two optimized components, a link graph matrix and a mention-entity dictionary, that are used during training.
Link Graph Matrix During training, our link graph model needs to obtain numerous neighboring entities of an entity in the large link graph of Wikipedia. To reduce latency, this component stores the entire graph in memory as a binary sparse matrix in the compressed sparse row (CSR) format, in which the rows and columns represent entities and the values represent the presence of hyperlinks between the corresponding entity pairs. Because the size of this matrix is typically small (less than 500 megabytes for English Wikipedia with our default hyper-parameter settings), it can easily be stored in memory. Note that given a row index in the CSR matrix, the time complexity of obtaining its non-zero column indices (corresponding to the neighboring entities of the entity that corresponds to the row index) is O(1).
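For illustration, the following sketch shows the same lookup pattern using scipy's CSR matrix; the tool itself builds this structure in Cython, so the code is only indicative.

import numpy as np
from scipy.sparse import csr_matrix

# A toy undirected link graph over four entities; a 1 marks a hyperlink
# between the corresponding entity pair.
dense = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=np.int8)
graph = csr_matrix(dense)

# In CSR format, the non-zero column indices of row i are stored contiguously
# in graph.indices[graph.indptr[i]:graph.indptr[i + 1]], so the neighbors of
# an entity are located without scanning the matrix.
entity_id = 0
neighbors = graph.indices[graph.indptr[entity_id]:graph.indptr[entity_id + 1]]
print(neighbors)  # -> [1 2]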
Mention-entity Dictionary A mention-entity dictionary is used to generate the hyperlinks described in Section 3.2. The dictionary maps entity names to their possible referent entities, and is created from the names and referent entities of all hyperlinks in Wikipedia. Our tool extracts from each Wikipedia page all words and phrases that are included in the dictionary, which contains a large number of entity names. To implement this efficiently, we use the Aho-Corasick algorithm, an efficient string search algorithm based on a finite state machine constructed from all entity names. After detecting the words and phrases in the dictionary, our tool converts them to hyperlinks based on the heuristics described in Section 3.2.
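As an illustration of this matching step, the sketch below uses the pyahocorasick package; this library choice is an assumption made for the example, and the tool's internal implementation may differ.

import ahocorasick  # pip install pyahocorasick (illustrative choice)

# Build the automaton once from all entity names in the mention-entity dictionary.
automaton = ahocorasick.Automaton()
for name in ['apple', 'apple inc.', 'new york']:
    automaton.add_word(name, name)
automaton.make_automaton()

# Scan the page text in a single pass; every dictionary name is reported
# together with the position where it ends.
text = 'apple opened a new store in new york'
for end_pos, name in automaton.iter(text):
    start_pos = end_pos - len(name) + 1
    print(start_pos, name)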
The embeddings are trained by simultaneously iterating over pages in Wikipedia and entities in the link graph in a random order. The text and hyperlinks in each page are extracted using the mwparserfromhell MediaWiki parser. We do not use semi-structured data such as tables and infoboxes. We also generate hyperlinks using the mention-entity dictionary. We store the embeddings as a float matrix in shared memory and update it using multiple processes. The linear algebraic operations required to learn the embeddings are implemented using C functions of the Basic Linear Algebra Subprograms (BLAS).
Additionally, our tool uses a tokenizer to detect words in a Wikipedia page. The following four tokenizers are currently implemented: (1) the multi-lingual ICU tokenizer, which implements the Unicode text segmentation algorithm (Davis, 2019), (2) a simple rule-based tokenizer that splits text on white space characters, (3) the Jieba tokenizer for Chinese, and (4) the MeCab tokenizer for Japanese and Korean.

Experiments
We conducted experiments to compare the quality and efficiency of our tool with those of existing tools. To evaluate the quality of the entity embeddings, we used the KORE entity relatedness dataset (Hoffart et al., 2012). The dataset consists of 21 entities, and each entity has 20 related entities with relatedness scores assessed by humans. Following past work, we reported Spearman's rank correlation coefficient between the gold scores and the cosine similarity between the entity embeddings. We used two popular entity embedding tools, RDF2Vec (Ristoski et al., 2018) and Wiki2vec, as baselines.
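This evaluation protocol amounts to the following sketch; the function and argument names are ours, introduced for illustration.

import numpy as np
from scipy.stats import spearmanr

def kore_correlation(embed, seed_entity, related_entities, gold_scores):
    # embed: maps an entity name to its embedding vector
    # related_entities: the 20 entities related to the seed entity
    # gold_scores: human-assessed relatedness scores for those entities
    seed = embed[seed_entity]
    cosines = [np.dot(seed, embed[e]) /
               (np.linalg.norm(seed) * np.linalg.norm(embed[e]))
               for e in related_entities]
    return spearmanr(gold_scores, cosines).correlation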
We also evaluated the quality of the word embeddings on two standard tasks: (1) a word analogy task using the semantic subset (SEM) and syntactic subset (SYN) of the Google Word Analogy dataset (Mikolov et al., 2013a), and (2) a word similarity task using two standard datasets, namely SimLex-999 (SL) (Hill et al., 2015) and WordSim-353 (WS) (Finkelstein et al., 2002). Following past work, we reported the accuracy for the word analogy task, and Spearman's rank correlation coefficient between the gold scores and the cosine similarity between the word embeddings for the word similarity task. As baselines for these tasks, we used the skip-gram model (Mikolov et al., 2013a) implemented in the gensim library 3.6.0 (Řehůřek and Sojka, 2010) and the extended skip-gram model implemented in the fastText tool 0.1.0 (Bojanowski et al., 2017). We used WikiExtractor (https://github.com/attardi/wikiextractor) to create the training corpus for the baselines. To the extent possible, we used the same hyper-parameters to train our models and the baselines (dimension size = 500, window = 5, negative = 5, iteration = 5). We also reported the time required for training with our tool and the baseline word embedding tools. Note that the RDF2Vec and Wiki2vec tools are implemented using gensim.
We conducted experiments using Python 3.6 and OpenBLAS 0.3.3 installed on a c5d.9xlarge instance with 36 CPU cores deployed on Amazon Web Services. To train our models and the baseline word embedding models, we used the April 2018 version of the English Wikipedia dump. Table 1 shows the results of our models and the baseline entity embedding models on the KORE dataset; w/o link graph model and w/o hyperlink generation are the results of ablation studies disabling the link graph model and the automatic generation of hyperlinks, respectively.

Results
Our model successfully outperformed the RDF2Vec and Wiki2vec models and achieved a state-of-the-art result on the KORE dataset. The results also indicate that the link graph model and the automatic generation of hyperlinks improved the performance on the KORE dataset. Table 2 shows the results of our models and the baseline word embedding models on the word analogy and word similarity datasets. We also tested the performance of the word-based skip-gram model implemented in our tool by disabling the link graph and anchor context models.
Our model performed better than the baseline word embedding models on the SEM dataset, as well as on both word similarity datasets. This demonstrates that the semantic signals of entities provided by the link graph and anchor context models are beneficial for improving the quality of word embeddings. However, the automatic generation of hyperlinks did not generally contribute to the performance on these datasets.
Our implementation of the word-based skip-gram model was substantially faster than gensim and fastText. Furthermore, the training time of our full model was comparable to that of the baseline word embedding models.

Interactive Demonstration
We developed a web-based interactive demonstration that enables users to explore the embeddings of words and entities learned by our proposed tool (see Figure 4). The demonstration enables users to visualize the embeddings in a two- or three-dimensional space using three dimensionality reduction algorithms, namely t-distributed stochastic neighbor embedding (t-SNE) (Maaten and Hinton, 2008), uniform manifold approximation and projection (UMAP) (McInnes et al., 2018), and principal component analysis (PCA). Users can move around the visualized embedding space by dragging and zooming with the mouse. Moreover, the demonstration allows users to explore the embeddings by querying for the similar items (words or entities) of an arbitrary item.
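The projection step underlying the visualization can be sketched as follows, using scikit-learn's t-SNE as a stand-in; the demonstration itself builds on the TensorFlow Embedding Projector, so this code is only illustrative.

import numpy as np
from sklearn.manifold import TSNE

# embeddings: an (n_items, d) matrix of word and entity vectors to visualize;
# random placeholder data is used here.
embeddings = np.random.rand(200, 100).astype(np.float32)

# Project the d-dimensional vectors onto two dimensions for plotting; the
# demonstration offers t-SNE, UMAP, and PCA as alternatives.
coords = TSNE(n_components=2, metric='cosine').fit_transform(embeddings)
print(coords.shape)  # -> (200, 2)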
We used the pretrained embeddings for the 12 languages released with this paper as the target embeddings. We also provided the English embeddings trained without the link graph model so that users can qualitatively investigate how the link graph model affects the resulting embeddings.
Our demonstration was developed by extending the TensorFlow Embedding Projector. The demonstration is available at https://wikipedia2vec.github.io/demo.

Use Cases
The embeddings learned using our proposed tool have already been used effectively in various recent studies. Poerner et al. (2019) recently demonstrated that combining BERT with the entity embeddings trained by our tool outperforms both BERT and knowledge-enhanced contextualized word embeddings (i.e., ERNIE (Zhang et al., 2019)) on unsupervised question answering and relation classification tasks, without any computationally expensive additional pretraining of BERT. Yamada et al. (2018b) developed a neural network-based question answering system based on our tool, and won a competition held at the NIPS 2017 conference. Sato et al. (2017), Chen et al. (2019), and Yamada and Shindo (2019) achieved state-of-the-art results on named entity recognition, entity linking, and text classification tasks, respectively, based on the embeddings learned by our tool. Furthermore, Papalampidi et al. (2019) proposed a neural network model for analyzing the plot structure of movies using the entity embeddings learned by our tool. Other examples include entity linking (Yamada et al., 2016; Eshel et al., 2017), named entity recognition (Lara-Clares and Garcia-Serrano, 2019), paraphrase detection (Duong et al., 2019), fake news detection, and knowledge graph completion (Shah et al., 2019).

Conclusions
In this paper, we presented Wikipedia2Vec, an open-source tool for learning the embeddings of words and entities easily and efficiently from Wikipedia. Our experiments demonstrated the superiority of the proposed tool in terms of the quality of the embeddings and the efficiency of the training compared to existing tools. Furthermore, our tool has been used effectively in many recent state-of-the-art models, which indicates its effectiveness on downstream tasks. We also introduced a web-based interactive demonstration that enables users to explore the learned embeddings. The source code and the pretrained embeddings for 12 languages are released with this paper.