Neural Attentive Bag-of-Entities Model for Text Classification

This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for text classification. We combine simple, high-recall, dictionary-based entity detection with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. Our model achieved state-of-the-art results on all datasets. The source code of the proposed model is available online at https://github.com/wikipedia2vec/wikipedia2vec.


Introduction
Text classification is an important task, and its applications span a wide range of activities such as topic classification, spam detection, and sentiment classification. Recent studies showed that models based on neural networks can outperform conventional models (e.g., naïve Bayes) on text classification tasks (Kim, 2014; Iyyer et al., 2015; Tang et al., 2015; Dai and Le, 2015; Jin et al., 2016; Joulin et al., 2017; Shen et al., 2018). Typical neural network-based text classification models are based on words. They use the words in the target documents as inputs, map the words into continuous vectors (embeddings), and capture the semantics in documents by applying compositional functions to the word embeddings, such as averaging or summation, convolutional neural networks (CNN), and recurrent neural networks (RNN).
Apart from the aforementioned approaches, past studies attempted to use entities in a knowledge base (KB) (e.g., Wikipedia) to capture the semantics in documents. These models typically represent a document by using a set of entities (or bag of entities) relevant to the document (Gabrilovich and Markovitch, 2006, 2007; Xiong et al., 2016). The main benefit of using entities instead of words is that, unlike words, entities provide unambiguous semantic signals because they are uniquely identified in a KB. One key issue here is how to associate a document with its relevant entities. An existing straightforward approach (Xiong et al., 2016) involves creating a set of relevant entities using an entity linking system to detect and disambiguate the names of entities in a document. However, this approach is problematic because (1) entity linking systems produce disambiguation errors (Cornolti et al., 2013), and (2) entities appearing in a document are not necessarily relevant to the given document (Gamon et al., 2013; Dunietz and Gillick, 2014).
This study proposes the Neural Attentive Bag-of-Entities (NABoE) model, which is a neural network model that addresses the text classification problem by modeling the semantics in the target documents using entities in the KB. For each entity name in a document (e.g., "Apple"), our model first detects entities that may be referred to by this name (e.g., Apple Inc., Apple (food)), and then represents the document using the weighted average of the embeddings of these entities. The weights are computed using a novel neural attention mechanism that enables the model to focus on a small subset of the entities that are less ambiguous in meaning and more relevant to the document. In other words, the attention mechanism is designed to compute weights by jointly addressing entity linking and entity salience detection (Gamon et al., 2013; Dunietz and Gillick, 2014) tasks. Furthermore, the attention mechanism improves the interpretability of the model because it enables us to inspect the small number of entities that strongly affect the classification decisions.
We validate the effectiveness of our proposed model by addressing two important natural language tasks: a text classification task using two standard datasets (i.e., the 20 Newsgroups and R8 datasets), and a factoid question answering task based on a popular dataset derived from the quiz bowl trivia quiz game. Our model achieved state-of-the-art results on both tasks. The source code of the proposed model is available online at https://github.com/wikipedia2vec/wikipedia2vec.

Our Approach
Given a document, our model addresses the text classification task by using the following two steps: it first detects entities from the document, and then classifies the document using the proposed model with the detected entities as inputs.

Entity Detection
In this step, we detect entities that may be relevant to the document. Here, we use a simple method based on an entity dictionary that maps an entity name (e.g., "Washington") to a set of possible referent entities (e.g., Washington, D.C. and George Washington). In particular, we first take all words and phrases in a document, treat them as entity names if they exist in the dictionary, and detect all possible referent entities for each detected entity name. Following past work (Hasibi et al., 2016; Xiong et al., 2016), the boundary overlaps of the names are resolved by detecting only those that are the earliest and the longest.
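The dictionary lookup described above can be sketched as follows. This is a minimal illustration, not the released implementation: the toy dictionary, the `MAX_NAME_LEN` cap, and the function name `detect_entities` are hypothetical, and "earliest and longest" is interpreted here as a left-to-right scan that prefers the longest name starting at each position.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: lower-cased entity name -> possible referent entities.
ENTITY_DICT: Dict[str, Set[str]] = {
    "washington": {"Washington, D.C.", "George Washington"},
    "george washington": {"George Washington"},
    "apple": {"Apple Inc.", "Apple (food)"},
}
MAX_NAME_LEN = 3  # assumed cap on the number of tokens in an entity name


def detect_entities(tokens: List[str]) -> List[Tuple[int, int, Set[str]]]:
    """Return (start, end, candidates) spans, scanning left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate name starting at position i first.
        for n in range(min(MAX_NAME_LEN, len(tokens) - i), 0, -1):
            name = " ".join(tokens[i:i + n]).lower()
            if name in ENTITY_DICT:
                match = (i, i + n, ENTITY_DICT[name])
                break
        if match is None:
            i += 1
        else:
            spans.append(match)
            i = match[1]  # continue after the match so spans never overlap
    return spans


tokens = "George Washington ate an apple in Washington".split()
spans = detect_entities(tokens)
```

Note that no disambiguation happens here: the final "Washington" keeps both of its candidate referents, which are all passed on to the model.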
We use Wikipedia as the target KB, and the entity dictionary is built by using the names and their referent entities of all internal anchor links in Wikipedia (Guo et al., 2013). We also collect two statistics from Wikipedia, namely link probability and commonness (Mihalcea and Csomai, 2007; Milne and Witten, 2008). The former is the probability of a name being used as an anchor link in Wikipedia, whereas the latter is the probability of a name referring to an entity in Wikipedia.
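As a rough illustration of these two statistics, the sketch below estimates them from occurrence counts. The counts, the variable names, and the choice to count occurrences (rather than, for example, documents) are assumptions made for the example.

```python
from collections import Counter
from typing import Dict

# Hypothetical counts: how often each name links to each entity,
# and how often the name occurs in text at all (linked or not).
anchor_counts: Dict[str, Counter] = {
    "apple": Counter({"Apple Inc.": 60, "Apple (food)": 40}),
}
text_counts: Dict[str, int] = {"apple": 400}


def link_probability(name: str) -> float:
    """Fraction of the name's occurrences that are anchor links."""
    total_links = sum(anchor_counts.get(name, Counter()).values())
    occurrences = text_counts.get(name, 0)
    return total_links / occurrences if occurrences else 0.0


def commonness(name: str, entity: str) -> float:
    """Fraction of the name's anchor links that point to the given entity."""
    links = anchor_counts.get(name, Counter())
    total_links = sum(links.values())
    return links[entity] / total_links if total_links else 0.0


lp = link_probability("apple")
cm = commonness("apple", "Apple Inc.")
```

With these toy counts, 100 of the 400 occurrences of "apple" are links, and 60 of those 100 links point to Apple Inc.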
We generate a list of entities by concatenating all possible referent entities contained in the dictionary for each detected entity name, and feed it to the model presented in the next section. Note that we do not disambiguate entity names here, but detect all possible referent entities of the entity names.

Model
Figure 1 shows the architecture of our model. Given words $w_1, \ldots, w_N$ and entities $e_1, \ldots, e_K$ detected from target document $D$, we first compute the word-based representation of $D$:

$$z_{word} = \frac{1}{N} \sum_{i=1}^{N} v_{w_i}$$

where $v_w \in \mathbb{R}^d$ is the embedding of word $w$. We then derive the entity-based representation of $D$ as a weighted average of the embeddings of the entities:

$$z_{entity} = \sum_{i=1}^{K} a_{e_i} v_{e_i}$$

where $v_e \in \mathbb{R}^d$ is the embedding of entity $e$ and $a_e$ is the normalized attention weight corresponding to $e$, computed using the following softmax-based attention function:

$$a_e = \mathrm{softmax}\left(w_a^\top \Phi(e, D) + b_a\right)$$

where $w_a \in \mathbb{R}^l$ is a weight vector, $b_a \in \mathbb{R}$ is the bias, and $\Phi(e, D)$ is a function that generates an $l$-dimensional vector consisting of the features of the attention function.
We use the following two features in the attention function:
• Cosine: the cosine similarity between the embedding of the entity $v_e$ and the word-based representation of the document $z_{word}$.
• Commonness: the probability that the entity name refers to the entity in KB.
Here, our aim is to capture the relevance and the unambiguity of entity $e$ in document $D$ using the attention function. Thus, the problem is related to the tasks of entity salience detection (Gamon et al., 2013; Dunietz and Gillick, 2014), which aims to detect entities relevant (or salient) to the document, and entity linking, which aims to resolve the ambiguity of entities. The key assumption relating to these two tasks in the literature is that if an entity is semantically related to the given document, it is relevant to the document (Dunietz and Gillick, 2014), and it is likely to appear in the document (Milne and Witten, 2008; Ratinov et al., 2011). With this in mind and following past work (Yamada et al., 2016), we use the cosine similarity between $v_e$ and $z_{word}$ as a feature. Further, as in past entity linking studies, we also use the commonness of the name referring to the entity. Moreover, we derive a representation based both on entities and words by simply adding $z_{entity}$ and $z_{word}$:

$$z_{full} = z_{entity} + z_{word}$$

We then solve the task using a multiclass logistic regression classifier with the computed representation (i.e., with $z_{entity}$ or $z_{full}$) as features. In the remainder of this paper, we denote our models based on $z_{entity}$ and $z_{full}$ by NABoE-entity and NABoE-full, respectively.
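To make the attention-weighted representation described in this section concrete, the following is a minimal sketch in plain Python. It assumes that the word-based representation is the average of the word embeddings; the function names, the toy two-dimensional vectors, and the parameter values are hypothetical and are not taken from the released code.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def represent(
    word_vecs: Sequence[Vector],
    entity_vecs: Sequence[Vector],
    commonness: Sequence[float],
    w_a: Tuple[float, float],
    b_a: float,
) -> Tuple[List[float], Vector, Vector]:
    """Return (attention weights, entity-based vector, entity+word vector)."""
    z_word = mean_vector(word_vecs)  # assumed: average of word embeddings
    # Two attention features per entity: cosine to z_word, and commonness.
    scores = [
        w_a[0] * cosine(v_e, z_word) + w_a[1] * c + b_a
        for v_e, c in zip(entity_vecs, commonness)
    ]
    attention = softmax(scores)
    z_entity = [
        sum(a * v_e[j] for a, v_e in zip(attention, entity_vecs))
        for j in range(len(z_word))
    ]
    z_full = [e + w for e, w in zip(z_entity, z_word)]
    return attention, z_entity, z_full


attention, z_entity, z_full = represent(
    word_vecs=[[1.0, 0.0], [1.0, 0.0]],
    entity_vecs=[[1.0, 0.0], [0.0, 1.0]],
    commonness=[0.9, 0.1],
    w_a=(2.0, 1.0),  # hypothetical attention parameters
    b_a=0.0,
)
```

In this toy document the first entity is both closer to the word-based representation and more common for its name, so it receives the larger attention weight. Because the bias is shared by all entities, it cancels in the softmax normalization and does not change the weights in this sketch.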

Experimental Setup
In this section, we describe our experimental setup used both in the text classification and the factoid question answering experiments presented below.

Entity Detection
As the target KB, we used the September 2018 version of Wikipedia, which contains a total of 7,333,679 entities. Regarding the entity dictionary described in Section 2.1, we excluded an entity name if its link probability was lower than 1% and a referent entity if its commonness given the entity name was lower than 3% for computational efficiency. Entity names were treated as case-insensitive. As a result, the dictionary contained 18,785,550 entity names, and each name had 1.14 referent entities on average. Furthermore, to detect entities from a document, we also tested two publicly available entity linking systems, Wikifier (Ratinov et al., 2011; Cheng and Roth, 2013) and TAGME (Ferragina and Scaiella, 2012), instead of using dictionary-based entity detection. We selected these systems because they are capable of detecting non-named entities (e.g., technical terms) that are useful for addressing the text classification task. Here, we used the entities detected and disambiguated by these systems as inputs to our neural network model.
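The two pruning thresholds above can be illustrated with a short sketch. The input layout (a mapping from a name to its link probability and per-entity commonness), the function name, and the toy values are assumptions for the example; merging the statistics of names that collide after lower-casing is not handled here.

```python
from typing import Dict, Tuple

MIN_LINK_PROB = 0.01   # names with link probability below 1% are dropped
MIN_COMMONNESS = 0.03  # referents with commonness below 3% are dropped

RawDict = Dict[str, Tuple[float, Dict[str, float]]]


def prune_dictionary(raw: RawDict) -> Dict[str, Dict[str, float]]:
    """Apply both thresholds and lower-case names (case-insensitive lookup)."""
    pruned: Dict[str, Dict[str, float]] = {}
    for name, (link_prob, candidates) in raw.items():
        if link_prob < MIN_LINK_PROB:
            continue
        kept = {e: c for e, c in candidates.items() if c >= MIN_COMMONNESS}
        if kept:
            pruned[name.lower()] = kept
    return pruned


raw = {
    "Apple": (0.25, {"Apple Inc.": 0.60, "Apple (food)": 0.39, "Apple Records": 0.01}),
    "the": (0.001, {"The (band)": 1.0}),
}
pruned = prune_dictionary(raw)
```

Here "the" is removed because it is almost never used as a link, and Apple Records is removed as a referent of "apple" because its commonness falls below the threshold.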

Pretrained Embeddings
We initialized the embeddings of words ($v_w$) and entities ($v_e$) using pretrained embeddings trained on the KB. To learn embeddings from the KB, we used the method adopted in the open source Wikipedia2Vec tool (Yamada et al., 2016, 2018a). In particular, we generated an entity-annotated corpus from Wikipedia by treating entity links in Wikipedia articles as entity annotations, and trained skip-gram embeddings (Mikolov et al., 2013a,b) of 300 dimensions with negative sampling using the generated corpus as inputs. The learned embeddings place similar words and entities close to one another in a unified vector space. Here, we used the same version of Wikipedia described in Section 3.1.

Text Classification
To evaluate the effectiveness of our proposed model, we first conducted the text classification task on two standard datasets, namely the 20 Newsgroups (20NG) (Lang, 1995) and R8 datasets (Debole and Sebastiani, 2005).

Setup
Our experimental setup described in this section follows that in past work (Jin et al., 2016; Yamada et al., 2018b). In particular, we used the 20NG and R8 datasets to train and test the proposed model. The 20NG dataset was created using the documents obtained from 20 Newsgroups and contained 11,314 training documents and 7,532 test documents. The R8 dataset consisted of news documents from the eight most popular classes of the Reuters-21578 corpus (Lewis, 1992) and comprised 5,485 training documents and 2,189 test documents. We created the development set for each dataset by selecting 5% of the training documents. Note that the class distribution of the R8 dataset is highly imbalanced. For example, the largest and smallest classes contain 3,923 and 51 documents, respectively.
We report the accuracy and macro-average F1 scores. The model was trained using mini-batch stochastic gradient descent (SGD) with its batch size set to 32 and its learning rate controlled by Adam (Kingma and Ba, 2014). We used words and entities that were detected three times or more in the dataset and ignored the other words and entities. The size of the embeddings of words and entities was set to d = 300. We used early stopping based on the accuracy on the development set of each dataset to avoid overfitting of the model.
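Because the R8 class distribution is highly imbalanced, accuracy and macro-average F1 can diverge considerably. The following sketch computes both measures for a toy prediction set; the function names and labels are illustrative, and the macro average is taken here over every class that appears in either the gold or the predicted labels, which is an assumption about the exact evaluation protocol.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# A classifier that always predicts the majority class.
gold = ["earn"] * 8 + ["grain"] * 2
pred = ["earn"] * 10
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

In this example the majority-class predictor reaches an accuracy of 0.8, but its macro-average F1 is only 4/9 because the minority class contributes an F1 of zero.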

Baselines
We used the following models as our baselines: The performance of this model was superior to that of many state-of-the-art models, including those based on the skip-gram and CBOW models (Mikolov et al., 2013b), and the paragraph vector model (Le and Mikolov, 2014).
• SWEM-concat (Shen et al., 2018): This model is based on a neural network model with simple pooling operations (i.e., average and max pooling) over pretrained word embeddings. Despite its simplicity, it outperformed many neural network-based models such as the word-based CNN model (Kim, 2014) and RNN model with LSTM units (Shen et al., 2018).
• TextEnt (Yamada et al., 2018b): This model learns entity-aware document embeddings from Wikipedia, and uses a neural network model with the learned embeddings as pretrained parameters to address text classification.
As described in Section 3.1, we also tested the variants of our NABoE-entity and NABoE-full models for which Wikifier and TAGME were used as the entity detection methods.

Results
Table 1 shows the results of our models and those of our baselines. Here, w/o att. and w/o emb. signify the model without the neural attention mechanism (all attention weights $a_e$ are set to $1/K$, where $K$ is the number of entities in the document) and the model without the pretrained embeddings (the embeddings are initialized randomly), respectively.
Relative to the baselines, our models yielded enhanced overall performance on both datasets. The NABoE-full model outperformed all baseline models in terms of both measures on both datasets. Furthermore, the NABoE-entity model outperformed all the baseline models in terms of both measures on the 20NG dataset, and the F1 score on the R8 dataset. Moreover, our attention mechanism consistently improved the performance. These results clearly highlighted the effectiveness of our approach, which addresses text classification by using a small number of unambiguous and relevant entities detected by the proposed attention mechanism. Moreover, the pretrained embeddings improved the performance on both datasets. Further, the models based on the dictionary-based entity detection (see Section 2.1) generally outperformed the models based on the entity linking systems (i.e., Wikifier and TAGME). We consider that this is because these entity linking systems failed to detect or disambiguate entity names that were useful to address the text classification task. Moreover, our attention mechanism consistently improved the performance for Wikifier- and TAGME-based models because the attention mechanism enabled the model to focus on entities that were relevant to the document.

Analysis
In this section, we provide a detailed analysis of the performance of our model in terms of conducting the text classification task. We first provide a comparison of the SWEM-concat, NABoE-entity, and NABoE-full models using class-level F1 scores on both of the datasets (see Table 2). Here, we aim to compare the detailed performance of the word-based model (SWEM-concat), entity-based model (NABoE-entity), and the model based on both words and entities (NABoE-full). Compared with the SWEM-concat model, the NABoE-full and NABoE-entity models performed    similar classes in the R8 dataset based only on entities.
Next, we conducted a feature study of the attention mechanism by excluding one feature at a time from the NABoE-entity model (Table 3). We found both of the features to make an important contribution to the performance. Furthermore, to investigate the attention mechanism in more detail, we computed the top influential entities in the attention mechanism for each class on the 20NG and R8 datasets. In particular, we calculated the number of times each entity obtained the highest attention weight in the test documents in each class and selected the five most frequent ones. Table 4 presents the results. Overall, our attention mechanism successfully selected entities that were highly relevant to each class. For example, Cryptography, Algorithm, Escrow, Considered harmful, and Encryption were selected for the sci.crypt class. Furthermore, although we did not explicitly perform entity disambiguation, the model successfully overcame the ambiguity issues in the entity names and attended to the entities that were relevant to the classes.
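The counting procedure used to find the most influential entities per class can be sketched as follows. The input layout (one tuple of class label, detected entities, and attention weights per test document), the function name, and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[str], Sequence[float]]  # (class, entities, weights)


def top_attended_entities(docs: Iterable[Doc], k: int = 5) -> Dict[str, List[str]]:
    """For each class, list the k entities that most often get the top weight."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for label, entities, weights in docs:
        if not entities:
            continue  # documents without detected entities contribute nothing
        best = max(range(len(entities)), key=lambda i: weights[i])
        counts[label][entities[best]] += 1
    return {label: [e for e, _ in c.most_common(k)] for label, c in counts.items()}


docs = [
    ("sci.crypt", ["Cryptography", "Apple Inc."], [0.9, 0.1]),
    ("sci.crypt", ["Encryption", "Cryptography"], [0.2, 0.8]),
    ("sci.crypt", ["Encryption"], [1.0]),
    ("rec.autos", ["Car", "Engine"], [0.7, 0.3]),
]
top = top_attended_entities(docs)
```

Only the single highest-weighted entity of each document is counted, so an entity that is frequently detected but rarely attended to does not appear in the resulting lists.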
Subsequently, we conducted an error analysis by selecting 50 random test documents for which the NABoE-entity model made wrong predictions. Most of the errors were caused by two pairs of classes: 22 errors were caused by misclassifying documents of acq (corporate acquisitions) and those of earn (corporate earnings), and 13 errors were caused by misclassifying documents of interest and those of money-fx. Furthermore, the model tended to perform poorly if a document contained entities that strongly indicate an incorrect class. For example, a money-fx document containing the entity interest rate multiple times was classified into the interest class, and a document in the acq class reporting news related to oil companies (i.e., ExxonMobil and ZENEX) was classified into the crude class.

Factoid Question Answering
In this section, we address factoid question answering based on a dataset consisting of questions of the quiz bowl trivia quiz game. Factoid question answering is one of the common settings of question answering that aims to predict an entity (e.g., events, authors, and books) that is described in a given question. The players of quiz bowl solve questions consisting of sentences that describe an entity. Quiz bowl questions have frequently been used for evaluating neural network-based models in recent studies (Iyyer et al., 2014, 2015).
This task has a significantly larger number of target classes compared to the task addressed in the previous experiment. Our main aim here is to evaluate the effectiveness of using entities to capture the finer-grained semantics required to perform the task of factoid question answering effectively.

Setup
Our experimental setup described in this section follows that in past work (Xu and Li, 2016). We address this task as a text classification problem that selects the most relevant answer from the possible answers observed in the dataset. We obtained the dataset proposed in Iyyer et al. (2014), which was downloaded from the authors' web page (https://cs.umd.edu/~miyyer/qblearn/). We only used questions in the history and literature categories. Furthermore, we excluded questions whose answers appear fewer than six times in the dataset. As a result, the number of candidate answers was 303 and 424 in the history and literature categories, respectively. We used 20% of the questions each for the development and test sets, and the remaining 60% for the training set. As a result, the training, development, and test sets consisted of 1,535, 511, and 511 questions for the history category, and 2,524, 840, and 840 questions for the literature category.
The settings we used to train the model were the same as those in the previous experiment (see Section 4.1). The model was trained using mini-batch SGD with its learning rate controlled by Adam (Kingma and Ba, 2014) and its mini-batch size set to 32. We used words and entities that were detected three times or more in the dataset, and ignored the other words and entities. The size of the embeddings of words and entities was set to d = 300. As in past work, we report the accuracy score, and the score on the development set was used for early stopping.
Table 5: Accuracy of the proposed and baseline methods for the factoid QA task.
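The answer filtering and the 60/20/20 split described in this setup can be sketched as follows. The random shuffle, the seed, the function name, and the toy questions are assumptions; the source does not state how questions were assigned to each set.

```python
import random
from collections import Counter
from typing import List, Tuple

Question = Tuple[str, str]  # (question text, answer)
MIN_ANSWER_COUNT = 6  # answers seen fewer than six times are excluded


def filter_and_split(
    questions: List[Question], seed: int = 0
) -> Tuple[List[Question], List[Question], List[Question]]:
    """Drop rare answers, then split into train/dev/test (about 60/20/20)."""
    counts = Counter(answer for _, answer in questions)
    kept = [q for q in questions if counts[q[1]] >= MIN_ANSWER_COUNT]
    rng = random.Random(seed)  # assumed: a seeded random shuffle
    rng.shuffle(kept)
    n_eval = len(kept) // 5  # 20% each for development and test
    dev = kept[:n_eval]
    test = kept[n_eval:2 * n_eval]
    train = kept[2 * n_eval:]
    return train, dev, test


questions = [(f"question {i}", "Tokugawa Ieyasu") for i in range(10)]
questions += [(f"rare question {i}", "Rare answer") for i in range(5)]
train, dev, test = filter_and_split(questions)
```

Taking the floor of 20% is consistent with the reported set sizes: the history category has 2,557 questions in total, of which 511 go to each evaluation set, and the literature category has 4,204, of which 840 go to each.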

Baselines
We used the following baseline models:
• BoW (Xu and Li, 2016): This model is based on a logistic regression classifier with conventional binary BoW features.
• FTS-BRNN (Xu and Li, 2016): This model is based on a bidirectional RNN with gated recurrent units (GRU). It uses the logistic regression classifier with the features derived by the RNN.
• NTEE: This model is a state-of-the-art model that uses a multi-layer perceptron classifier with the features computed using the embeddings of words and entities trained on Wikipedia using the neural network model proposed in their paper.
Similar to our previous experiment, we also add SWEM-concat, and the variants of our NABoE-entity and NABoE-full models based on Wikifier and TAGME (see Section 4.2). Note that all the baselines address the task as a text classification problem.

Results and Analysis
Table 5 provides the results of our models and those of our baselines. Overall, our models achieved enhanced performance on this task. In particular, the NABoE-full model successfully outperformed all the baseline models, and the NABoE-entity model achieved competitive performance and outperformed all the baseline models in the literature category. These results clearly highlighted the effectiveness of our model for this task.
Furthermore, similar to the previous text classification experiment, the attention mechanism and the pretrained embeddings consistently improved the performance. Moreover, the models based on dictionary-based entity detection outperformed the models based on the entity linking systems.
We also conducted an error analysis using the NABoE-entity model and the test questions in the history category. We found nearly 70% of the errors to be caused by questions of which the answers were country names. This is because these questions tended to provide indirect clues (e.g., describing a notable person born in the country) and most entities used in these clues do not directly indicate the answer (i.e., country names). Furthermore, our model failed in difficult cases such as predicting Tokugawa shogunate instead of Tokugawa Ieyasu.

Related Work
KB entities have been conventionally used to model the semantics in texts. A representative example is Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2006, 2007), which represents a document using a bag of entities, namely a sparse vector of which each dimension corresponds to the relevance score of the text to each entity. This simple method is shown to be effective for various NLP tasks including text classification (Gabrilovich and Markovitch, 2006; Gupta and Ratinov, 2008; Negi and Rosner, 2013) and information retrieval (Egozi et al., 2011; Xiong et al., 2016).
Several neural network models that use KB entities to capture the semantics in texts have been proposed. These models typically depend on an additional preprocessing step that extracts the relevant entities from the target texts. For example, Wang et al. (2017) used the Probase conceptualization API for short text classification by retrieving the Probase entities that were relevant to the target text and used them in a model based on CNN. Pilehvar et al. (2017) also extracted entities using a graph-based linking algorithm and used these entities in a neural network model. A similar approach was adopted in Yamada et al. (2018b,c); they extracted entities from the target text using an entity linking system and simply used the detected entities in a neural network model. However, unlike these models, our proposed model addresses the task in an end-to-end manner; i.e., entities that are relevant to the target text are automatically selected using our neural attention mechanism. Furthermore, we also used the model proposed by Yamada et al. (2018b) as a baseline in our text classification experiments.
Additionally, our work is also related to studies on entity linking. Entity linking models can be roughly classified into two groups: local models, which resolve entity names independently using the contextual relevance of the entity given a document, and global models, in which all the entity names in a document are resolved simultaneously to select a topically coherent set of results (Ratinov et al., 2011). Recent state-of-the-art models typically combine both of these models (Yamada et al., 2016; Ganea and Hofmann, 2017; Cao et al., 2018; Kolitsas et al., 2018). However, several studies also showed that the local model alone can achieve results competitive with those of the global and combined models (Eshel et al., 2017; Ganea and Hofmann, 2017; Cao et al., 2018; Kolitsas et al., 2018). In this study, we adopt a simple but effective local model, which uses cosine similarity between the embedding of the target entity and the word-based representation of the document to capture the relevance of an entity given a document.

Conclusions
This study proposed NABoE, which is a neural network model that performs text classification using entities in Wikipedia. We combined simple dictionary-based entity detection with a neural attention mechanism to enable the model to focus on a small number of unambiguous and relevant entities in a document. We achieved state-of-the-art results on two important NLP tasks, namely text classification and factoid question answering, which clearly verified the effectiveness of our approach. As a future task, we intend to more extensively analyze our model and explore its effectiveness for other NLP tasks. Furthermore, we would also like to test more expressive neural network models, for example, by integrating global entity coherence information into our neural attention mechanism.