A Comparison of Word Similarity Performance Using Explanatory and Non-explanatory Texts

Vectorial representations of words derived from large current events datasets have been shown to perform well on word similarity tasks. This paper shows that vectorial representations derived from substantially smaller explanatory text datasets, such as English Wikipedia and Simple English Wikipedia, preserve enough lexical semantic information to make these kinds of category judgments with equal or better accuracy.


Introduction
Vectorial representations derived from large current events datasets such as Google News have been shown to perform well on word similarity tasks (Mikolov, 2013; Levy & Goldberg, 2014). This paper shows that vectorial representations derived from substantially smaller explanatory text datasets, such as English Wikipedia and Simple English Wikipedia, preserve enough lexical semantic information to make these kinds of category judgments with equal or better accuracy. Analysis indicates these results may be driven by a prevalence of commonsense facts in explanatory text. These positive results for relatively small datasets suggest that vectors derived from slower but more accurate analyses of these resources may be practical for lexical semantic applications.

Wikipedia
Wikipedia is a free Internet encyclopedia and the largest general reference work on the Internet. As of December 2014, Wikipedia contained over 4.6 million articles and 1.6 billion words. As a corpus, it has been heavily used to train various NLP models. Features of Wikipedia have been exploited in research areas such as the semantic web (Lehmann et al., 2014) and topic modeling (Dumais, 1988; Gabrilovich, 2007). More importantly, Wikipedia has been a reliable source for word embedding training because of its sheer size and coverage (Qiu, 2014): recent word embedding models (Mikolov et al., 2013; Pennington et al., 2014) all use Wikipedia as an important corpus to build and evaluate their algorithms.

Simple English Wikipedia
Simple English Wikipedia is a Wikipedia edition in which all articles are written using simple English words and grammar. It was created to help adults and children who are learning English find encyclopedic information. Compared with full English Wikipedia, Simple English Wikipedia is much smaller: it contains around 120,000 articles and 20 million words, which is almost one fortieth the number of articles and one eightieth the number of words of full English Wikipedia, so its articles are also shorter on average. Simple English Wikipedia is often used in simplification research (Coster, 2011; Napoles, 2010), where sentences from full English Wikipedia are matched to sentences from Simple English Wikipedia to explore sentence simplification techniques. It would be reasonable to expect the small vocabulary of Simple English Wikipedia to be a disadvantage when creating word embeddings from co-occurrence information. However, because of the explanatory nature of its text, Simple English Wikipedia may still preserve enough information for models trained on it to perform comparably to models trained on full Wikipedia, and equally well or better than models trained on non-explanatory texts like the Google News corpus.

Word2Vec
The distributed representation of words, or word embeddings, has gained significant attention in the research community, and one of the more widely discussed works is Mikolov's (2013) research on word representation estimation. Mikolov proposed two neural network based models for word representation: Continuous Bag-of-Words (CBOW) and Skip-gram. CBOW predicts a given word from its surrounding context words, summing all the context word vectors together to represent the word, whereas Skip-gram uses the word to predict the context word vectors for skip-gram positions, therefore making the model sensitive to positions of context words. Both models scale well to large quantities of training data. However, Mikolov notes that Skip-gram works well with small amounts of training data and provides good representations for rare words, while CBOW performs better and has higher accuracy for frequent words when trained on larger corpora. The purpose of this paper is not to compare the models, but to use them to compare training corpora and see how different arrangements of information may affect the quality of the word embeddings.
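The difference between the two architectures can be illustrated by the training examples each one draws from a sentence. The following is a minimal sketch only: it shows how (input, target) examples are formed, not the neural networks themselves, and the function names and the toy sentence are illustrative assumptions rather than part of the word2vec implementation.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding it."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the whole context (combined into one input) predicts the center word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: the center word predicts each context word separately."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]


# Toy sentence (hypothetical, for illustration only).
SENTENCE = ["physics", "is", "a", "natural", "science"]
```

With a window of 1, `cbow_examples(SENTENCE, 1)` yields one example per token, while `skipgram_examples(SENTENCE, 1)` yields one example per (center, context) pair, so Skip-gram sees more, finer-grained examples from the same text.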

Task Description
To evaluate the effectiveness of full English Wikipedia and Simple English Wikipedia as training corpora for word embeddings, the word similarity-relatedness task described by Levy & Goldberg (2014) is used. As pointed out by Agirre et al. (2009) and Levy & Goldberg (2014), relatedness may actually be measuring topical similarity and be better predicted by a bag-of-words model, and similarity may be measuring functional or syntactic similarity and be better predicted by a context-window model. However, when the models are held constant, the semantic information about the test words in the training corpora is crucial to allowing the model to build semantic representations for the words. It may be argued that when the corpus is explanatory, more semantic information about the target words is present, whereas when the corpus is non-explanatory, the information around the words is merely related to them.
The WordSim353 dataset (Agirre, 2009) is used as the test dataset. It contains pairs of words judged by human annotators to be either similar or related, together with a gold standard similarity or relatedness score for every pair. There are 100 similar word pairs, 149 related pairs and 104 pairs of words with very weak or no relation. In the evaluation task, the unrelated word pairs are discarded from the dataset.
The objective of the task is to rank the similar word pairs higher than the related ones. The retrieval/ranking procedure is as follows. First, cosine similarity scores are calculated for all pairs using the word embeddings from a given model, and the scores are sorted from highest to lowest. The retrieval step then locates, in the sorted list, the pair that completes the first n% of the similar word pairs, and determines the percentage of similar word pairs in the sub-list ending at that pair. In other words, the procedure treats similar word pairs as successful retrievals and determines the accuracy rate when the recall rate is n%. Because the accuracy rate necessarily falls toward the proportion of similar word pairs among all retained word pairs (100 of 249, about 40%) as recall approaches 100%, the later and more suddenly it falls, the better the model is performing in this task.
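The retrieval/ranking procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported results: the function names, the integer-percent interface, and the rounding up of n% of the similar pairs to a whole pair are all choices made here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def accuracy_at_recall(scored_pairs, recall_percent):
    """Accuracy rate at the point where recall_percent of similar pairs are found.

    scored_pairs: list of (cosine_score, is_similar) tuples, where is_similar is
    True for a similar pair and False for a related pair.
    recall_percent: integer in 1..100.
    """
    if not 0 < recall_percent <= 100:
        raise ValueError("recall_percent must be in 1..100")
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    total_similar = sum(1 for _, is_similar in ranked if is_similar)
    if total_similar == 0:
        raise ValueError("need at least one similar pair")
    # Number of similar pairs that make up n% (rounded up; an assumption).
    needed = max(1, (recall_percent * total_similar + 99) // 100)
    found = 0
    for rank, (_, is_similar) in enumerate(ranked, start=1):
        if is_similar:
            found += 1
            if found == needed:
                # Share of similar pairs in the sub-list ending at this pair.
                return found / rank


def cumulative_score(scored_pairs, recall_points):
    """Sum of the accuracy rates over the given recall points."""
    return sum(accuracy_at_recall(scored_pairs, p) for p in recall_points)


# Hypothetical toy ranking: four similar pairs (True), two related pairs (False).
TOY = [(0.9, True), (0.8, False), (0.7, True),
       (0.6, True), (0.5, False), (0.4, True)]
```

On the toy data, three of the four similar pairs are found within the top four ranks, so `accuracy_at_recall(TOY, 75)` is 3/4; a model that ranks similar pairs higher keeps this value near 1 for longer.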

Models
The word2vec Python implementation provided by the gensim package (Rehurek et al., 2010) is used to train all the word2vec models. For both Skip-gram and CBOW, a 5-word window is used so that they receive the same amount of raw information, and words appearing 5 times or fewer are filtered out. The word embeddings from Skip-gram and CBOW all have 300 dimensions. Both full English Wikipedia and Simple English Wikipedia are used as training corpora with minimal preprocessing: XML tags are removed and infoboxes are filtered out. This yields four models: Full English Wikipedia - CBOW (FW-CBOW), Full English Wikipedia - Skip-gram (FW-SG), Simple English Wikipedia - CBOW (SW-CBOW) and Simple English Wikipedia - Skip-gram (SW-SG). The pre-trained Google News Skip-gram model with 300-dimensional vectors (GN-SG) is also downloaded from the Google word2vec website for comparison. This model is trained on the Google News dataset with 100 billion words, which is 30 times as large as the full English Wikipedia and 240 times as large as Simple English Wikipedia.
Table 1 shows the accuracy rate at every recall rate point, with the sum of all the accuracy rates as the cumulative score. GN-SG, although not far behind, does not give the best performance despite being trained on the largest dataset. In fact, it never excels at any recall rate point. It outperforms various models at certain recall rate points by a small margin, but there is no obvious advantage gained from training on a much larger corpus, even when compared with the models trained on Simple English Wikipedia, despite the greater risk of sparse data problems on this smaller dataset.
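The training configuration described for the four Wikipedia models might be expressed with gensim roughly as follows. This is a sketch under assumptions: the keyword names follow gensim 4.x (older releases name the dimensionality parameter `size` rather than `vector_size`), all hyperparameters not mentioned in the text are left at library defaults, and the helper names and the `MODEL_GRID` table are hypothetical.

```python
def word2vec_params(use_skipgram):
    """Hyperparameters stated in the text, as gensim 4.x keyword arguments."""
    return {
        "vector_size": 300,  # 300-dimensional embeddings
        "window": 5,         # 5-word context window
        "min_count": 6,      # ignores words seen 5 times or fewer
        "sg": 1 if use_skipgram else 0,  # 1 = Skip-gram, 0 = CBOW
    }


def train_model(sentences, use_skipgram):
    """Train one model; `sentences` is an iterable of token lists.

    gensim is imported lazily so the configuration above can be inspected
    without the package installed.
    """
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **word2vec_params(use_skipgram))


# Hypothetical lookup: model label -> (training corpus, use Skip-gram?).
MODEL_GRID = {
    "FW-CBOW": ("full English Wikipedia", False),
    "FW-SG": ("full English Wikipedia", True),
    "SW-CBOW": ("Simple English Wikipedia", False),
    "SW-SG": ("Simple English Wikipedia", True),
}
```

Note that `min_count=6` is used because gensim discards words whose total frequency is lower than `min_count`, which matches filtering out words that appear 5 times or fewer.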

Results
The models trained on Simple English Wikipedia and full English Wikipedia also perform almost equally well. FW-CBOW, trained on full English Wikipedia, performs best among the models overall, but at the first few recall rate points it performs as well as or slightly worse than either SW-CBOW or SW-SG trained on Simple English Wikipedia, and it is as good as SW-CBOW at the first two recall points. At the later points, FW-CBOW is better than all the other models most of the time, but the margin is narrow.
Comparing FW-SG with SW-SG and SW-CBOW, there is almost no sign of a performance gain from training on full Wikipedia instead of the much smaller Simple Wikipedia. FW-SG performs as well as or often slightly worse than both Simple Wikipedia models.
The main observation in this paper is that Google News does not substantially outperform the other systems and that the full Wikipedia systems do not substantially outperform the Simple Wikipedia systems (that is, comparing the CBOW models to one another and the Skip-gram models to one another). The main result from the table is not that smaller training datasets yield better systems, but that systems trained on significantly smaller datasets of explanatory text perform very closely in this task to systems trained on very large datasets, despite the large difference in training data size.

Analysis
As mentioned previously, similarity may be better predicted by a context-window model because it measures functional or syntactic similarity. However, it is not clear in these models that the syntactic information is a major component in the word embeddings. Instead, it may be that the main factor for the performance level of the models is the general explanatory content of the Wikipedia articles, as opposed to the current events content of Google News.
For similar words such as synonyms or hyponyms, the crucial information making them similar is the shared general semantic features of the words. For example, for the word pair physics : chemistry, the shared semantic features might be that they are both academic subjects, both studied in institutions and both composed of different subfields, as shown in Table 2. The '@' sign in Table 2 connects a context word with its position relative to the word in the center of the window. These shared properties of the core semantic identities of these words may contribute greatly to similarity judgments for humans and machines alike, and may be considered general knowledge about the words. For related words, for example computer : keyboard, it may be difficult to pinpoint the semantic overlap between the components that build up the core semantic identities of these words, and none is observed in the data.
General knowledge about a word may be found in explanatory texts such as dictionaries or encyclopedias, but rarely elsewhere. Writers of informative non-explanatory texts like news articles would assume that readers are well acquainted with the basic semantic information about the words, so repeating such information would be unnecessary. For a similarity/relatedness judgment task where basic and compositional semantic information may prove useful, a corpus like Google News, where the context of a particular word assumes the reader is already conversant with it, would not be as effective as a corpus like Wikipedia, where general knowledge about a word may be available and repeated. Also, the smaller vocabulary size of Wikipedia compared with Google News suggests that general knowledge may be conveyed more efficiently, with less data sparsity.
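The position-tagged context features mentioned above, in which '@' joins a context word to its position relative to the center word, could be extracted along the following lines. This is only an illustrative sketch: the exact feature format (context word first, signed offset after the '@'), the function names, and the two toy sentences are assumptions made here, not data or code from the study.

```python
def positional_contexts(tokens, center, window):
    """Return features such as 'studied@-1' for the words around tokens[center]."""
    features = []
    for offset in range(-window, window + 1):
        position = center + offset
        if offset != 0 and 0 <= position < len(tokens):
            # '+d' formatting keeps an explicit sign: -2, -1, +1, +2.
            features.append(f"{tokens[position]}@{offset:+d}")
    return features


def shared_contexts(features_a, features_b):
    """Position-tagged contexts that two words have in common."""
    return set(features_a) & set(features_b)


# Hypothetical toy sentences with the target word at index 2.
SENT_A = ["she", "studied", "physics", "at", "university"]
SENT_B = ["he", "studied", "chemistry", "at", "school"]
```

Under these assumptions, `shared_contexts(positional_contexts(SENT_A, 2, 2), positional_contexts(SENT_B, 2, 2))` contains the contexts common to physics and chemistry in the toy sentences, which mirrors the kind of overlap the analysis looks for between similar words.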

Conclusion
This paper has shown that vectorial representations derived from substantially smaller explanatory text datasets such as Wikipedia and Simple Wikipedia preserve enough lexical semantic information to make these kinds of category judgments with equal or better accuracy than representations derived from news corpora. Analysis indicates these results may be driven by a prevalence of commonsense facts in explanatory text. These positive results for small datasets suggest that vectors derived from slower but more accurate analysis of these resources may be practical for lexical semantic applications. We hope this result makes future researchers more aware of the viability of smaller-scale resources like Simple English Wikipedia (or presumably Wikipedias in other languages, which are substantially smaller than English Wikipedia), which can still produce high quality vectors despite their much smaller size.