Corpus specificity in LSA and Word2vec: the role of out-of-domain documents

Latent Semantic Analysis (LSA) and Word2vec are among the most widely used word-embedding techniques. Despite their popularity, the precise mechanisms by which they acquire new semantic relations between words remain unclear. In the present article we investigate whether the capacity of LSA and Word2vec to identify relevant semantic dimensions increases with corpus size. One intuitive hypothesis is that this capacity should increase as the amount of data increases. However, if the corpus grows in topics that are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we set out to examine and distinguish these alternative hypotheses. To investigate the effect of corpus specificity and size in word embeddings, we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of documents unrelated to a specific task. We show that Word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. In contrast, the specialization of the training corpus (removal of out-of-domain documents), accompanied by a decrease in dimensionality, can increase LSA word-representation quality while reducing processing time. Furthermore, we show that specialization without a decrease in LSA dimensionality can produce a strong performance reduction in specific tasks. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition may not be efficiently exploiting higher-order co-occurrences and global relations, whereas Word2vec does.

LSA takes as input a training corpus formed by a collection of documents. A word-by-document co-occurrence matrix is then constructed, which contains the distribution of occurrences of the different words across the documents. Usually, a mathematical transformation is then applied to reduce the weight of uninformative high-frequency words in the word-by-document matrix (Dumais, 1991). Finally, a linear dimensionality reduction is implemented by a truncated Singular Value Decomposition (SVD), which projects every word onto a subspace with a predefined number of dimensions, k. The success of LSA in capturing the latent meaning of words comes from this low-dimensional mapping. This improvement in representation can be explained as a consequence of the elimination of the noisiest dimensions (Turney & Pantel, 2010).
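The dimensionality-reduction step can be restated compactly as follows. The symbols X, U_k, Σ_k and V_k are notation introduced here for illustration only, and taking the rows of U_k Σ_k as word vectors is one common convention, not necessarily the one used in the experiments reported below.

```latex
% X: weighted word-by-document matrix (m words, n documents), k: number of retained dimensions
X \;\approx\; X_k \;=\; U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0 .

% One common choice for the k-dimensional representation of word i:
w_i \;=\; \left( U_k \Sigma_k \right)_{i,:} \in \mathbb{R}^{k}
```

Keeping only the k largest singular values is what discards the low-variance directions that the text refers to as noisy dimensions.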
Word2vec consists of two neural network models, Continuous Bag of Words (CBOW) and Skip-gram. To train the models, a sliding window is moved along the corpus. In the CBOW scheme, at each step the neural network is trained to predict the center word (the word in the center of the window) given the context words (the other words in the window). In the Skip-gram scheme, the model is instead trained to predict the context words based on the center word. In the present paper we use the skip-gram, which has produced better performance in .
Despite the development of new word representation methods, LSA is still intensively used, and it has been shown to produce better performance than Word2vec on small to medium-sized training corpora (Altszyler, Ribeiro, Sigman, & Slezak, 2017).
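The difference between the two training schemes can be illustrated by listing the examples each one extracts from a sliding window. The following is a minimal sketch, not the Word2vec implementation itself: the function names are hypothetical, `window` is assumed to be the number of words taken on each side of the center word, and details such as negative sampling or frequent-word subsampling are left out.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: the center word predicts each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return (context words, center) examples: the context predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        examples.append((context, center))
    return examples
```

Both functions visit the same windows; they differ only in which side of the window is treated as the input and which as the prediction target.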

Training Corpus Size and Specificity in Word-embeddings
Over the last years, great effort has been devoted to understanding how to choose the right parameter settings for different tasks (Baroni, Dinu, & Kruszewski, 2014; Bradford, 2008; Dumais, 2003; Landauer & Dumais, 1997; Lapesa & Evert, 2014; Nakov, Valchanova, & Angelova, 2003; Quesada, 2011). However, considerably less attention has been given to studying how the different corpora used as training input may affect performance. Here we ask a simple question about a property of the corpus: is there a monotonic relation between corpus size and performance? More precisely, what happens if the topics of the additional documents differ from the topics of the specific task? Surprisingly, previous studies have shown some contradictory results on this simple question.
On the one hand, in their foundational work, Landauer and Dumais (1997) compare the word-knowledge acquisition of LSA with that of children. This acquisition process may be produced by 1) direct learning, enhancing the incorporation of new words by reading texts that explicitly contain them; or 2) indirect learning, enhancing the incorporation of new words by reading texts that do not contain them. To do that, they evaluate LSA semantic representations trained with corpora of different sizes on multiple-choice synonym questions extracted from the TOEFL exam. This test consists of 80 multiple-choice questions, in which the task is to identify the synonym of a word among 4 options. To train LSA, Landauer and Dumais used the TASA corpus (Zeno, Ivens, & Millard, 1995).
Landauer and Dumais (1997) randomly replaced exam-words in the corpus with nonsense words and varied the number of documents in the corpus by selecting nested sub-samples of the total corpus. They concluded that LSA improves its performance on the exam both when training with documents that contain exam-words and with documents that do not. However, as could be expected, they observed a greater effect when training with exam-words. It is worth mentioning that replacing exam-words with nonsense words may create incorrect documents, thus making the algorithm acquire word-knowledge from documents which should have an exam-word but do not. In the Results section, we will study this indirect word acquisition in the TOEFL test without using nonsense words.
Along the same line, Lemaire and Denhiere (2006) studied the effect of high-order co-occurrences in LSA semantic similarity, which goes further in the study of Landauer's indirect word acquisition.
In their work, Lemaire and Denhiere (2006) measure how the similarity between 28 pairs of words (such as bee/honey and buy/shop) changes when a 400-dimension LSA is trained with a growing number of paragraphs. Furthermore, they identify for this task the marginal contribution of the first, second and third order of co-occurrence as the number of paragraphs is increased. In this experiment, they found that not only does the first order of co-occurrence contribute to the semantic closeness of the word pairs, but the second and third orders also increase the similarity of the pairs. It is worth noting that Landauer's indirect word acquisition can be understood in terms of paragraphs that contain neither of the words in a pair but do contain a third- or higher-order co-occurrence link.
So, the conclusions from the Lemaire and Denhiere (2006) and Landauer and Dumais (1997) studies suggest that increasing corpus size results in a gain, even if this increase is in topics that are unrelated to the semantic directions pertinent to the task.
However, a different conclusion seems to result from another set of studies. Stone, Dennis, Kwantes, Tor (2006) studied the effect of corpus size and specificity in a document similarity rating task. They found that training LSA with smaller subcorpora selected for the specific task domain maintains or even improves LSA performance. This corresponds to the intuition of noise filtering, in which removing information from irrelevant dimensions results in improved performance.
In addition, Olde et al. (Olde, Franceschetti, Karnavat, Graesser, & Group, 2002) studied the effect of selecting specific subcorpora in an automatic exam evaluation task. They created several subcorpora from a Physics corpus, progressively discarding documents unrelated to the specific questions. Their results showed small differences in performance between the LSA trained with the original corpus and the LSA trained with the more specific subcorpora.
It is well known that the number of LSA dimensions (k) is a key parameter to be duly adjusted in order to eliminate the noisiest dimensions (Landauer & Dumais, 1997; Turney & Pantel, 2010). Excessively high k values may not eliminate enough noisy dimensions, while excessively low k values may not provide enough dimensions to generate a proper representation. In this context, we hypothesize that when out-of-domain documents are discarded, the number of dimensions needed to represent the data should be lower, and thus k must be decreased.
Regarding Word2vec, Cardellino and Alemany (2017) and Dusserre and Padró (2017) have shown that Word2vec trained with a specific corpus can produce better performance in semantic tasks than when it is trained with a bigger, general corpus. Although these works point out the relevance of domain-specific corpora, they do not study specificity in isolation, as they compare corpora from different sources.
In this article, we set out to investigate the effect of the specificity and size of the training corpus in word embeddings, and how this interacts with the number of dimensions. To measure the quality of the semantic representations we used two different tasks: the TOEFL exam and a categorization test. The corpus evaluation method consists of comparing two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of out-of-domain documents (unrelated to the specific task). In addition, we varied k within a wide range of values.
As we show, LSA's dimensionality plays a key role in the LSA representation when the corpus analysis is made. In particular, we observe that discarding out-of-domain documents together with decreasing the number of dimensions produces an increase in the algorithm's performance. In one of the two tasks, discarding out-of-domain documents without decreasing k results in the opposite behavior, showing a strong performance reduction. On the other hand, Word2vec shows in all cases a performance reduction when out-of-domain documents are discarded, which suggests an exploitation of higher-order word co-occurrences.
Our contribution to understanding the effect of out-of-domain documents on word-embedding knowledge acquisition is valuable from two different perspectives:
• From an operational point of view: we show that LSA's performance can be enhanced when (1) its training corpus is cleaned of out-of-domain documents, and (2) the number of LSA dimensions is reduced. Furthermore, reducing both the corpus size and the number of dimensions tends to speed up processing. On the other hand, Word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus.
• From a cognitive-modeling point of view: we point out that LSA's word-knowledge acquisition does not take advantage of indirect learning (high-order word co-occurrences), while Word2vec's does.

Methods
We used the TASA corpus (Zeno et al., 1995) in all experiments. TASA is a commonly used linguistic corpus consisting of more than 37 thousand educational texts from the USA K-12 curriculum. We word-tokenized each document, discarding punctuation marks, numbers and symbols. Then, we transformed each word to lowercase and eliminated stopwords, using the stoplist in the NLTK Python package (Bird, Klein, & Loper, 2009). The TASA corpus contains more than 5 million words in its cleaned version. In each experiment, the training corpus size was changed by discarding documents in two different ways:
• Random documents discarding: The desired number of documents (n) contained in the subcorpus is preselected. Then, documents are randomly eliminated from the original corpus until there are exactly n documents. If any of the test words (i.e. words that appear in the specific task) does not appear at least once in the remaining corpus, one document is randomly replaced with one of the discarded documents that contains the missing word.
• Out-of-domain documents discarding: The desired number of documents (n) contained in the subcorpus is preselected. Then, only documents with no test words are eliminated from the original corpus until there are exactly n documents. Here, n must be greater than or equal to the number of documents that contain at least one of the test words.
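The two discarding procedures can be sketched as follows. This is an illustrative reading of the description above rather than the code used in the experiments: the function names are hypothetical, documents are assumed to be lists of tokens, and the choice of which kept document to swap out in the random method (preferring one with no test words) is an assumption made here so that a swap does not remove another test word.

```python
import random
from typing import List, Set

Document = List[str]  # a tokenized document


def has_test_word(doc: Document, test_words: Set[str]) -> bool:
    return any(token in test_words for token in doc)


def discard_out_of_domain(docs: List[Document], test_words: Set[str],
                          n: int, rng: random.Random) -> List[Document]:
    """Keep every document with a test word; fill up to n with random out-of-domain ones."""
    in_domain = [d for d in docs if has_test_word(d, test_words)]
    out_of_domain = [d for d in docs if not has_test_word(d, test_words)]
    if not len(in_domain) <= n <= len(docs):
        raise ValueError("n must lie between the number of in-domain documents and the corpus size")
    return in_domain + rng.sample(out_of_domain, n - len(in_domain))


def discard_random(docs: List[Document], test_words: Set[str],
                   n: int, rng: random.Random) -> List[Document]:
    """Keep n random documents, then swap documents in until each test word is present."""
    if not 0 < n <= len(docs):
        raise ValueError("n must be between 1 and the corpus size")
    kept = rng.sample(range(len(docs)), n)
    kept_set = set(kept)
    discarded = [i for i in range(len(docs)) if i not in kept_set]
    for word in sorted(test_words):
        if any(word in docs[i] for i in kept):
            continue
        donors = [i for i in discarded if word in docs[i]]
        if not donors:
            continue  # the word does not occur anywhere in the corpus
        donor = rng.choice(donors)
        # Assumption: prefer replacing a kept document that holds no test word.
        # If none exists, a random kept document is replaced, which may drop another test word.
        safe = [pos for pos, i in enumerate(kept) if not has_test_word(docs[i], test_words)]
        pos = rng.choice(safe) if safe else rng.randrange(n)
        discarded.remove(donor)
        discarded.append(kept[pos])
        kept[pos] = donor
    return [docs[i] for i in kept]
```

Passing an explicit `random.Random` instance makes it straightforward to draw several reproducible subcorpus samples per corpus size.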
Both LSA and Skip-gram word embeddings were generated with the Gensim Python library (Řehůřek & Sojka, 2010). In the LSA implementation, a Log-Entropy transformation was applied before the truncated Singular Value Decomposition. In the Skip-gram implementation, we discarded tokens with frequency higher than $10^{-3}$, and we set the window size and negative sampling parameters to 15 (values that were found to maximize performance in two semantic tasks over the TASA corpus (Altszyler et al., 2017)). In all cases, the number of word-embedding dimensions was varied to study its effect. The semantic similarity (S) of two words was calculated using the cosine similarity between their respective vector representations ($v_1$, $v_2$),
$$S(v_1, v_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}.$$
The semantic distance between two words, $d(v_1, v_2)$, is calculated as 1 minus the semantic similarity ($d(v_1, v_2) = 1 - S(v_1, v_2)$).
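The similarity and distance measures can be written in a few lines. The sketch below uses plain Python lists as vectors and hypothetical function names; it follows the standard definition of cosine similarity and defines the distance as one minus the similarity, as stated above.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def semantic_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Distance defined as 1 minus the cosine similarity; ranges from 0 to 2."""
    return 1.0 - cosine_similarity(v1, v2)
```

Because the measure depends only on the angle between vectors, rescaling a word vector does not change its similarity to any other word.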
Word-embeddings knowledge acquisition was tested in two different tasks: a semantic categorization test and the TOEFL test.

Semantic categorization test
In this test we measured the capability of the model to represent the semantic categories used by Patel et al. (Patel, Bullinaria, & Levy, 1997) (such as drinks, countries, tools and clothes). The test is composed of 53 categories with 10 words each. In order to measure how well word i is grouped vis-à-vis the other words in its semantic category we used the Silhouette Coefficient, s(i) (Rousseeuw, 1987),
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$
where a(i) is the mean distance of word i to all other words within the same category, and b(i) is the minimum mean distance of word i to the words of any other category (i.e. the mean distance to the neighbouring category). In other words, the Silhouette Coefficient measures how close a word is to the words of its own category compared with its closeness to the words of the neighbouring category. The Silhouette Score is computed as the mean value of all Silhouette Coefficients. The score takes values between -1 and 1, with higher values indicating localized categories with larger distances between them, and therefore better clustering. The high number of test words (530) and the high frequency of some of them leave only a few documents with no test words. This makes the range of corpus sizes available to the out-of-domain documents discarding method very small. To avoid this, we tested only on the 10 least frequent categories. The frequency of a category is measured as the number of documents in which at least one word from this category appears.
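The Silhouette Score described above can be computed directly from its definition (Rousseeuw, 1987), s(i) = (b(i) − a(i)) / max{a(i), b(i)}. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the distance is one minus the cosine similarity, and a word that is alone in its category is given s(i) = 0, which is the usual convention.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine_distance(v1: Sequence[float], v2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norms


def silhouette_score(vectors: Dict[str, Sequence[float]], category: Dict[str, str]) -> float:
    """Mean Silhouette Coefficient over all words listed in `category`."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, cat in category.items():
        groups[cat].append(word)
    if len(groups) < 2:
        raise ValueError("at least two categories are required")

    def mean_distance(word: str, others: List[str]) -> float:
        return sum(cosine_distance(vectors[word], vectors[o]) for o in others) / len(others)

    coefficients = []
    for word, cat in category.items():
        same = [w for w in groups[cat] if w != word]
        if not same:
            coefficients.append(0.0)  # convention for single-word categories
            continue
        a = mean_distance(word, same)
        b = min(mean_distance(word, members)
                for other, members in groups.items() if other != cat)
        denom = max(a, b)
        coefficients.append((b - a) / denom if denom > 0 else 0.0)
    return sum(coefficients) / len(coefficients)
```

With well-separated categories the score approaches 1, while assigning words to the wrong categories drives it below 0.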

TOEFL test
The TOEFL test was introduced by Landauer and Dumais (1997) to evaluate the quality of semantic representations. This test consists of 80 multiple-choice questions, in which the task is to identify the synonym of a target word among 4 options. For example: select the word most semantically similar to "enormously" among these words: "tremendously", "appropriately", "uniquely" and "decidedly". Performance on this test was measured as the percentage of correct responses.
Again, the high number of test words (400) and the high frequency of some of them leave few documents with no test words. We therefore performed the test only on the 20 least frequent questions in order to have out-of-domain documents to discard.
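Answering a TOEFL question with a word-embedding model amounts to choosing the option whose vector is most similar to the target's. The sketch below illustrates this under stated assumptions: the function names and the question tuple layout are hypothetical, and similarity is the cosine similarity between word vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple

Question = Tuple[str, List[str], str]  # (target word, options, correct option)


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms


def answer(target: str, options: List[str], vectors: Dict[str, Sequence[float]]) -> str:
    """Pick the option with the highest cosine similarity to the target word."""
    return max(options, key=lambda opt: cosine_similarity(vectors[target], vectors[opt]))


def toefl_score(questions: List[Question], vectors: Dict[str, Sequence[float]]) -> float:
    """Percentage of questions answered correctly."""
    correct = sum(answer(target, options, vectors) == solution
                  for target, options, solution in questions)
    return 100.0 * correct / len(questions)
```

With only 20 questions in the reduced test, each question changes the score by 5 percentage points, which is one reason to expect noticeable variability between subcorpus samples.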

Semantic categorization Test
In Figure 1 we show the LSA (top panel) and Word2vec (bottom panel) categorization performance with both document discarding methods. For each corpus size and document discarding method we took 10 subcorpus samples (in total we consider 90 subcorpora plus the complete corpus). For each corpus or subcorpus we trained LSA and Word2vec with a wide range of dimension values, using in each case the dimension that produces the best mean performance.
In both cases, performance decreases when documents are randomly discarded (dashed lines). However, LSA and Word2vec behave differently under the out-of-domain document discarding method (solid lines). While LSA produces better scores with increasing specificity, Word2vec performance decreases in the same situation.
LSA's maximum performance is obtained using 20 dimensions and removing all out-of-domain documents from the training corpus, whereas when the whole corpus is used the best number of dimensions is 100. These results show that performance for a specific task may be increased by "cleaning" the training corpus of out-of-domain documents. However, in order to enhance performance, the elimination of out-of-domain documents should be accompanied by a decrease in the number of LSA dimensions. For example, when the number of dimensions is kept fixed at 100, eliminating the out-of-domain documents results in a performance reduction of 55%. We also point out that this technical subtlety has not been taken into account in previous results that reported the presence of indirect learning in LSA (Landauer & Dumais, 1997; Lemaire & Denhiere, 2006). Additionally, the need to decrease the LSA dimensionality when the corpus size is reduced only occurs in the out-of-domain documents discarding method (see Figure 3 in Supplementary Materials). This result is consistent with LSA's ability to capture latent semantic domains. Unlike the random discarding method, out-of-domain documents discarding strongly reduces the variety of topics, so fewer dimensions are needed to identify the word categories. In contrast, Word2vec does not present a shift in its maxima in the out-of-domain documents discarding method (see Figure 4 in Supplementary Materials). Moreover, Word2vec is not very sensitive to changes in its dimensionality. These findings suggest that Word2vec does not encode latent semantic domains; however, more analysis must be done in this direction (see the discussion in Baroni et al., 2014).
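The model-selection step at the end of this procedure (choosing, for each corpus size, the dimension with the best mean performance over the subcorpus samples) can be sketched as below. The function name is hypothetical and the scores in the usage note are made-up placeholders, not results from the experiments.

```python
from statistics import mean
from typing import Dict, List, Tuple


def best_dimension(scores_by_k: Dict[int, List[float]]) -> Tuple[int, float]:
    """Return the dimension k with the highest mean score and that mean.

    `scores_by_k` maps each candidate dimension to the scores obtained on the
    subcorpus samples of one corpus size and one discarding method.
    """
    means = {k: mean(scores) for k, scores in scores_by_k.items()}
    k_best = max(means, key=means.get)
    return k_best, means[k_best]
```

For instance, `best_dimension({20: [0.30, 0.34], 100: [0.20, 0.22]})` selects `k = 20`, since its mean score over the two placeholder samples is the higher one.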

TOEFL Test
In Figure 2 we show the TOEFL correct answer fraction vs. the corpus size. We varied the corpus size by both methods: the out-of-domain documents discarding and the random documents discarding. As in the categorization test procedure, a wide range of dimension values were tested, using in each case the dimension that produces the best mean performance.
In both models, performance decreases when documents are randomly discarded (dashed lines in Figure 2). For LSA, the elimination of out-of-domain documents does not produce a significant performance variation, which shows that LSA cannot take advantage of out-of-domain documents. These results contradict Landauer and Dumais's (1997) observation of indirect learning. We believe that this difference is due to the lack of adjustment in the number of dimensions. On the other hand, Word2vec shows the same behaviour as in the categorization test. The performance when the out-of-domain documents are discarded shows a small downward trend (not significant, with p-val=0.31 in a two-sided Kolmogorov-Smirnov test), but not as pronounced as in the random document discarding method. Unlike the categorization test, the performance measure in the TOEFL test presents a high variability. This observation is consistent with the large fluctuations shown in Landauer and Dumais (1997). Despite this, we consider it relevant to use this test to be able to compare with the results obtained by Landauer and Dumais (1997).

Conclusion and Discussion
Despite the popularity of word embeddings in several semantic representation tasks, the way in which they acquire new semantic relations between words is unclear. In particular, for the case of LSA there are two opposing views about the effect of incorporating out-of-domain documents. From one point of view, training LSA with a specific subcorpus, cleaned of documents unrelated to the specific task, increases performance (Stone et al., 2006). From the other point of view, the presence of unrelated documents improves the representations. The second viewpoint is supported by the conception that the SVD in LSA can capture high-order co-occurrence relations between words (Landauer & Dumais, 1997; Lemaire & Denhiere, 2006; Turney & Pantel, 2010). Based on this, LSA is used as a plausible model of human semantic memory given that it can capture indirect relations (high-order word co-occurrences).
In the present article we studied the effect of out-of-domain documents on the construction of LSA and Word2vec semantic representations. We compared two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of out-of-domain documents. The quality of the semantic representations was measured in two different tasks: a semantic categorization test and a TOEFL exam. Additionally, we varied the number of word-embedding dimensions (k) over a wide range. The number of dimensions was varied among {5, 10, 20, 50, 100, 300, 500, 1000} for LSA and among {5, 10, 20, 50, 100, 300, 500} for Word2vec. Due to the high computational effort, in the case of Word2vec we avoided using 1000 dimensions.
We have shown that Word2vec can take advantage of all the documents, obtaining its best performance when it is trained with the whole corpus. In contrast, LSA's word-representation quality increases with a specialization of the training corpus (removal of out-of-domain documents) accompanied by a decrease of k. Furthermore, we have shown that specialization without a decrease of k can produce a strong performance reduction. Thus, we point out the need to vary k when the dependency on corpus size is studied. From a cognitive-modeling point of view, we point out that LSA's word-knowledge acquisition does not take advantage of indirect learning (high-order word co-occurrences), while Word2vec's does. This throws light upon the capabilities and limitations of word embeddings in modeling human cognitive tasks, such as human word-learning (Landauer, 2007; Landauer & Dumais, 1997; Lemaire & Denhiere, 2006), semantic memory (Denhière & Lemaire, 2004; Kintsch & Mangalath, 2011; Landauer, 2007) and word classification (Laham, 1997).

A Supplementary Materials: Latent Semantic Domain
In Figure 3 it can be seen that performance decreases when documents are randomly discarded (bottom panels). However, the dependency on out-of-domain documents (top panels) varies with the number of dimensions. In the cases of 300, 500 and 1000 dimensions, the performance decreases when out-of-domain documents are eliminated. In contrast, we obtain the opposite behavior in the cases of 5, 10, 20, 50 and 100 dimensions, in which the elimination of out-of-domain documents increases LSA's categorization performance.
Consider the case when k is fixed at the value that maximizes the performance with the entire corpus (around k = 100). When the corpus is "cleaned" of out-of-domain documents, the remaining corpus will have not only fewer documents, but also less topic diversity between texts. Thus, the number of dimensions (k) needed to generate a proper semantic representation should be reduced. As k is fixed at a high value, LSA may not eliminate enough noisy dimensions, leading to a decrease in performance. This effect becomes larger when the selected k is higher, as can be seen for k = 300. On the other hand, consider the case when k is fixed at the value that maximizes the performance with the "cleaned" corpus (around k = 20). The presence of out-of-domain documents in the complete corpus increases the topic diversity. As k is fixed at a low value, LSA will not have enough dimensions to represent all the intrinsic complexity of the whole corpus. So, when the corpus is "cleaned" of out-of-domain documents, the performance should increase.
On the other hand, Word2vec presents a performance decrease, with almost all dimension values, when out-of-domain documents are eliminated. Moreover, the discarding of out-of-domain documents does not require a considerable decrease in the number of Word2vec dimensions. These findings suggest that Word2vec does not encode latent semantic domains; however, more analysis must be done in this direction.