K-Embeddings: Learning Conceptual Embeddings for Words using Context

We describe a technique for adding contextual distinctions to word embeddings by extending the usual embedding process into two phases. The first phase resembles existing methods, but also constructs K classifications of concepts. The second phase uses these classifications in developing refined K embeddings for words, namely word K-embeddings. The technique is iterative, scalable, and can be combined with other methods (including Word2Vec) in achieving still more expressive representations. Experimental results show consistently large performance gains on a Semantic-Syntactic Word Relationship test set for different K settings. For example, an overall gain of 20% is recorded at K = 5. In addition, we demonstrate that an iterative process can further tune the embeddings and gain an extra 1% (K = 10 in 3 iterations) on the same benchmark. The examples also show that polysemous concepts are meaningfully embedded in our K different conceptual embeddings for words.


Introduction
Neural-based word embeddings are vectorial representations of words in high-dimensional real-valued space. Success with these representations has resulted in their being considered for an increasing range of natural language processing (NLP) tasks. Recent advances in word embeddings have pushed forward state-of-the-art results in NLP (Koo et al., 2008; Turian et al., 2010; Collobert et al., 2011; Yu et al., 2013; Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c). Embedding learning models for words are also being adapted for tasks in other research fields (Reinanda et al., 2015; Vu and Parker, 2015). The Continuous Bag-of-Words (CBOW) and Skip-gram models (Mikolov et al., 2013a) are currently considered state-of-the-art learning algorithms for word embeddings.
The ability of words to assume different roles (syntax) or meanings (semantics) presents a basic challenge to the notion of word embedding (Erk and Padó, 2008; Reisinger and Mooney, 2010; Huang et al., 2012; Tian et al., 2014; Neelakantan et al., 2014; Chen et al., 2015). External resources and features have been introduced to address this challenge. Yet individuals with no linguistic background can generally resolve these differences without difficulty. For example, they can distinguish "bank" as referring to a riverside or a financial establishment without semantic or syntactic analysis.
Distinctions of role and meaning often follow from context. The idea of exploiting context in linguistics was introduced with a distributional hypothesis: "linguistic items with similar distributions have similar meanings" (Harris, 1954). Firth soon afterwards emphasized this in a famous quote: "a word is characterized by the company it keeps" (1957).
We propose to exploit only context information to distinguish different concepts behind words in this paper. The contribution of this paper is to note that a two-phase word embedding training can be helpful in adding contextual information to existing embedding methods:
• we use learned context embeddings to efficiently cluster word contexts into K classifications of concepts, independent of the word embeddings.
• this approach can complement existing sophisticated, linguistically-based features, and can be used with word embeddings to achieve gains in performance by considering contextual distinctions for words.
• two-phase word embedding may have other applications as well, conceivably permitting some 'non-linear' refinements of linear embeddings.
In the next section we present our learning strategy for word K-embeddings, outlining how the value of K affects its power in increasing syntactic and semantic distinctions. Following this, a large-scale experiment serves to validate the idea from several different perspectives. Finally, we offer conclusions about how adding contextual distinctions to word embeddings (with our second phase of embedding) can gain power in distinguishing among different aspects of words.

Learning Word K-Embeddings
The use of multiple semantic representations for a word in resolving polysemy has a significant literature (Erk and Padó, 2008; Reisinger and Mooney, 2010; Huang et al., 2012; Tian et al., 2014; Neelakantan et al., 2014; Chen et al., 2015). Strategies often focus on discrimination using syntactic and semantic information.
We investigate another direction, the extension of the word embedding process into a second phase, which allows context information to be consolidated with the embedding. Rather than annotating words with features, our technique treats context as second-order in nature, suggesting an additional representation step.
Our learning strategy for word K-embeddings is therefore done, possibly iteratively, in two phases:
1. Annotating words with concepts (defined by their contextual clusters)
2. Training embeddings using the resulting annotated text.

Concept Annotation using Context Embeddings
We propose to annotate words with concepts given by learned context embeddings, which are an underutilized output of word embedding training. Our strategy is based on the assumption that the context of a word is useful for discriminating its conceptual alternatives in polysemy. Our concept annotation for words is performed in two steps: clustering of context embeddings, followed by annotation. Specifically, we first employ a clustering algorithm to cluster the context embeddings. K-means is our algorithm of choice. The clustering algorithm assigns each context word to exactly one cluster. This result is then used to re-assign words in the training data to their contextual cluster.
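The clustering step can be sketched as follows. This is a minimal illustration, assuming scikit-learn's KMeans (the library named in the Settings section); the toy vectors, the word list, and the names context_vectors and cluster_of are hypothetical and not taken from the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "context embeddings": one row per context word (values are made up).
context_words = ["river", "shore", "water", "loan", "money", "credit"]
context_vectors = np.array([
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # water-like contexts
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # finance-like contexts
])

K = 2  # number of concept clusters

# Fit K-means; each context word receives exactly one cluster id in [0, K).
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(context_vectors)
cluster_of = {w: int(c) for w, c in zip(context_words, kmeans.labels_)}
print(cluster_of)
```

The cluster ids themselves are arbitrary; only the grouping matters for the annotation step that follows.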
Second, we annotate words in the training data with their most common contextual cluster (of their context words). We define context words to mean the surrounding words of a given word. Formally, a word w is annotated with a concept given by the following function:

$$\mathrm{concept}(w) = \arg\max_{c_i} \sum_{w_j \in W} f(c_i, c_j)$$

Here W is the set of context words of the current word, c_j is the cluster assigned to context word w_j, and f(c_i, c_j) is a boolean function whose output is 1 if the input parameters are equal:

$$f(c_i, c_j) = \begin{cases} 1 & \text{if } c_i = c_j \\ 0 & \text{otherwise} \end{cases}$$

The cluster-annotated dataset is then passed into the next training phase.
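The annotation rule amounts to a majority vote over the clusters of the context words. The sketch below is one possible reading, not the implementation used in the experiments: the function name annotate, the "word_cluster" token format, the symmetric window, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

def annotate(tokens, cluster_of, window=5):
    """Label each token with the most common cluster among its context words."""
    annotated = []
    for i, word in enumerate(tokens):
        # Context = up to `window` tokens on each side (assumed), excluding the word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        votes = Counter(cluster_of[w] for w in context if w in cluster_of)
        if votes:
            # Highest count wins; ties go to the smallest cluster id (an assumption).
            concept = min(votes, key=lambda c: (-votes[c], c))
            annotated.append(f"{word}_{concept}")
        else:
            annotated.append(word)  # no clustered context word: leave unannotated
    return annotated

# Hypothetical cluster assignments for four context words.
cluster_of = {"river": 0, "water": 0, "loan": 1, "money": 1}
print(annotate(["river", "bank", "water"], cluster_of))  # ['river_0', 'bank_0', 'water_0']
print(annotate(["loan", "bank", "money"], cluster_of))   # ['loan_1', 'bank_1', 'money_1']
```

In this toy case "bank" receives a different concept label in each sentence, which is the distinction the second training phase then turns into separate embeddings.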

Training Word K-Embeddings
The second phase is similar to existing word embedding training systems. The number of clusters K defines the maximum number of different representations for words. Table 1 presents the statistics for different selections of K using the dataset mentioned in the Experiments section.
Each value K in Table 1 is shown with the total number of embeddings and vocabulary size. Words in the vocabulary can have up to K different embeddings for different annotated concepts. As K increases, the size of the vocabulary decreases, yet remains largely stable for different values of K greater than 1. This is explained by the count of words being scattered to different concepts, resulting in a lower word count per concept. In our setting, concept-annotated words with fewer than 5 occurrences will be discarded during training of word embeddings.
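To make the effect of the occurrence threshold concrete, the following sketch counts embeddings and vocabulary size for a toy list of annotated tokens. The function name embedding_stats and the "word_cluster" token format are hypothetical; only the threshold of 5 comes from the text.

```python
from collections import Counter

MIN_COUNT = 5  # threshold stated in the text

def embedding_stats(annotated_tokens, min_count=MIN_COUNT):
    """Return (number of embeddings, vocabulary size, average embeddings per word)."""
    counts = Counter(annotated_tokens)
    # Each surviving concept-annotated token gets its own embedding.
    kept = [tok for tok, n in counts.items() if n >= min_count]
    # Strip the hypothetical "_<cluster>" suffix to recover the base word.
    vocabulary = {tok.rsplit("_", 1)[0] for tok in kept}
    average = len(kept) / len(vocabulary) if vocabulary else 0.0
    return len(kept), len(vocabulary), average

tokens = (["bank_0"] * 6 + ["bank_1"] * 5 + ["bank_2"] * 2
          + ["river_0"] * 7 + ["loan_1"] * 4)
print(embedding_stats(tokens))  # (3, 2, 1.5)
```

Here "bank_2" and "loan_1" fall below the threshold, so "bank" keeps two embeddings while "loan" drops out of the vocabulary entirely, mirroring how scattering counts across concepts can shrink the vocabulary.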
It is interesting to note that the total number of embeddings is broadly stable and less affected by K. For example, as we allow up to 10 different concepts for a word (K = 10), the total number of embeddings grows only slightly compared to the result for K = 1. The average number of embeddings for a word is 1.86 for K = 10. In other words, concept annotations do converge as we increase K.

Word K-Embedding Training Workflow
Figure 1 presents our proposed workflow to train context-based conceptual word K-embeddings. Our system allows each word to have at most K different embeddings, where each is a representation for a certain concept.
The input to the workflow is a large-scale text dataset. Initially, we compute context embeddings for words as presented previously. We can derive context embeddings directly from the training of almost any context-based word embeddings, where word embeddings are computed via their context words.
Subsequently, we cluster context embeddings into groups which reflect varied concepts on some semantic vector space. Each context embedding is assigned to a cluster denoting its conceptual role as a context word. Any clustering algorithm for vectors can be applied for this task.
Embeddings of annotated context words are used to compute concepts of words in a sentence. We hypothesize that the concept of a word is defined by the concept of its surrounding words. We annotate concepts for all words in the training data.
Finally, the concept-annotated training data is passed into any standard algorithm for training word embeddings to obtain the conceptual word K-embeddings.

Settings
Our training data for word embeddings is Wikipedia for English, downloaded on November 29, 2014. It consists of 4,591,457 articles, with a total of 2,015,823,886 words. The dataset is pre-processed with sentence and word tokenization. We convert text to lower-case prior to training. We consider |W| = 5 for the size of the context window W presented in Section 2.1.
We used the Semantic-Syntactic Word Relationship test set (Mikolov et al., 2013a) for our experimental studies. This dataset consists of 8,869 semantic and 10,675 syntactic queries. Each query is a tuple of four words (A, B, C, D) for the question "A is to B as C is to what?". These queries can be either semantic or syntactic. D, to be predicted from the learned embeddings, is defined as the closest word to the vector (B − A + C). We used Word2Vec for training and scikit-learn for clustering tasks.
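A query of this kind can be answered with a few lines of vector arithmetic. The sketch below uses the conventional offset B − A + C from Mikolov et al. (2013a) and cosine similarity; the function names and the toy three-dimensional vectors are hypothetical and chosen only so that the example is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_analogy(emb, a, b, c):
    """Return the word closest (by cosine) to emb[b] - emb[a] + emb[c]."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))  # exclude query words
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Made-up embeddings; the second coordinate plays the role of "gender".
emb = {
    "man": [1.0, 0.0, 0.2], "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 1.0], "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 3.0],
}
print(answer_analogy(emb, "man", "woman", "king"))  # queen
```

Excluding the three query words from the candidates follows common practice for this benchmark; whether the experiments did the same is not stated in the text.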
We evaluate the accuracy of the prediction of D in these queries. A query is considered a hit if there exists at least one correct match and all the words are in the same concept group. This is based on the assumption that if "A is to B as C is to D", either (A, B) and (C, D) or (A, C) and (B, D) have to be in the same concept group.

Results
The embeddings learned in phases 1 and 2 can be compared, using different values for K in the K-means clustering. Word relationship performance results are shown in Table 2.
Our proposed technique in phase 2 achieves consistently high performance. For example, when K = 5, our absolute performance is 89% and 81% in semantic and syntactic relationship evaluations, gaining 24% and 16% over the standard CBOW model (phase 1). The best combined result is obtained at K = 25. As shown in Table 1, the total number of embeddings and vocabulary size differ by a small multiplicative factor as K increases.
In another comparison, Figure 2 plots our K-Embeddings results versus the results of a relaxed evaluation for CBOW, which considers the top K embeddings instead of the best. Even though our evaluation is restricted to one-best for each of the K embeddings, the overall (combined) performance for different K settings is still consistently better than the top K embeddings of CBOW. Moreover, for a specific K setting, the total number of different embeddings considered in K-Embeddings is always less than that of the top K. For example, in our peak result (K = 25), the total number of embeddings considered in the evaluation set is only about 76.17% of the total embeddings with the top 25 of CBOW.
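The relaxed baseline can be read as a top-K criterion: a query counts as correct if the gold answer appears among the K nearest candidates rather than only at rank one. The sketch below illustrates that reading under stated assumptions; the function name top_k_hit, the offset B − A + C, and the toy vectors are hypothetical, and the exact protocol behind Figure 2 may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_hit(emb, a, b, c, gold, k):
    """True if `gold` is among the k candidates nearest to emb[b] - emb[a] + emb[c]."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    ranked = sorted((w for w in emb if w not in (a, b, c)),
                    key=lambda w: cosine(emb[w], target), reverse=True)
    return gold in ranked[:k]

# Made-up vectors in which a distractor ("prince") outranks the gold answer.
emb = {
    "man": [1.0, 0.0, 0.0], "woman": [1.0, 1.0, 0.0], "king": [3.0, 0.0, 1.0],
    "queen": [3.0, 0.5, 1.5], "prince": [3.0, 0.9, 1.1], "apple": [0.0, 0.0, 3.0],
}
print(top_k_hit(emb, "man", "woman", "king", "queen", k=1))  # False
print(top_k_hit(emb, "man", "woman", "king", "queen", k=2))  # True
```

With k = 1 this reduces to the strict one-best evaluation, so the relaxed score can never be lower than the strict one for the same embeddings.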
In addition, we compare the performance of K-embeddings over multiple iterations under the same K setting in Table 3. It shows that the K-embeddings improve after a certain number of iterations. In particular, for K = 10, we achieve the best performance after 3 to 4 iterations, gaining roughly 1%.
Finally, it is also worth noting that the performance does not always increase linearly with the number of embeddings or vocabulary size. This suggests that as we achieve better performance in K-embeddings, we should also gain more compact conceptual embeddings.
Table 3: Performance of K = 10 in five iterations

Word Expressivity Analysis
The expressivity of word groups for "mercury" and "fan" is studied in Table 4. The first two rows show the most related words of "mercury" and "fan" without concept annotation (baseline). The following rows present our K-embeddings result. This table illustrates the differences that arise in multiple representations of a word, and shows semantic distinctions among these representations. For example, different representations for the word "mercury" indeed represent a spectrum of aspects for the word, ranging from the cosmos and the chemical element to automobiles and even music. The same can be seen for "fan", where we find concepts related to fan as a follower/supporter, fan as in machinery, or Fan as a common Chinese surname. Indeed, we can find many different conceptual readings of these words. These not only reflect different polysemous meanings, but also their conceptual aspects in the real world. Observe that most related words are grouped into distinct concept groups, and thus yield strong semantic distinctions. The result suggests that context embeddings, like word embeddings, can capture linguistic regularities efficiently.

Conclusion
In this paper, we have presented a technique for adding contextual distinctions to word embeddings with a second phase of embedding. This contextual information gains power in distinguishing among different aspects of words. Experimental results with embedding of the English variant of Wikipedia (over 2 billion words) show significant improvements in both semantic- and syntactic-based word embedding performance. The results also present a wide range of interesting concepts of words in the expressivity analysis. These results strongly support the idea of using context embeddings to exploit context information for problems in NLP. As we highlighted earlier, context embeddings are underutilized, even though word embeddings have been extensively exploited in multiple applications.
Furthermore, the contextual approach can complement existing sophisticated, linguistically-based features, and can be combined with other learning methods for embedding. These results are encouraging; they suggest that useful extensions of current methods are possible with two-phase embeddings.