Right-truncatable Neural Word Embeddings

This paper proposes an incremental learning strategy for neural word embedding methods, such as SkipGrams and Global Vectors. Because our method generates embedding vectors iteratively, one dimension at a time, the obtained vectors have a unique property: any right-truncated vector matches the solution of the corresponding lower-dimensional embedding. Therefore, a single embedding vector can serve the wide range of dimensional requirements imposed by different uses and applications.

The main purpose of this paper is to further enhance the 'usability' of obtained embedding vectors in actual use. To briefly explain our motivation, we first introduce the following concept:
Definition 1 ($D'$-right-truncated vector). Let $w'$ and $w''$ be vectors whose dimensions are $D'$ and $D''$, respectively. Namely, $w' = (w'_1, \ldots, w'_{D'})$ and $w'' = (w''_1, \ldots, w''_{D''})$. Suppose $w$ matches the concatenation of $w'$ and $w''$, that is, $w = (w'_1, \ldots, w'_{D'}, w''_1, \ldots, w''_{D''})$. Then, we define $w'$ as a $D'$-right-truncated vector of $w$.
This paper focuses on the fact that the appropriate dimension of embedding vectors depends strongly on the application and use, and is basically determined by the trade-off between performance and memory space (or calculation speed). Indeed, the dimensions actually used in the previous studies listed above are diverse: often around 50, and at most 1000. It is worth noting here that each dimension of the embedding vectors obtained by conventional methods has no interpretable meaning. Thus, we basically need to retrain $D'$-dimensional embedding vectors even if we already have well-trained $D$-dimensional vectors. In addition, we cannot take full advantage of freely available high-quality pre-trained embedding vectors, since their dimensions are already given and fixed, e.g., $D = 300$.
To reduce the additional computational cost of retraining, and to improve the 'usability' of embedding vectors, we propose a framework for incrementally determining embeddings one dimension at a time from 1 to $D$. As a result, our method always offers the relation that 'any $D'$-right-truncated embedding vector is the solution for $D'$-dimensional embeddings of our method'. Therefore, in actual use, we only need to construct a relatively higher-dimensional embedding vector 'just once', e.g., $D = 1000$, and then truncate it to an appropriate dimension for the application.
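As a minimal illustration of Definition 1 and of the 'train once, truncate later' usage, the following sketch truncates a made-up vector. The helper name `right_truncate` and the numeric values are assumptions introduced only for this example.

```python
def right_truncate(w, d_prime):
    """Return the d_prime-right-truncated vector of w (its first d_prime coordinates)."""
    if not 0 < d_prime <= len(w):
        raise ValueError("d_prime must satisfy 0 < d_prime <= len(w)")
    return w[:d_prime]


# Hypothetical 6-dimensional embedding of a single word.
w = [0.8, -0.3, 0.5, 0.1, -0.7, 0.2]

w_prime = right_truncate(w, 2)   # D' = 2
w_rest = w[2:]                   # the remaining D'' = 4 coordinates

# w is the concatenation of w' and w'', as required by Definition 1.
assert w_prime + w_rest == w
```

With a conventional method, `w_prime` would generally not be a good 2-dimensional embedding; the point of the proposed method is that it is.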

Neural Word Embedding Methods
Let $U$ and $V$ be two sets of predefined vocabularies of possible inputs and outputs, and let $|U|$ and $|V|$ be the number of words in $U$ and $V$, respectively. Neural word embedding methods generally assign a $D$-dimensional vector to each word in $U$ and $V$. We denote by $e_i$ the $i$-th input vector and by $o_j$ the $j$-th output vector. In the rest of this paper, for convenience, $i$ is always used as the index of input vectors and $j$ as the index of output vectors, where $1 \le i \le |U|$ and $1 \le j \le |V|$.
We introduce $E$ and $O$, which represent the lists of all input and output vectors, respectively. Namely, $E = (e_1, \ldots, e_{|U|})$ and $O = (o_1, \ldots, o_{|V|})$. $X$ represents the training data. Then, embedding vectors are obtained by solving a minimization problem of the following form, defined in each neural word embedding method:
$$(\hat{E}, \hat{O}) = \arg\min_{E, O} \left\{ \Psi(E, O \mid X) \right\}, \tag{1}$$
where $\Psi$ represents the objective function, and $\hat{E}$ and $\hat{O}$ are the lists of solution embedding vectors.
Hereafter, we use $\Psi$ as an abbreviation of $\Psi(E, O \mid X)$. For example, the objective function $\Psi$ of 'SkipGram with negative sampling (SGNS)' can be written in the following form:
$$\Psi = \sum_{(i,j)} \left( c_{i,j}\, L(x_{i,j}) + c'_{i,j}\, L(-x_{i,j}) \right), \tag{2}$$
where $x_{i,j} = e_i \cdot o_j$, and $L(x)$ represents the logistic loss function, namely, $L(x) = \log(1 + \exp(-x))$. Moreover, $c_{i,j}$ and $c'_{i,j}$ represent the co-occurrences of the $i$-th input and $j$-th output words in the training data and the negative sampling data, respectively. As another example, the objective function $\Psi$ of 'Global Vector (GloVe)' can be written in the following form (Pennington et al., 2014):
$$\Psi = \sum_{(i,j)} \beta_{i,j} \left( x_{i,j} - m_{i,j} \right)^2, \tag{3}$$
where $m_{i,j}$ and $\beta_{i,j}$ represent certain co-occurrence and weighting factors of the $i$-th input and the $j$-th output words, respectively. For example, $m_{i,j} = \log(c_{i,j})$ and $\beta_{i,j} = \min\left(1, (c_{i,j} / x_{\max})^{\gamma}\right)$ (Pennington et al., 2014), where $x_{\max}$ and $\gamma$ are tunable hyper-parameters.
Figure 1 (algorithm listing; only its input line survives): Input: $X$: training data, $D$: maximum number of dimensions (iterations).
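The two objectives described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: co-occurrence counts are stored in dictionaries keyed by `(i, j)`, the GloVe weighting is taken to be $\beta_{i,j} = \min(1, (c_{i,j}/x_{\max})^{\gamma})$ with $m_{i,j} = \log c_{i,j}$, and the default values `x_max=100.0` and `gamma=0.75` as well as all function names are choices made for this example.

```python
import math


def logistic_loss(x):
    """L(x) = log(1 + exp(-x)), evaluated in a numerically stable way."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_objective(E, O, c, c_neg):
    """Sum over (i, j) of c[i,j] * L(x) + c_neg[i,j] * L(-x), with x = e_i . o_j."""
    total = 0.0
    for i, e in enumerate(E):
        for j, o in enumerate(O):
            x = dot(e, o)
            total += c.get((i, j), 0.0) * logistic_loss(x)
            total += c_neg.get((i, j), 0.0) * logistic_loss(-x)
    return total


def glove_objective(E, O, c, x_max=100.0, gamma=0.75):
    """Sum over observed (i, j) of beta * (x - m)^2 with m = log(c)."""
    total = 0.0
    for (i, j), count in c.items():      # counts are assumed to be positive
        x = dot(E[i], O[j])
        beta = min(1.0, (count / x_max) ** gamma)
        total += beta * (x - math.log(count)) ** 2
    return total


# Toy data (made up): 2 input words, 2 output words, D = 2.
E = [[0.1, 0.2], [-0.3, 0.4]]
O = [[0.5, -0.1], [0.2, 0.3]]
c = {(0, 0): 3.0, (0, 1): 1.0, (1, 1): 2.0}
c_neg = {(0, 1): 1.0, (1, 0): 2.0}

print(sgns_objective(E, O, c, c_neg), glove_objective(E, O, c))
```

Pairs that never co-occur are skipped in `glove_objective`, which corresponds to a weight of zero for those pairs.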

Incremental Construction of Embedding
This section explains our proposed method. The basic idea is simple: we convert the minimization problem shown in Eq. 1 into a series of minimization problems, each of which determines one additional dimension of every embedding vector. We refer to this formulation of embedding problems as the 'ITerative Additional Coordinate Optimization (ITACO)' formulation. Fig. 1 shows our entire optimization algorithm for this formulation.
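The loop described above can be sketched as follows. The function name `itaco`, its arguments, and the bias bookkeeping (starting from an all-zero bias matrix) are assumptions made for this illustration. `toy_solver` is a deliberately simple stand-in for the one-dimensional solver, not the procedure used in the paper: it fits a rank-one term to the residual of a small made-up target matrix by alternating least squares.

```python
def itaco(update_params_1d, n_in, n_out, D):
    """Build D-dimensional embeddings one coordinate at a time.

    update_params_1d(B) must return (q, r): one new coordinate for every
    input word and every output word, given the current bias matrix B.
    """
    E = [[] for _ in range(n_in)]
    O = [[] for _ in range(n_out)]
    B = [[0.0] * n_out for _ in range(n_in)]   # assumed: no bias before step 1
    for _ in range(D):
        q, r = update_params_1d(B)
        for i in range(n_in):
            E[i].append(q[i])                  # the i-th entry of q extends e_i
        for j in range(n_out):
            O[j].append(r[j])
        # Earlier solutions become fixed bias terms for the later steps.
        B = [[B[i][j] + q[i] * r[j] for j in range(n_out)] for i in range(n_in)]
    return E, O


# Stand-in 1-D solver (NOT the paper's): alternating least squares that fits
# q[i] * r[j] to the residual M - B of a toy target matrix M.
M = [[2.0, 0.5, -1.0],
     [0.3, 1.5, 0.7]]


def toy_solver(B, sweeps=50):
    n_in, n_out = len(M), len(M[0])
    q, r = [1.0] * n_in, [0.0] * n_out
    for _ in range(sweeps):
        qq = sum(v * v for v in q)
        r = [sum((M[i][j] - B[i][j]) * q[i] for i in range(n_in)) / qq
             for j in range(n_out)]
        rr = sum(v * v for v in r)
        q = [sum((M[i][j] - B[i][j]) * r[j] for j in range(n_out)) / rr
             for i in range(n_in)]
    return q, r


E2, O2 = itaco(toy_solver, 2, 3, D=2)
E1, O1 = itaco(toy_solver, 2, 3, D=1)

# Right-truncating the 2-dimensional result reproduces the 1-dimensional one.
assert [e[:1] for e in E2] == E1 and [o[:1] for o in O2] == O1
```

Because step $d$ only sees the solutions of steps $1, \ldots, d-1$, the first $D'$ coordinates do not depend on how many further coordinates are computed, which is exactly the right-truncation property.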

Bias terms and optimization variables
Suppose $d$ represents a discrete time step, where $d \in \{1, \ldots, D\}$. Let $B^{(d)}$ be a matrix representation of the bias terms at the $d$-th time step, whose $(i, j)$ entry is $b^{(d)}_{i,j}$. For all $(i, j)$ and $d$, these entries have the following recursive relation:
$$b^{(d)}_{i,j} = b^{(d-1)}_{i,j} + e_{i,d}\, o_{j,d}, \tag{4}$$
where we define $b^{(0)}_{i,j} = 0$. This relation implies that the solutions of former optimizations are used as bias terms in latter optimizations.
Next, we define $q_d$ and $r_d$ as the vector representations of the concatenation of all the input and output parameters at the $d$-th step, respectively, that is,
$$q_d = (e_{1,d}, \ldots, e_{|U|,d}), \qquad r_d = (o_{1,d}, \ldots, o_{|V|,d}).$$
Note that $e_i$ used in the former part of this paper is a $D$-dimensional vector, while $q_d$ and $r_d$ defined here are $|U|$-dimensional and $|V|$-dimensional vectors, respectively. Moreover, $e_{i,d}$ is the $d$-th factor of $e_i$ and, at the same time, the $i$-th factor of $q_d$. Fig. 2 illustrates the relation of $e_i$ and $q_d$ in this paper. For reasons of space, we do not explicitly show the relation of $o_j$ and $r_d$, which are used to represent output vectors; they have the same relation as $e_i$ and $q_d$.
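The following sketch illustrates the two views of the same parameters (rows $e_i$ versus columns $q_d$) and the recursive bias update. It assumes the recursion starts from an all-zero bias matrix; the toy values and the helper names `column` and `next_bias` are introduced only for this example. Under that assumption, the bias entry after step $d$ equals the inner product of the $d$-right-truncated vectors.

```python
# Hypothetical toy embeddings: |U| = 3 input words, |V| = 2 output words, D = 3.
E = [[0.5, -1.0, 0.2],
     [1.5, 0.3, -0.4],
     [-0.7, 0.8, 0.1]]
O = [[1.0, 0.5, -0.3],
     [-0.2, 0.4, 0.9]]


def column(vectors, d):
    """q_d (or r_d): the d-th coordinate (1-based) of every vector."""
    return [v[d - 1] for v in vectors]


def next_bias(B_prev, q_d, r_d):
    """One step of the recursion: add e_{i,d} * o_{j,d} to every bias entry."""
    return [[B_prev[i][j] + q_d[i] * r_d[j] for j in range(len(r_d))]
            for i in range(len(q_d))]


D = 3
B = [[0.0] * len(O) for _ in E]            # assumed starting point: all zeros
for d in range(1, D + 1):
    B = next_bias(B, column(E, d), column(O, d))
    # After step d, B[i][j] is the inner product of the truncated vectors.
    for i, e in enumerate(E):
        for j, o in enumerate(O):
            truncated_dot = sum(a * b for a, b in zip(e[:d], o[:d]))
            assert abs(B[i][j] - truncated_dot) < 1e-12
```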

Individual optimization problem
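Then, we define the $d$-th optimization problem in our ITACO formulation as follows:
$$(\hat{q}_d, \hat{r}_d) = \arg\min_{q_d, r_d} \left\{ \Psi(q_d, r_d \mid X, B^{(d-1)}) \right\} \quad \text{subject to} \quad \frac{|V|}{|U|} \|q_d\|_p = \|r_d\|_p, \tag{5}$$
where $\|\cdot\|_p$ represents the $L_p$-norm. We generally assume that $p \in \{1, 2, \infty\}$, and often select $p = 2$. Note that $q_d$ and $r_d$ are the optimization parameters in the $d$-th optimization problem, while $B^{(d-1)}$ is a constant. Fig. 3 illustrates the relation of $B^{(d-1)}$ and $q_d$.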
We assume that the objective function $\Psi$ takes the same form as in one of the conventional methods, such as SGNS and GloVe, shown in Eqs. 2 and 3. The difference appears in the variables: our ITACO formulation uses $x_{i,j} = e_i o_j + b_{i,j}$ rather than $x_{i,j} = e_i \cdot o_j$ as described in Sec. 2. Here $e_i$ and $o_j$ denote the scalar $d$-th coordinates $e_{i,d}$ and $o_{j,d}$, and $b_{i,j}$ stands for $b^{(d-1)}_{i,j}$.

Improving stability of embeddings
The additional norm constraint in Eq. 5 is introduced to improve stability. The optimization problems of neural word embedding methods, including SGNS and GloVe, can be categorized as bi-convex optimization problems (Gorski et al., 2007): they are convex with respect to the parameters $E$ if the parameters $O$ are assumed to be constants, and vice versa. One well-known drawback of unconstrained bi-convex optimization is that the optimization parameters can diverge to $\pm\infty$ (see Example 4.3 in (Gorski et al., 2007)). This is because the objective function depends only on the inner product of two vectors. Therefore, one parameter can easily take a very large value, e.g., $o_1 = 10^{9}$, if the other is small and approaches zero, e.g., $e_1 = 10^{-10}$. This is mainly caused by the inconsistent scale of the two parameter sets. Our norm constraint in Eq. 5 eliminates this problem by maintaining the scales of $q$ and $r$ at the same level.

Optimization algorithm
To solve Eq. 5, we employ the idea of the 'Alternating Convex Optimization (ACO)' algorithm (Gorski et al., 2007). ACO and its variants have been widely developed in the context of (non-negative) matrix factorization, e.g., (Kim et al., 2014), and are empirically known to be efficient in practice. The main idea of ACO is that it iteratively and alternately updates one parameter set, e.g., $q$, while the other parameter set, e.g., $r$, is fixed. In our case, ACO solves the following two optimization problems iteratively and alternately:
$$\hat{q} = \arg\min_{q} \left\{ \Psi(q \mid r, X, B) \right\}, \tag{6}$$
$$\hat{r} = \arg\min_{r} \left\{ \Psi(r \mid q, X, B) \right\}. \tag{7}$$
There are at least two advantages of using ACO: (1) Eqs. 6 and 7 both become convex optimization problems. Therefore, the global optimum solution can be obtained when $\partial_{e_i} \Psi = 0$ for all $i$ and $\partial_{o_j} \Psi = 0$ for all $j$, respectively. (2) ACO is guaranteed to converge to a stationary point (one of the local minima).
For example, by a simple reformulation of $\partial_{e_i} \Psi = 0$, we obtain the closed form solution of Eq. 6 with the GloVe objective, that is,
$$e_i = \frac{\sum_j \beta_{i,j} (m_{i,j} - b_{i,j})\, o_j}{\sum_j \beta_{i,j}\, o_j^2}. \tag{8}$$
Similarly, the closed form solution of Eq. 7 is:
$$o_j = \frac{\sum_i \beta_{i,j} (m_{i,j} - b_{i,j})\, e_i}{\sum_i \beta_{i,j}\, e_i^2}. \tag{9}$$
Thus, we can solve Eqs. 6 and 7 without performing iterative estimation. Next, we obtain the following equation by a simple reformulation of $\partial_{e_i} \Psi = 0$ for the SGNS objective:
$$\sum_j \left( c_{i,j}\, \sigma(-x_{i,j}) - c'_{i,j}\, \sigma(x_{i,j}) \right) o_j = 0, \tag{10}$$
where $\sigma(x)$ represents the sigmoid function, that is, $\sigma(x) = 1 / (1 + \exp(-x))$. Similarly, we also obtain the following form of the equation for Eq. 7:
$$\sum_i \left( c_{i,j}\, \sigma(-x_{i,j}) - c'_{i,j}\, \sigma(x_{i,j}) \right) e_i = 0. \tag{11}$$
These equations are efficiently solvable by a simple binary search procedure, since each equation has only a single parameter, that is, $e_i$ or $o_j$.
During the optimization, there is no guarantee that the constraint $\frac{|V|}{|U|} \|q\|_p = \|r\|_p$ always holds. Fortunately, the following transformations always satisfy this norm constraint:
$$\tilde{e}_i = s\, e_i, \qquad \tilde{o}_j = \frac{o_j}{s}, \qquad s = \sqrt{\frac{|U|\, \|r\|_p}{|V|\, \|q\|_p}}, \tag{12}$$
which also maintain $\tilde{e}_i \tilde{o}_j = e_i o_j$ and the objective value. Thus, we can safely apply them at any time during the optimization.
    Input: X: training data, B: matrix form of bias terms, ε: constant for convergence check
    q ← 1, and r ← 0
    repeat
        r ← updateIVec1D(r | q, B)   // Eq. 11 or 9
        (q, r) ← scaleVec(q, r)      // Eq. 12
        q ← updateOVec1D(q | r, B)   // Eq. 10 or 8
        (q, r) ← scaleVec(q, r)      // Eq. 12
    until ConvergenceCheck(ε)
    Output: (q, r)
Figure 4: Procedure of updateParams1D in Fig. 1 using the ACO-based algorithm.
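The binary search mentioned for the SGNS case can be sketched as follows for a single input coordinate $e_i$ with all $o_j$ fixed. The stationarity condition is written here as $\sum_j \left( (c_{i,j} + c'_{i,j})\, \sigma(e_i o_j + b_{i,j}) - c_{i,j} \right) o_j = 0$, which follows from differentiating the SGNS objective with $x_{i,j} = e_i o_j + b_{i,j}$. Its left-hand side is nondecreasing in $e_i$ because the objective is convex in $e_i$, so bisection applies. The search bracket `[-50, 50]`, the tolerance, the toy numbers, and the function names are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sgns_gradient(e, o, b, c, c_neg):
    """Derivative of the SGNS objective w.r.t. one scalar e (nondecreasing in e)."""
    return sum(((cj + nj) * sigmoid(e * oj + bj) - cj) * oj
               for oj, bj, cj, nj in zip(o, b, c, c_neg))


def solve_by_bisection(o, b, c, c_neg, lo=-50.0, hi=50.0, tol=1e-10):
    """Find e with sgns_gradient(e) = 0 inside the assumed bracket [lo, hi]."""
    if sgns_gradient(lo, o, b, c, c_neg) > 0 or sgns_gradient(hi, o, b, c, c_neg) < 0:
        raise ValueError("no root inside the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sgns_gradient(mid, o, b, c, c_neg) < 0:
            lo = mid          # the minimizer lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy data for one input word i and three output words (made up).
o = [1.0, -0.5, 2.0]        # current output coordinates o_j
b = [0.1, 0.0, -0.2]        # bias terms b_{i,j}
c = [3.0, 1.0, 2.0]         # co-occurrence counts c_{i,j}
c_neg = [1.0, 2.0, 1.0]     # negative-sampling counts c'_{i,j}

e_hat = solve_by_bisection(o, b, c, c_neg)
print(e_hat, sgns_gradient(e_hat, o, b, c, c_neg))
```

If the data are such that the minimizer lies outside the bracket, the sketch raises an error rather than returning a misleading value; a practical implementation would need its own policy for that case.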
Finally, Fig. 4 shows the optimization procedure when using the ACO framework.
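The procedure of Fig. 4 can be sketched for the GloVe objective, where each update is a weighted least-squares closed form. Several points are assumptions of this sketch rather than details confirmed by the text: the norm constraint is taken to be $\frac{|V|}{|U|}\|q\|_p = \|r\|_p$, the rescaling multiplies $q$ by a factor $s$ and divides $r$ by the same $s$, and convergence is declared when the loss improves by less than `eps`. The names `M` (for $m_{i,j}$), `W` (for $\beta_{i,j}$), the toy values, and all function names are hypothetical.

```python
import math


def p_norm(v, p=2):
    return sum(abs(a) ** p for a in v) ** (1.0 / p)


def scale_vec(q, r, p=2):
    """Rescale so that (|V|/|U|) * ||q||_p == ||r||_p; every q[i] * r[j] is kept.

    Assumes q and r are both non-zero.
    """
    s = math.sqrt(len(q) * p_norm(r, p) / (len(r) * p_norm(q, p)))
    return [a * s for a in q], [a / s for a in r]


def glove_loss(q, r, B, M, W):
    return sum(W[i][j] * (q[i] * r[j] + B[i][j] - M[i][j]) ** 2
               for i in range(len(q)) for j in range(len(r)))


def update_params_1d(B, M, W, eps=1e-12, max_iter=1000):
    """Alternate the closed-form updates until the loss stops improving."""
    n_in, n_out = len(M), len(M[0])
    q, r = [1.0] * n_in, [0.0] * n_out
    prev = glove_loss(q, r, B, M, W)
    for _ in range(max_iter):
        # Update the output coordinates with q fixed.
        r = [sum(W[i][j] * (M[i][j] - B[i][j]) * q[i] for i in range(n_in))
             / sum(W[i][j] * q[i] ** 2 for i in range(n_in))
             for j in range(n_out)]
        q, r = scale_vec(q, r)
        # Update the input coordinates with r fixed.
        q = [sum(W[i][j] * (M[i][j] - B[i][j]) * r[j] for j in range(n_out))
             / sum(W[i][j] * r[j] ** 2 for j in range(n_out))
             for i in range(n_in)]
        q, r = scale_vec(q, r)
        cur = glove_loss(q, r, B, M, W)
        if prev - cur < eps:
            break
        prev = cur
    return q, r


# Toy problem (made up): |U| = 2 input words, |V| = 3 output words.
M = [[2.0, 0.5, -1.0], [0.3, 1.5, 0.7]]     # targets m_{i,j}
W = [[1.0, 0.5, 1.0], [0.5, 1.0, 1.0]]      # positive weights beta_{i,j}
B0 = [[0.0] * 3 for _ in range(2)]          # bias matrix before the first step

q, r = update_params_1d(B0, M, W)
print(q, r, glove_loss(q, r, B0, M, W))
```

Each closed-form update minimizes the loss exactly in one block of variables and the rescaling leaves the loss unchanged, so the loss in this sketch never increases from one iteration to the next.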

Experiments
As in previously reported neural word embedding papers, our training data was taken from a Wikipedia dump (Aug. 2014). We used the hyperwords tool for our data preparation (Levy et al., 2015).
We compared our method, ITACO, with the widely used conventional methods, SGNS and GloVe. We used the word2vec implementation to obtain the word embeddings of SGNS, and the glove implementation for GloVe. The many tunable hyper-parameters were selected based on the recommended default values of each implementation, or on the suggestions in (Levy et al., 2015). For ITACO, we selected the GloVe objective to solve Eqs. 6 and 7 since it requires a lower calculation cost than the SGNS objective.
We prepared three types of linguistic benchmark tasks, namely word similarity estimation (Similarity), word analogy estimation (Analogy), and sentence completion (SentComp). We gathered nine datasets for Similarity (Rubenstein and Goodenough, 1965; Miller and Charles, 1991; Agirre et al., 2009; Agirre et al., 2009; Bruni et al., 2014; Radinsky et al., 2011; Huang et al., 2012; Luong et al., 2013; Hill et al., 2014), three for Analogy (Mikolov et al., 2013a; Mikolov et al., 2013c), and one for SentComp (Mikolov et al., 2013a).
Table 1 shows all the results of our experiments. Results for SGNS and GloVe are the average performance of ten runs, as suggested in (Suzuki and Nagata, 2015). The rows labeled '(trunc)' show the performance of $D'$-right-truncated embedding vectors whose original dimension is $D = 1000$. Thus, for each method, they were obtained from a single set of embedding vectors with $D = 1000$. Next, the rows labeled '(retrain)' show the performance of SGNS or GloVe embeddings that were independently constructed using a standard setting and the corresponding dimension. Note that the results of 'ITACO (retrain)' are identical to those of 'ITACO (trunc)'. Moreover, 'GloVe (trunc)' and 'GloVe (retrain)' at $D = 1000$ are equivalent, as are 'SGNS (trunc)' and 'SGNS (retrain)'. Thus, these results were omitted from the table.
First, comparing '(retrain)' and '(trunc)' for SGNS and GloVe, our experimental results explicitly revealed that SGNS and GloVe with the simple truncation approach '(trunc)' cannot provide effective lower-dimensional embedding vectors. This observation strongly supports the significance of our proposed method, ITACO.
Second, in most cases ITACO successfully provided almost the same performance level as the best SGNS and GloVe (retrain) results. We emphasize that ITACO constructed embedding vectors 'just once', while SGNS and GloVe required us to retrain embedding vectors once for each dimension. In addition, a single run of ITACO for $D = 1000$ took approximately 12,000 seconds in our machine environment, which was almost equivalent to running 4 iterations of SGNS or 8 iterations of GloVe. The results of SGNS and GloVe in Table 1 were obtained with 10 iterations and 20 iterations, respectively, which is one of the standard settings for running SGNS and GloVe. This indicates that ITACO runs at about the same level of efficiency as SGNS and GloVe.

Conclusion
This paper proposed a method for generating right-truncatable word embedding vectors. Our experiments revealed that the embedding vectors obtained with our method, ITACO, work in any lower dimension as well as those obtained by SGNS and GloVe. In addition, ITACO can also be a good alternative to SGNS and GloVe in terms of the execution speed of a single run. With ITACO, we no longer need to retrain embedding vectors for different dimensions. Our method significantly reduces the total calculation cost and storage, which improves the 'usability' of embedding vectors.