Online Learning of Task-specific Word Representations with a Joint Biconvex Passive-Aggressive Algorithm

This paper presents a new, efficient method for learning task-specific word vectors using a variant of the Passive-Aggressive algorithm. Specifically, this algorithm learns a word embedding matrix in tandem with the classifier parameters in an online fashion, solving a bi-convex constrained optimization problem at each iteration. We provide a theoretical analysis of this new algorithm in terms of regret bounds, and evaluate it on both synthetic data and NLP classification problems, including text classification and sentiment analysis. In the latter case, we compare various pre-trained word vectors used to initialize our word embedding matrix, and show that the matrix learned by our algorithm vastly outperforms the initial matrix, with performance results comparable to or above the state of the art on these tasks.


Introduction
Recently, distributed word representations have become a crucial component of many natural language processing systems (Koo et al., 2008; Turian et al., 2010; Collobert et al., 2011). The main appeal of these word embeddings is twofold: they can be derived directly from raw text data in an unsupervised or weakly-supervised manner, and their latent dimensions condense interesting distributional information about the words, thus allowing for better generalization while also mitigating the presence of rare and unseen terms. While there are now many different spectral, probabilistic, and deep neural approaches for building vectorial word representations, there is still no clear understanding as to which syntactic and semantic information they really capture, and whether or how these representations really differ (Levy and Goldberg, 2014b; Schnabel et al., 2015). Also poorly understood is the relation between the word representations and the particular learning algorithm (e.g., whether linear or non-linear) that uses them as input (Wang and Manning, 2013).
What seems clear, however, is that there is no single best embedding and that their impact is very much task-dependent. This in turn raises the question of how to learn word representations that are adapted to a particular task and learning objective. Three different research routes have been explored towards learning task-specific word embeddings. A first approach (Collobert et al., 2011; Maas et al., 2011) is to learn the embeddings for the target problem jointly with additional unlabeled or (weakly-)labeled data in a semi-supervised or multi-task approach. While very effective, this joint training typically requires large amounts of data and often prohibitive processing times in the case of multi-layer neural networks (not to mention their lack of theoretical learning guarantees, due in part to their strong non-convexity). Another approach consists in training word vectors using some existing algorithm, such as word2vec (Mikolov et al., 2013), in a way that exploits prior domain knowledge (e.g., by defining more informative, task-specific contexts) (Bansal et al., 2014; Levy and Goldberg, 2014a). In this case, there is still a need for additional weakly- or hand-labeled data, and there is no guarantee that the newly learned embeddings will indeed benefit performance, as they are trained independently of the task objective. A third approach is to start with some existing pre-trained embeddings and fine-tune them to the task by integrating them into a joint learning objective, either using backpropagation (Lebret et al., 2013) or regularized logistic regression (Labutov and Lipson, 2013). These approaches in effect hit a sweet spot: they leverage pre-trained embeddings and require no additional data or domain knowledge, while directly tying the embeddings to the task learning objective.
Inspired by these latter approaches, we propose a new, online soft-margin classification algorithm, called Re-embedding Passive-Aggressive (or RPA), that jointly learns an embedding matrix in tandem with the model parameters. As its name suggests, this algorithm generalizes the Passive-Aggressive (PA) algorithm of Crammer et al. (2006) by allowing the data samples to be projected into the lower-dimensional space defined by the original embedding matrix. Our approach may be seen as extending the work of (Grandvalet and Canu, 2003), which addresses the problem of simultaneously learning the features and the weight vector of an SVM classifier. An important departure, beyond the online nature of RPA, is that it learns a full projection matrix and not just a diagonal one (which is essentially what this earlier work does). Our approach and analysis are also related to (Blondel et al., 2014), which tackles non-negative matrix factorization with the PA philosophy.
The main contributions of this paper are as follows. First, we derive a new variant of the Passive-Aggressive algorithm able to jointly learn an embedding matrix along with the weight vector of the model (section 2). Second, we provide theoretical insights by bounding the cumulative squared loss of our learning procedure over any given sequence of examples; the results we give actually pertain to a learning procedure that slightly differs from the algorithm we introduce, but that is more easily and compactly amenable to a theoretical study. Third, we further study the behavior of this algorithm on synthetic data (section 4) and we finally show that it performs well on five real-world NLP classification problems (section 5).

Algorithm
We consider the problem of learning a binary linear classification function f_{Φ,w}, parametrized by both a weight vector w ∈ R^k and an embedding matrix Φ ∈ R^{k×p} (typically, with k ≪ p), which is defined as:

f_{Φ,w}(x) = sign(⟨w, Φx⟩)

We aim at an online learning scenario, wherein both w and Φ will be updated in a sequential fashion. Given a labeled data stream S = {(x_t, y_t)}_{t=1..T}, it seems relevant at each step to solve the following soft-margin constrained optimization problem:

(w_{t+1}, Φ_{t+1}) = argmin_{w,Φ} 1/2 ‖w − w_t‖_2^2 + λ/2 ‖Φ − Φ_t‖_F^2 + Cξ^2   s.t.   ℓ_t(w; Φ; x_t) ≤ ξ      (1)

where ‖·‖_2 and ‖·‖_F stand for the ℓ2 and Frobenius norms, respectively, and C controls the "aggressiveness" of the update (as larger C values imply updates that are directly proportional to the incurred loss). We define ℓ_t(w; Φ; x_t) as the hinge loss, that is:

ℓ_t(w; Φ; x_t) = max(0, 1 − y_t⟨w, Φx_t⟩)

The optimization problem in (1) is reminiscent of the soft-margin Passive-Aggressive algorithm proposed in (Crammer et al., 2006) (specifically, PA-II), but both the objective and the constraint now include a term based on the embedding matrix Φ.
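For concreteness, the prediction function and hinge loss just defined can be sketched as follows (a minimal NumPy illustration of our own; the function names are not from the paper):

```python
import numpy as np

def predict(w, Phi, x):
    """Binary prediction f_{Phi,w}(x) = sign(<w, Phi x>)."""
    return np.sign(w @ (Phi @ x))

def hinge_loss(w, Phi, x, y):
    """Soft-margin hinge loss: max(0, 1 - y * <w, Phi x>)."""
    return max(0.0, 1.0 - y * (w @ (Phi @ x)))
```

For instance, if Φx reproduces the first two coordinates of x and w points along the first axis, a positive example on that axis attains margin 1 and thus zero loss.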
The λ regularization parameter in the objective controls the allowed divergence in the embedding parameters between iterations. Interestingly, the new objective remains convex, but the margin constraint does not, as it involves a multiplicative term between the weight vector and the embedding matrix, making the overall problem bi-convex (Gorski et al., 2007). That is, the problem is convex in w for fixed values of Φ, and convex in Φ for fixed values of w. Incidentally, the formulation presented by (Labutov and Lipson, 2013) is also bi-convex (as it also involves a similar multiplicative term), although the authors proceed as if it were jointly convex (i.e., convex in both w and Φ). In order to solve this problem, we resort to an alternating update procedure, which updates each set of parameters (i.e., either the weight vector or the embedding matrix) while holding the other fixed, until some stopping criterion is met (in our case, until the value of the objective no longer changes). As shown in Algorithm 1, this procedure allows us to compute closed-form updates similar to those of PA, and to make use of the same theoretical apparatus for analyzing RPA.
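As a rough sketch, one round of this alternating procedure might look as follows (our own reconstruction, not the authors' code; the closed-form PA-II-style updates anticipate the derivations of the next sections, and the tolerance `tol` and iteration cap are assumptions):

```python
import numpy as np

def rpa_step(w, Phi, x, y, C=1.0, lam=1.0, max_inner=50, tol=1e-6):
    """One RPA round: alternate closed-form PA-II-style updates of w and Phi
    until the objective of problem (1) stops changing (or max_inner is hit)."""
    w_t, Phi_t = w.copy(), Phi.copy()
    prev_obj = np.inf
    for _ in range(max_inner):
        # w-step: Phi held fixed; classical PA-II update on features Phi @ x.
        z = Phi @ x
        loss = max(0.0, 1.0 - y * (w @ z))
        tau = loss / (z @ z + 1.0 / (2.0 * C))
        w = w + tau * y * z
        # Phi-step: w held fixed; PA-II update in the direction y * w x^T.
        V = y * np.outer(w, x)
        loss = max(0.0, 1.0 - y * (w @ (Phi @ x)))
        tau = loss / (np.sum(V * V) + lam / (2.0 * C))
        Phi = Phi + tau * V
        # Stopping criterion: the regularized objective no longer changes.
        hinge = max(0.0, 1.0 - y * (w @ (Phi @ x)))
        obj = (0.5 * np.sum((w - w_t) ** 2)
               + 0.5 * lam * np.sum((Phi - Phi_t) ** 2)
               + C * hinge ** 2)
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return w, Phi
```

Note that fixing Phi to the identity and running only the w-step recovers the standard PA-II update.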

Unified Formalization
Suppose from now on that C and λ are fixed, and so are w_t and Φ_t: this will allow us to drop the explicit dependence on these values and to keep the notation light.

Algorithm 1 Re-embedding Passive-Aggressive
Require: aggressiveness C, regularization λ, initial w_1 and Φ_1
repeat
    Update w_{t+1} and Φ_{t+1}
until some stopping criterion is met
return w_t, Φ_t

Let Q_t be defined as:

Q_t(w, Φ) = 1/2 ‖w − w_t‖_2^2 + λ/2 ‖Φ − Φ_t‖_F^2

and margin q_t be defined as:

q_t(w, Φ) = y_t⟨w, Φx_t⟩      (4a)
          = y_t⟨Φ, w x_t^⊤⟩_F      (4b)

Here ⟨·,·⟩_F denotes the Frobenius inner product. We have purposely provided the two equivalent forms (4a) and (4b) to emphasize the (syntactic) exchangeability of w and Φ. As we shall see, this is going to be essential to derive an alternating update procedure (one which alternates the updates with respect to w and Φ) in a compact way. Given Q_t and q_t, we are now interested in solving the following optimization problem:

min_{w,Φ,ξ} Q_t(w, Φ) + Cξ^2   s.t.   q_t(w, Φ) ≥ 1 − ξ      (5)

Bi-convexity
It turns out that problem (5) is a bi-convex optimization problem: it is indeed straightforward to observe that if Φ is fixed then the problem is convex in (w, ξ), as it is the classical Passive-Aggressive II optimization problem, and if w is fixed then the problem is convex in (Φ, ξ). While theoretical results exist on the solving of bi-convex optimization problems, the machinery required pertains to combinatorial optimization, which might be too expensive. In addition, computing the exact solution of (5) would drive us away from the spirit of passive-aggressive learning, which relies on cheap and statistically meaningful (from the mistake-bound perspective) updates. This is the reason why we propose to resort to an alternating online procedure to solve a proxy of (5).

An Alternating Online Procedure
Instead of tackling problem (5) directly, we propose to solve, at each time t, either of:

min_{w,ξ} Q_t(w, Φ_t) + Cξ^2   s.t.   q_t(w, Φ_t) ≥ 1 − ξ      (6)

min_{Φ,ξ} Q_t(w_t, Φ) + Cξ^2   s.t.   q_t(w_t, Φ) ≥ 1 − ξ      (7)

This means that the optimization is performed with either Φ fixed to Φ_t or w fixed to w_t. Informally, the iterative Algorithm 1 resulting from this alternating scheme will solve at each round a simple constrained optimization problem, in which the objective is to minimize the squared Euclidean distance between the new weight vector (resp. the new embedding matrix) and the current one, while making sure that both sets of parameters achieve a correct prediction with a sufficiently high margin. Note that one may recover the standard Passive-Aggressive algorithm by simply fixing the embedding matrix to the identity matrix. Also note that if appropriate stopping criteria are retained, Algorithm 1 is guaranteed to converge to a local optimum of (5) (see (Gorski et al., 2007)).
When fully developed, problems (6) and (7) respectively write as:

min_{w,ξ} 1/2 ‖w − w_t‖_2^2 + Cξ^2   s.t.   y_t⟨w, Φ_t x_t⟩ ≥ 1 − ξ

min_{Φ,ξ} λ/2 ‖Φ − Φ_t‖_F^2 + Cξ^2   s.t.   y_t⟨Φ, w_t x_t^⊤⟩_F ≥ 1 − ξ

Using the equivalence in (4), both problems can be seen as special instances of the generic problem:

min_{u,ξ} λ/2 ‖u − u_t‖^2 + Cξ^2   s.t.   ⟨u, v_t⟩ ≥ 1 − ξ      (8)

where ‖·‖ and ⟨·,·⟩ are generalized versions of the ℓ2 norm and the inner product, respectively. This is a convex optimization problem that can be readily solved using classical tools from convex optimization to give:

u_{t+1} = u_t + τ_u v_t,   with τ_u = ℓ_t / (‖v_t‖^2 + λ/(2C))      (9)

where ℓ_t = max(0, 1 − ⟨u_t, v_t⟩); this comes from the following proposition.
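The generic update (9) is cheap to compute; a direct transcription follows (a sketch in our own notation, with `u` and `v` standing for u_t and v_t, and the hinge loss clipped at zero):

```python
import numpy as np

def generic_pa_update(u, v, C, lam):
    """Closed-form solution of the generic problem (8):
    u_{t+1} = u_t + tau_u * v_t, with tau_u = loss / (||v_t||^2 + lam/(2C))."""
    loss = max(0.0, 1.0 - float(np.dot(u, v)))          # hinge loss l_t
    tau_u = loss / (np.dot(v, v) + lam / (2.0 * C))     # aggressiveness step
    return u + tau_u * v
```

When the margin constraint is already satisfied, the loss (and hence τ_u) is zero and the update is passive, exactly as in the standard PA family.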
Proposition. The solution of problem (8) is given by the update (9).

Proof. The Lagrangian associated with the problem is given by:

L(u, ξ, τ) = λ/2 ‖u − u_t‖^2 + Cξ^2 + τ(1 − ⟨u, v_t⟩ − ξ)      (10)

with τ ≥ 0. Necessary conditions for optimality are ∇_u L = 0 and ∇_ξ L = 0, which imply u = u_t + (τ/λ)v_t and ξ = τ/(2C). Using this in (10) gives the function:

g(τ) = τ(1 − ⟨u_t, v_t⟩) − τ^2 ‖v_t‖^2/(2λ) − τ^2/(4C)

which is maximized with respect to τ when g′(τ) = 0, i.e.:

τ = (1 − ⟨u_t, v_t⟩) / (‖v_t‖^2/λ + 1/(2C))

Taking into account the constraint τ ≥ 0, the maximum of g is attained at τ̄ with:

τ̄ = ℓ_t / (‖v_t‖^2/λ + 1/(2C))

which, setting τ_u = τ̄/λ, gives (9).
The previous proposition allows us to readily obtain the solutions of (6) and (7):

w_{t+1} = w_t + τ_w y_t Φ_t x_t,   with τ_w = ℓ_t / (‖Φ_t x_t‖^2 + 1/(2C))

Φ_{t+1} = Φ_t + τ_Φ y_t w_t x_t^⊤,   with τ_Φ = ℓ_t / (‖w_t x_t^⊤‖_F^2 + λ/(2C))
Remark 1 (Hard-margin case). Note that the hard-margin version of the previous problem:

min_u λ/2 ‖u − u_t‖^2   s.t.   ⟨u, v_t⟩ ≥ 1

is degenerate from the alternating optimization point of view. It suffices to observe that the updates entailed by the hard-margin problem correspond to (9) with C set to ∞; if it happens that either Φ_0 = 0 or w_0 = 0, then one of the optimization problems (with respect to w or Φ) has no solution.

Analysis
Using the same technical tools as in (Crammer et al., 2006) and the unified formalization of section 2, we have the following result.

Proposition 3. Suppose that problem (8) is iteratively solved for a sequence of vectors v_1, ..., v_T to give u_2, ..., u_{T+1}, u_1 being given. Suppose that, at each time step, ‖v_t‖ ≤ R, for some R > 0. Let u* be an arbitrary vector living in the same space as u_1. The following result holds:

Σ_{t=1..T} ℓ_t^2 ≤ (R^2 + λ/(2C)) (‖u_1 − u*‖^2 + (2C/λ) Σ_{t=1..T} (ℓ*_t)^2)

where

ℓ*_t = max(0, 1 − ⟨u*, v_t⟩)      (16)

This proposition and the accompanying lemmas are simply a variation of the results of (Crammer et al., 2006), with the difference that they are based on the generic problem (8). The loss bound applies to a version of Algorithm 1 where one of the parameters, either the weight vector or the re-embedding matrix, is kept fixed for some time.
Proposition 3 makes use of the following lemma (see Lemma 1 in (Crammer et al., 2006)):

Lemma 1. Suppose that problem (8) is iteratively solved for a sequence of vectors v_1, ..., v_T to give u_2, ..., u_{T+1}, u_1 being given. The following holds:

Σ_{t=1..T} τ_{u,t}(2ℓ_t − τ_{u,t}‖v_t‖^2 − 2ℓ*_t) ≤ ‖u_1 − u*‖^2      (17)

Proof. As in (Crammer et al., 2006), simply set Δ_t = ‖u_t − u*‖^2 − ‖u_{t+1} − u*‖^2 and bound Σ_{t=1..T} Δ_t from above and below. For the upper bound:

Σ_{t=1..T} Δ_t = ‖u_1 − u*‖^2 − ‖u_{T+1} − u*‖^2 ≤ ‖u_1 − u*‖^2

For the lower bound, we focus on the non-trivial situation where ℓ_t > 0, otherwise Δ_t = 0 (i.e., no update is made) and the bounding is straightforward. Making use of the value of τ_u:

Δ_t = −2τ_u⟨u_t − u*, v_t⟩ − τ_u^2‖v_t‖^2

Since ℓ_t > 0, then ⟨u_t, v_t⟩ = 1 − ℓ_t; also, by the definition of ℓ*_t (see (16)), ⟨u*, v_t⟩ ≥ 1 − ℓ*_t. Hence:

Δ_t ≥ τ_u(2ℓ_t − τ_u‖v_t‖^2 − 2ℓ*_t)

which is the targeted lower bound and, in turn, gives (17).
Proof of Proposition 3. As for the proof of Proposition 3, it suffices, again, to follow the steps given in (Crammer et al., 2006), this time for the proof of Theorem 5. Note that for any β ≠ 0, (βτ_u − ℓ*_t/β)^2 is well-defined and nonnegative, and thus:

τ_u(2ℓ_t − τ_u‖v_t‖^2 − 2ℓ*_t) ≥ 2τ_uℓ_t − τ_u^2‖v_t‖^2 − β^2τ_u^2 − (ℓ*_t)^2/β^2

Setting β^2 = λ/(2C) and using τ_u = ℓ_t/(‖v_t‖^2 + λ/(2C)) (see (9)) gives:

τ_u(2ℓ_t − τ_u‖v_t‖^2 − 2ℓ*_t) ≥ ℓ_t^2/(‖v_t‖^2 + λ/(2C)) − (2C/λ)(ℓ*_t)^2

Using the assumption that ‖v_t‖ ≤ R for all t and rearranging terms concludes the proof.

Figure 1: Accuracy rates for PA and RPA as a function of the variance of the Gaussian noise added to the "true" embedding Φ. The observed X matrix is n = 500 × p = 1000.
The result given in Proposition 3 bounds the cumulative loss of the learning procedure when one of the parameters, either Φ or w, is fixed and the other is the optimization variable. Therefore, it does not directly capture the behavior of Algorithm 1, which alternates between the updates of Φ and w. A proper analysis of Algorithm 1 would require a refinement of Lemma 1 which, to our understanding, would be the core of a new result. This is a problem we intend to devote our energy to in the near future, as an extension of this work.

Experiments on Synthetic Data
In order to better understand and validate the RPA algorithm, we first conducted some synthetic experiments. Specifically, we simulated a high-dimensional matrix X ∈ R^{n×p}, with n data samples realizing p "words" and p ≫ n, using the following generative model: X = ZΦ. That is, each p-dimensional data point x_i was generated from a hidden lower k-dimensional z_i and an embedding matrix Φ ∈ R^{k×p}, mapping k latent concepts to the p observed words. For simplicity, we assume that: (i) there are only two concepts (i.e., k = 2), (ii) each data point realizes a single concept (i.e., each x_i is a p-dimensional indicator vector), (iii) each concept is equally represented in the data (with n/2 data points each), and (iv) each concept deterministically signals a class label, either −1 or +1. Recovering Z and predicting the z_i's labels is trivial if one is given X and the true embedding Φ, so we added Gaussian noise ε_i ∼ N(0, σ^2 I_p) to each φ_i. The resulting, observed noisy matrix is denoted Φ̃.

Figure 2: Hyper-plane learned by PA on XΦ̃^⊤ (left pane) compared to the hyper-plane and data representations learned by RPA (right pane). X is n = 200 × p = 500, and noise σ = 2. Training and test points appear as circles or plus marks, respectively. RPA's hyper-parameters are set to C = 100 and λ = .5.
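The generative setup can be reproduced along the following lines (a hedged sketch: the assignment of "words" to concepts by an even partition, and all function names, are our assumptions rather than details from the paper):

```python
import numpy as np

def make_synthetic(n=500, p=1000, k=2, sigma=2.0, seed=0):
    """Each data point x_i is a p-dimensional indicator realizing one of k
    concepts; concepts map to classes, and the observed embedding is noisy."""
    rng = np.random.default_rng(seed)
    concepts = np.arange(n) % k                      # balanced concept assignment
    # Assumption: words are evenly partitioned across concepts.
    words = rng.integers(0, p // k, size=n) + concepts * (p // k)
    X = np.zeros((n, p))
    X[np.arange(n), words] = 1.0                     # one "word" per data point
    y = np.where(concepts == 0, -1.0, 1.0)           # concept -> class label
    Phi = rng.standard_normal((k, p))                # "true" embedding
    Phi_noisy = Phi + sigma * rng.standard_normal((k, p))
    return X, y, Phi, Phi_noisy
```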
Given this setting, the goal of the RPA is to learn a set of classification parameters w ∈ R^k in the latent space and to "de-noise" the observed embedding matrix Φ̃ by exploiting the labeled data. We are interested in comparing the RPA with a regular PA that directly learns from the noisy data XΦ̃^⊤. The outcome of this comparison is plotted in Figure 1. For this experiment, we randomly split the X data according to 80/10/10 for train/dev/test, and considered increasing noise variance from 0 to 10 by increments of 0.1. Each dot in Figure 1 corresponds to the average accuracy over 10 separate, random initializations of the embedding matrix at a particular noise level. Hyper-parameters were optimized using a grid search on the dev set for both the PA and the RPA. As shown in Figure 1, the PA's accuracy quickly drops to levels that are only slightly above chance, while the RPA manages to maintain an accuracy close to .7 even with large noise. This indicates that the RPA is able to recover some of the structure in the embedding matrix. This behavior is also illustrated in Figure 2, wherein the two hidden concepts appear in yellow and green. While the standard PA learns a very bad hyper-plane, which fails to separate the two concepts, the RPA learns a much better one. Interestingly, most of the data points appear to have been projected onto the margins of the hyper-plane.

Experiments on NLP tasks
This section assesses the effectiveness of RPA on several text classification tasks.

Evaluation Datasets
We consider five different classification tasks, which are concisely summarized in Table 1.

20 Newsgroups Our first three text classification tasks from this dataset 2 consist in categorizing documents into two related sub-topics: (i) Comp.: IBM vs. Mac, (ii) Religion: atheism vs. christian, and (iii) Sports: baseball vs. hockey.
IMDB Movie Review This movie dataset 3 was introduced by (Maas et al., 2011) for sentiment analysis, and contains 50,000 reviews, divided into a balanced set of highly positive (7 stars out of 10 or more) and highly negative (4 stars or less) reviews.

TREC Question Classification
This dataset 4 (Li and Roth, 2002) involves six question types: abbreviation, description, entity, human, location, and number.

Table 2: Out-of-vocabulary rates for non-hapax words in each dataset-embedding pair.

Preprocessing and Document Vectors
All datasets were pre-processed with the Stanford tokenizer 5, except for the TREC corpus, which comes pre-tokenized. Case was left intact, unless used in conjunction with word embeddings that assume down-casing (see below). For constructing document or sentence vectors, we used a simple 0-1 bag-of-words model, simply summing the word vectors of the occurring tokens, followed by L2-normalization of the resulting vector in order to avoid document/sentence length effects. For each dataset, we restricted the vocabulary to non-hapax words (i.e., words occurring more than once). Words unknown to the embedding were mapped to zero vectors.
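This document-vector construction can be sketched as follows (our own illustration; whether word types or raw token occurrences are summed is not fully explicit in the text, so we count each word type once, matching the "0-1" qualifier):

```python
import numpy as np

def doc_vector(tokens, embeddings, dim):
    """Build a document vector: sum the embeddings of the document's word
    types (0-1 bag-of-words), map OOV words to zero, then L2-normalize."""
    vec = np.zeros(dim)
    for word in set(tokens):                 # 0-1: each word type counted once
        vec += embeddings.get(word, np.zeros(dim))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec   # avoid dividing an all-OOV doc by 0
```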

Initial Word Vector Representations
Five publicly available word vectors were used to define initial embedding matrices in the RPA. The coverage of the different embeddings wrt each dataset vocabulary is reported in Table 2.
CnW These word vectors were induced using the neural language model of (Collobert and Weston, 2008), as re-implemented by (Turian et al., 2010). 6 They were trained on a 63M-word news corpus, covering 268,810 word forms (intact case), with 50, 100, or 200 dimensions for each word.
HPCA (Lebret and Collobert, 2014) present a variant of Principal Component Analysis, Hellinger PCA, for learning spectral word vectors. These were trained over 1.62B words from Wikipedia, RCV1, and WSJ, with all words lower-cased and digits mapped to a special symbol. The vocabulary is restricted to words occurring 100 times or more (hence, a total of 178,080 words). These come in 50, 100, and 200 dimensions. 7

GloVe These global word vectors are trained using a log-bilinear regression model on aggregated global word co-occurrence statistics (Pennington et al., 2014). We use two different releases: 8 (i) GV6B, trained on Wikipedia 2014 and Gigaword 5 (amounting to 6B down-cased words and a vocabulary of 400k), with vectors in 50, 100, 200, or 300 dimensions, and (ii) GV840B, trained over 840B uncased words (a 2.2M vocabulary), with vectors of length 300.

5 nlp.stanford.edu/software/tokenizer.shtml
6 metaoptimize.com/projects/wordreprs
SkGr Finally, we use word vectors pre-trained with the skip-gram neural network model of (Mikolov et al., 2013): each word is fed to a log-linear classifier with a continuous projection layer that predicts context words within a specified window. The embeddings were trained on a 100B-word corpus of Google news data (a 3M vocabulary) and are of length 300. 9

rand In addition to these pre-trained word representations, we also use random vectors of lengths 50, 100, and 200 as a baseline embedding. Specifically, each component of these word vectors is uniformly distributed on the interval (−1, 1).
over the training data. For the alternating online procedure, we used the difference between the objective values from one iteration to the next to define the stopping criterion, with a maximum of 50 iterations. In practice, we found that the search often converged in far fewer iterations. The multi-class classifier used for the TREC dataset was obtained by training the RPA in a simple one-versus-all fashion, thus learning one embedding matrix per class. 10 For datasets with label imbalance, we set a different C parameter for each class, re-weighting it in proportion to the inverse frequency of the class.

Table 3 summarizes accuracy results for the RPA against those obtained by a PA trained with fixed pre-trained embeddings. The first thing to notice is that the RPA delivers massive accuracy improvements over the vanilla PA across datasets and embedding types and sizes, thus showing that the RPA is able to learn word representations that are better tailored to each problem. On average, accuracy gains are between 22% and 31% for CnW, HLBL, and HPCA. Sizable improvements, ranging from 8% to 18%, are also found for the better-performing GV6B, GV840B, and SkGr. Second, the RPA is able to outperform, on all five datasets, the strong baseline provided by the one-hot version of the PA trained in the original high-dimensional space, with some substantial gains on sports and trec.

10 More interesting configurations (e.g., a single embedding matrix shared across classes) are left for future work.
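The per-class re-weighting of C used above for imbalanced datasets is not spelled out in detail; one plausible reading is the following sketch (the function name and the exact normalization are our assumptions):

```python
import numpy as np

def class_weighted_C(labels, base_C=1.0):
    """Assign each class its own aggressiveness parameter, re-weighting
    base_C in proportion to the inverse frequency of the class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freqs = counts / counts.sum()
    return {c: base_C / f for c, f in zip(classes, freqs)}
```

Under this reading, rarer classes get a larger C, i.e., more aggressive updates on their (fewer) examples.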

Results
Overall, the best scores are obtained with the re-embedded SkGr vectors, which yield the best average accuracy and outperform all the other configurations on two of the five datasets (trec and imdb). GV6B (dimension 300) has the second best average scores, outperforming all the other configurations on sports. Interestingly, embeddings learned from random vectors achieve performance that is often on a par with, or higher than, that obtained with HLBL, HPCA, or CnW initializations. They actually yield the best performance for the two remaining datasets: comp and religion. On these tasks, the RPA does not seem to benefit from the information contained in the pre-trained embeddings, or their coverage is simply not good enough.
For both PA and RPA, performance appears to be positively correlated with embedding coverage: embeddings with lower OOV rates generally perform better than those with more missing words. The correlation is only partial, though, since GV840B does not yield gains compared to GV6B and SkGr, despite its better word coverage. Also, SkGr largely outperforms HPCA, although they have similar OOV rates. As for dimensions, embeddings of length 100 and more perform best, although they involve estimating a larger number of parameters, which is a priori difficult given the small sizes of the datasets. Compared with previous work on imdb, the RPA performance is substantially better than that reported by (Labutov and Lipson, 2013), whose best re-embedding score is 81.15 with CnW. By comparison, our best score with CnW is 86.64, and 88.52 with SkGr, thus closing the gap with (Maas et al., 2011), who report an accuracy of 88.89 using a much more computationally intensive approach specifically tailored to sentiment analysis. Interestingly, (Labutov and Lipson, 2013) show that accuracy can be further improved by concatenating re-embedded and 1-hot representations. This option is also available to us, but we leave it to future work.
Finally, Table 4 reports accuracy results for the RPA against the PA when both algorithms are trained in a genuine online mode, that is, with a single pass over the data. As expected, performance drops for both the RPA and the PA, but the decreases are comparatively much smaller for the RPA (from 2% to 3%) than for the PA (from 0.4% to 14%).

Conclusion and Future Work
In this paper, we have proposed a new scalable algorithm for learning word representations that are specifically tailored to a classification objective. This algorithm generalizes the well-known Passive-Aggressive algorithm, and we showed how to extend the regret-bound results of the PA to the RPA when either the weight vector or the embedding matrix is fixed. In addition, we have provided synthetic and NLP experiments demonstrating the good classification performance of the RPA.
In future work, we would first like to achieve a more complete analysis of the RPA algorithm when w and Φ both get updated. Also, we intend to investigate potential exact methods for solving bi-convex minimization (Floudas and Visweswaran, 1990), as well as to develop a stochastic version of the RPA, thus foregoing running the inner alternating search to convergence. More empirical perspectives include extending the RPA to linguistic structured prediction tasks, better handling of unknown words, and a deeper intrinsic and statistical evaluation of the embeddings learned by the RPA.