Deep Generative Model for Joint Alignment and Word Representation

This work exploits translation data as a source of semantically relevant learning signal for models of word representation. In particular, we exploit equivalence through translation as a form of distributed context and jointly learn how to embed and align with a deep generative model. Our EmbedAlign model embeds words in their complete observed context and learns by marginalisation of latent lexical alignments. Besides, it embeds words as posterior probability densities, rather than point estimates, which allows us to compare words in context using a measure of overlap between distributions (e.g. KL divergence). We investigate our model's performance on a range of lexical semantics tasks achieving competitive results on several standard benchmarks including natural language inference, paraphrasing, and text similarity.


Introduction
Natural language processing applications often count on the availability of word representations trained on large textual data as a means to alleviate problems such as data sparsity and lack of linguistic resources (Collobert et al., 2011; Socher et al., 2011; Tu et al., 2017; Bowman et al., 2015).
Traditional approaches to inducing word representations circumvent the need for explicit semantic annotation by capitalising on some form of indirect semantic supervision. A typical example is to fit a binary classifier to detect whether or not a target word is likely to co-occur with neighbouring words (Mikolov et al., 2013). If the binary classifier represents a word as a continuous vector, that vector will be trained to be discriminative of the contexts it co-occurs with, and thus words in similar contexts will have similar representations.
The underlying assumption is that context (e.g. neighbouring words) stands for the meaning of the target word (Harris, 1954; Firth, 1957). The success of this distributional hypothesis hinges on the definition of context, and different models are based on different definitions. Importantly, the nature of the context determines the range of linguistic properties the representations may capture (Levy and Goldberg, 2014b). For example, Levy and Goldberg (2014a) propose to use syntactic context derived from dependency parses. They show that their representations are much more discriminative of syntactic function than models based on immediate neighbourhood (Mikolov et al., 2013).
In this work, we take lexical translation as indirect semantic supervision (Diab and Resnik, 2002). Effectively, we make two assumptions. First, that every word has a foreign equivalent that stands for its meaning. Second, that we can find this equivalent in translation data through lexical alignments. For that, we induce both a latent mapping between words in a bilingual sentence pair and distributions over latent word representations.
To summarise our contributions:
• we model a joint distribution over sentence pairs that generates data from latent word representations and latent lexical alignments;
• we embed words in context mining positive correlations from translation data;
• we find that foreign observations are necessary for generative training, but test time predictions can be made monolingually;
• we apply our model to a range of semantic natural language processing tasks showing its usefulness.
Figure 1: A sequence x_1^m is generated conditioned on a sequence of random embeddings z_1^m; generating the foreign sequence y_1^n further requires latent lexical alignments a_1^n.

EMBEDALIGN
In a nutshell, we model a distribution over pairs of sentences expressed in two languages, namely, a language of interest L1, and an auxiliary language L2 which our model uses to mine some learning signal. Our model, EMBEDALIGN, is governed by a simple generative story:
1. sample a length m for a sentence in L1 and a length n for a sentence in L2;
2. generate a sequence z_1, ..., z_m of d-dimensional random embeddings by sampling independently from a standard Gaussian prior;
3. generate a word observation x_i in the vocabulary of L1 conditioned on the random embedding z_i;
4. generate a sequence a_1, ..., a_n of n random alignments, where each a_j links a position a_j in x_1^m to a position j in the L2 sentence;
5. finally, generate an observation y_j in the vocabulary of L2 conditioned on the random embedding z_{a_j} that stands for x_{a_j}.
The model is parameterised by neural networks and parameters are estimated to maximise a lower bound on the log-likelihood of joint observations. In the following, we present the model formally (§2.1), discuss efficient training (§2.2), and describe concrete architectures (§2.3).
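The generative story above can be sketched as ancestral sampling. This is a minimal illustration rather than the implementation used in this work: the helper names (`softmax`, `affine`, `sample_pair`), the toy dimensions, the affine-plus-softmax output layers, and the uniform prior over alignment positions are all assumptions made for the example, and positions are 0-based in the code.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, bias, z):
    """One score per vocabulary entry: weights[k] . z + bias[k]."""
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]

def sample_pair(m, n, d, Wx, bx, Wy, by, rng):
    """Ancestral sampling for steps 2-5 (the lengths m and n are given)."""
    # Step 2: one d-dimensional standard Gaussian embedding per L1 position.
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    # Step 3: each L1 word is drawn from a categorical predicted from z_i.
    x = [rng.choices(range(len(bx)), weights=softmax(affine(Wx, bx, z_i)))[0]
         for z_i in z]
    # Step 4: alignments; a uniform prior over L1 positions is assumed here.
    a = [rng.randrange(m) for _ in range(n)]
    # Step 5: each L2 word is drawn from the embedding it aligns to.
    y = [rng.choices(range(len(by)), weights=softmax(affine(Wy, by, z[a_j])))[0]
         for a_j in a]
    return z, x, a, y

# Toy usage with hypothetical sizes: d = 4, |V_L1| = 6, |V_L2| = 5.
rng = random.Random(0)
d, vx, vy = 4, 6, 5
Wx = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vx)]
Wy = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vy)]
z, x, a, y = sample_pair(3, 4, d, Wx, [0.0] * vx, Wy, [0.0] * vy, rng)
```

Note that the L2 words never see the L1 words directly: they depend on x only through the embeddings z, which is what lets translation act as a learning signal for the representations.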

Probabilistic model
Notation We use block capitals (e.g. X) for random variables, lowercase letters (e.g. x) for assignments, and the shorthand X_1^m for a sequence X_1, ..., X_m. Boldface letters are reserved for deterministic vectors (e.g. v) and matrices (e.g. W).
Finally, E[f(Z); α] denotes the expected value of f(z) under a density q(z|α).
We model a joint distribution over bilingual parallel data, i.e., L1-L2 sentence pairs. An observation is a pair of random sequences (X_1^m, Y_1^n), where a random variable X (Y) takes on values in the vocabulary of L1 (L2). For ease of exposition, the length m (n) of each sequence is assumed observed throughout. The L1 sentence is generated one word at a time from a random sequence of latent embeddings Z_1^m, each Z taking on values in R^d. The L2 sentence is generated one word at a time given a random sequence of latent alignments A_1^n, where A_j ∈ {1, ..., m} is the position in the L1 sentence to which y_j aligns. For i ∈ {1, ..., m} and j ∈ {1, ..., n}, the generative story is formalised in Equation (1), and Figure 1 is a graphical depiction of our model. We map from latent embeddings to categorical distributions over either vocabulary using a neural network whose parameters are deterministic and collectively denoted by θ (the generative parameters). The marginal likelihood of a sentence pair is shown in Equation (2).
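Read against the generative story of §2, the model and its marginal likelihood can be written out as follows. This is a sketch under stated assumptions: the symbols f and g for the two network mappings and the uniform prior over alignment positions are introduced here for illustration, and the numbering is only meant to mirror the references to Equations (1) and (2).

```latex
\begin{align}
Z_i &\sim \mathcal{N}(0, I) \tag{1a} \\
X_i \mid z_i &\sim \operatorname{Cat}\big(f(z_i; \theta)\big) \tag{1b} \\
A_j \mid m &\sim \mathcal{U}(1/m) \tag{1c} \\
Y_j \mid z_1^m, a_j &\sim \operatorname{Cat}\big(g(z_{a_j}; \theta)\big) \tag{1d}
\end{align}

\begin{equation}
P_\theta(x_1^m, y_1^n \mid m, n)
  = \int p(z_1^m) \prod_{i=1}^{m} P_\theta(x_i \mid z_i)
    \prod_{j=1}^{n} \sum_{a_j=1}^{m} P(a_j \mid m)\, P_\theta(y_j \mid z_{a_j})
    \, \mathrm{d}z_1^m \tag{2}
\end{equation}
```

The sum over a_j sits inside the product over j because alignments are independent given the embeddings, which is why marginalising them costs only m × n operations per sentence pair.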
Due to the conditional independences of our model, it is trivial to marginalise lexical alignments for any given latent embeddings z_1^m, but marginalising the embeddings themselves is intractable. Thus, we employ amortised mean field variational inference using the inference model q_φ(z_1^m | x_1^m) = ∏_{i=1}^m N(z_i | u_i, diag(s_i ⊙ s_i)) (3), where each factor is a diagonal Gaussian. We map from x_1^m to a sequence u_1^m of independent posterior mean (or location) vectors, where u_i = µ(h_i; φ), as well as a sequence s_1^m of independent standard deviation (or scale) vectors, where s_i = σ(h_i; φ), and h_1^m = enc(x_1^m; φ) is a deterministic encoding of the L1 sequence (we discuss concrete architectures in §2.3). All mappings are realised by neural networks whose parameters are collectively denoted by φ (the variational parameters). Note that we choose to approximate the posterior without conditioning on y_1^n. This allows us to use the inference model for monolingual prediction in the absence of L2 data.
Variational φ and generative θ parameters are jointly point-estimated to attain a local optimum of the evidence lower bound (ELBO; Jordan et al., 1999), given in Equation (4). The ELBO consists of an expected log-likelihood term for each L1 word, an expected log-likelihood term for each L2 word (with alignments marginalised), and one KL term per latent embedding. The variational family is location-scale, thus we can rely on stochastic optimisation (Robbins and Monro, 1951) and automatic differentiation (Baydin et al., 2015) with reparameterised gradient estimates (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014). Moreover, because the Gaussian density is a member of the exponential family, the KL terms in (4) are available in closed form (Kingma and Welling, 2014, Appendix B).
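To make the pieces of the objective concrete, the following sketch computes a single-sample estimate of the ELBO for one sentence pair: a reparameterised sample of each embedding, the L1 likelihood, the L2 likelihood with alignments marginalised, and the closed-form KL to the standard Gaussian prior. All function names, the affine-plus-softmax output layers, and the uniform alignment prior 1/m are assumptions made for the example; gradients, which a real implementation would obtain by automatic differentiation, are omitted.

```python
import math
import random

def log_softmax(scores):
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def affine(weights, bias, z):
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]

def kl_to_standard_normal(u, s):
    """Closed-form KL( N(u, diag(s^2)) || N(0, I) )."""
    return 0.5 * sum(si * si + ui * ui - 1.0 - 2.0 * math.log(si)
                     for ui, si in zip(u, s))

def elbo_estimate(x, y, locs, scales, Wx, bx, Wy, by, rng):
    """Single-sample ELBO estimate for one sentence pair (x, y)."""
    m = len(x)
    # Reparameterisation: z_i = u_i + s_i * eps, with eps ~ N(0, I).
    z = [[u + s * rng.gauss(0.0, 1.0) for u, s in zip(u_i, s_i)]
         for u_i, s_i in zip(locs, scales)]
    # L1 term: sum_i log P(x_i | z_i).
    ll_x = sum(log_softmax(affine(Wx, bx, z_i))[x_i] for z_i, x_i in zip(z, x))
    # L2 term: sum_j log sum_{a_j} (1/m) P(y_j | z_{a_j}); uniform prior assumed.
    log_py = [log_softmax(affine(Wy, by, z_i)) for z_i in z]
    ll_y = sum(logsumexp([lp[y_j] for lp in log_py]) - math.log(m) for y_j in y)
    # KL term: one closed-form KL per latent embedding.
    kl = sum(kl_to_standard_normal(u_i, s_i) for u_i, s_i in zip(locs, scales))
    return ll_x + ll_y - kl

# Toy usage with hypothetical sizes and hand-picked posterior parameters.
rng = random.Random(1)
d, vx, vy = 3, 5, 4
Wx = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vx)]
Wy = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vy)]
locs = [[0.0] * d, [0.5] * d]      # u_1, u_2 for a two-word L1 sentence
scales = [[1.0] * d, [0.8] * d]    # s_1, s_2
value = elbo_estimate([2, 4], [1, 0, 3], locs, scales,
                      Wx, [0.0] * vx, Wy, [0.0] * vy, rng)
```

Because the posterior parameters depend only on the L1 side, the same `locs` and `scales` can be computed at test time without any L2 sentence.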

Efficient training
The likelihood terms in the ELBO (4) require evaluating two softmax layers over rather large vocabularies. This makes training prohibitively slow and calls for efficient approximation. We employ an approximation proposed by Botev et al. (2017) termed complementary sum sampling (CSS), which we review in this section.
Consider the likelihood term log P(X = x*|z) that scores an observation x* given a sampled embedding z; we write x* for the particular observation to distinguish it from an arbitrary event x ∈ X in the support. The exact class probability requires a normalisation over the complete support. CSS works by splitting the support into two sets, a set C that is explicitly summed over and must include the positive class x*, and another set N that is a subset of the complement set X \ C. We obtain an estimate for the normaliser (Equation (6)) by importance- or Bernoulli-sampling from the support using a proposal distribution Q(X): the terms in C are summed exactly, while each sampled term in N is weighted by κ(x), which corrects for bias as N tends to the entire complement set. In this paper, we design C and N per training mini-batch: we take C to consist of all unique words in a mini-batch of training samples and N to consist of 10^3 negative classes uniformly sampled from the complement set X \ C, in which case κ(x) = 10^{-3} |X \ C|.
CSS makes it particularly easy to approximate likelihood terms such as those with respect to L2 in Equation (4). Because those terms depend on a marginalisation over alignments, an approximation must give support to all words in the sequence y_1^n. With CSS this is extremely simple: we just need to make sure all unique words in y_1^n are in the set C, which our mini-batch procedure does guarantee.
Botev et al. (2017) show that CSS is rather stable and superior to the most popular softmax approximations. Besides being simple to implement, CSS also addresses a few problems with other approximations. To name a few: unlike importance sampling approximations, CSS converges to the exact softmax with bounded computation (it takes as many samples as there are classes). Unlike hierarchical softmax, CSS only affects training, that is, at test time we simply use the entire support instead of the approximation.
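The mini-batch construction of C and N described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `css_log_normaliser` and its arguments are hypothetical, and for simplicity the sketch receives scores for every class, whereas the point of CSS in training is to compute scores only for the classes in C and N.

```python
import math
import random

def css_log_normaliser(scores, batch_classes, num_negatives, rng):
    """Estimate log sum_x exp(scores[x]) by complementary sum sampling.

    C holds the unique classes of the mini-batch and is summed exactly;
    N is sampled uniformly without replacement from the complement, and
    its terms are reweighted by kappa = |complement| / |N|.
    """
    C = sorted(set(batch_classes))
    in_C = set(C)
    complement = [k for k in range(len(scores)) if k not in in_C]
    k = min(num_negatives, len(complement))
    N = rng.sample(complement, k)
    kappa = len(complement) / k if k else 0.0
    exact_part = sum(math.exp(scores[c]) for c in C)
    sampled_part = kappa * sum(math.exp(scores[c]) for c in N)
    return math.log(exact_part + sampled_part)

# Toy usage: 50 classes, 3 unique classes in the batch, 10 negatives.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(50)]
exact = math.log(sum(math.exp(s) for s in scores))
approx = css_log_normaliser(scores, [3, 7, 7, 11], num_negatives=10, rng=rng)
# With N equal to the whole complement (47 classes), kappa is 1 and the
# estimate coincides with the exact normaliser.
full = css_log_normaliser(scores, [3, 7, 7, 11], num_negatives=47, rng=rng)
```

With 10^3 negatives, the weight in the sketch reduces to the κ(x) = 10^{-3} |X \ C| stated in the text.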
Without a softmax approximation, inference for our model would take time proportional to m × v_x + m × v_y + m × n, where v_x (v_y) corresponds to the size of the vocabulary of L1 (L2). The first term (m × v_x) corresponds to projecting from m latent embeddings to m categorical distributions over the vocabulary of L1. The second term (m × v_y) corresponds to projecting the same m latent embeddings to m categorical distributions over the vocabulary of L2. Finally, the third term (m × n) is due to marginalisation of alignments.
Note, however, that with the CSS approximation we drop the dependency on vocabulary sizes (as the combined size of C and N is an independent constant). Moreover, if inference is performed on GPU, the squared term (m × n ≈ m^2) is amortised due to parallelism. Thus, while training our model is somewhat slower than training monolingual models of word representation, which typically run in O(m), it is not impractically slower.

Architectures
Here we present the neural network architectures that parameterise the different generative and variational components of §2.1. Refer to Appendix B for an illustration.
Generative model We have two generative components, namely, a categorical distribution over the vocabulary of L1 and another over the vocabulary of L2. We predict the parameter (event probabilities) of each distribution with an affine transformation of a latent embedding followed by the softmax nonlinearity to ensure normalisation (Equations (7a) and (7b) for L1 and L2, respectively); the output of each transformation has v_x (v_y) dimensions, where v_x (v_y) is the size of the vocabulary of L1 (L2). With the approximation of §2.2, we replace the L1 softmax layer (7a) by exp(z_i^T c_x + b_x) normalised by the CSS estimate (6) at training, and similarly for the L2 softmax layer (7b). In that case, we have parameters c_x, c_y ∈ R^d (deterministic embeddings for x and y, respectively) as well as bias terms b_x, b_y ∈ R.
Inference model We predict approximate posterior parameters using two independent affine transformations of h_i, each with its own projection matrix and bias vector: one predicts the location u_i directly, and the other is followed by a softplus nonlinearity to predict the scale s_i. The softplus nonlinearity ensures that standard deviations are non-negative. To obtain the deterministic encoding h_1^m, we employ two different architectures: (1) a bag-of-words (BOW) encoder, where h_i is a deterministic projection of x_i onto R^{d_x}; and (2) a bidirectional (BIRNN) encoder, where h_i is the element-wise sum of two LSTM hidden states (ith step) that process the sequence in opposite directions. We use 128 units for deterministic embeddings, and 100 units for LSTMs (Hochreiter and Schmidhuber, 1997) and latent representations (i.e. d = 100).
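As a rough illustration of the inference network with the BOW encoder, the sketch below maps each L1 word to the location and scale of its diagonal Gaussian posterior. The names (`bow_posterior`, `E`, `M_u`, `c_u`, `M_s`, `c_s`) and the toy sizes are assumptions for the example; the BIRNN variant would differ only in how h_i is computed.

```python
import math

def softplus(v):
    """Numerically stable log(1 + exp(v)); never negative."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def affine(weights, bias, h):
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]

def bow_posterior(word_ids, E, M_u, c_u, M_s, c_s):
    """Return one (location, scale) pair per L1 position."""
    params = []
    for w in word_ids:
        h = E[w]                                         # BOW: h_i depends on x_i alone
        u = affine(M_u, c_u, h)                          # location u_i
        s = [softplus(v) for v in affine(M_s, c_s, h)]   # scale s_i
        params.append((u, s))
    return params

# Toy usage: vocabulary of 3 words, d_x = 2, d = 2 (hypothetical sizes).
E = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
M_u, c_u = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
M_s, c_s = [[0.5, 0.5], [-0.5, 0.5]], [0.0, 0.0]
posteriors = bow_posterior([2, 0, 1], E, M_u, c_u, M_s, c_s)
```

With the BOW encoder every occurrence of a word type receives the same posterior, so sensitivity to context comes only from the recurrent encoder.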

Experiments
We start the section by describing the data used to estimate our model's parameters as well as details about the optimiser. The remainder of the section presents results on various benchmarks.
Training data We train our model on bilingual parallel data. In particular, we use parliament proceedings (Europarl-v7) (Koehn, 2005) from two language pairs: English-French and English-German. We employed very minimal preprocessing, namely, tokenisation and lowercasing using scripts from MOSES (Koehn et al., 2007), and have discarded sentences longer than 50 tokens. Table 1 lists more information about the training data, including the English-French Giga web corpus (Bojar et al., 2014).
Optimiser For all architectures, we use the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 10^{-3}. Except where explicitly indicated, we
• train our models for 30 epochs using mini-batches of 100 sentence pairs;
• use validation alignment error rate for model selection;
• train every model 10 times with random Glorot initialisation (Glorot and Bengio, 2010) and report mean and standard deviation;
• anneal the KL terms using the following schedule: we scale them by a scalar α that grows from 0 to 1 with additive steps of size 10^{-3} every 500 updates.
This means that at the beginning of the training, we allow the model to overfit to the likelihood terms, but towards the end we are optimising the true ELBO (Bowman et al., 2016).
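The annealing schedule can be written as a one-line function of the update counter. This is a sketch: the function name `kl_weight` is hypothetical, and whether the first increment happens at update 0 or at update 500 is an assumption (the latter is used here).

```python
def kl_weight(update, step_size=1e-3, every=500):
    """KL weight alpha after `update` parameter updates.

    alpha starts at 0, rises by `step_size` every `every` updates,
    and is capped at 1.
    """
    return min(1.0, step_size * (update // every))

# The annealed objective would then be: likelihood terms - alpha * KL terms.
alphas = [kl_weight(t) for t in (0, 499, 500, 250_000, 10_000_000)]
```

Under these settings α reaches 1 after 1,000 increments, that is, after 500,000 updates; from that point on the objective is the true ELBO.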
It is also important to highlight that we do not employ regularisation techniques (such as batch normalisation, dropout, or L2 penalty), as they did not seem to yield consistent improvements.

Word alignment
Since our model leverages learning signal from parallel data by marginalising latent lexical alignments, we use alignment error rate (AER) to double check whether the model learns sensible word correspondences. Intrinsic assessment of word alignment quality requires manual annotation. For English-French, we use the NAACL English-French hand-aligned data (37 sentence pairs for validation and 447 for test) (Mihalcea and Pedersen, 2003). For English-German, we use the data by Padó and Lapata (2006).
We start by analysing validation results and selecting amongst a few variants of EMBEDALIGN. We investigate the use of annealing and the use of a bidirectional encoder in the variational approximation. Table 2 (3) lists ↓AER for EN-FR (EN-DE) as well as accuracy of word prediction. It is clear that both annealing (systems decorated with subscript α) and bidirectional representations improve the results across the board. In the rest of the paper we still investigate whether or not recurrent encoders help, but we always report results based on annealing.
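For reference, alignment error rate can be computed from predicted links and gold links as sketched below. The text does not spell out the metric, so the usual definition over sure (S) and possible (P) gold links is assumed here, AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|); the function name and the representation of links as (L1 position, L2 position) pairs are illustrative choices.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where S is a subset of P."""
    A, S = set(predicted), set(sure)
    P = set(possible) | S  # sure links also count as possible
    if not A and not S:
        return 0.0
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

# Toy usage: links are (L1 position, L2 position) pairs.
predicted = {(1, 1), (2, 3), (3, 2)}
sure = {(1, 1), (2, 3)}
possible = {(3, 3)}
aer = alignment_error_rate(predicted, sure, possible)
```

Lower is better: a prediction that reproduces exactly the sure links obtains an AER of 0.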
In order to establish baselines for our models we report IBM models 1 and 2 (Brown et al., 1993). In a nutshell, IBM models 1 and 2 both estimate the conditional P(y_j | x_1^m) = ∑_{a_j=1}^m P(a_j | m) P(y_j | x_{a_j}) by marginalisation of latent lexical alignments. The only difference between the two models is the prior over alignments, which is uniform for IBM1 and categorical for IBM2. An important difference between IBM models and EMBEDALIGN concerns the lexical distribution. IBM models are parameterised with independent categorical parameters, while our model instead is parameterised by a neural network. IBM models condition on a single categorical event x_{a_j}, namely, the word aligned to. Our model instead conditions on the latent embedding z_{a_j} that stands for the word aligned to.
In order to establish even stronger conditional alignment models, we embed the conditioning words and replace IBM1's independent parameters by a neural network (single hidden layer MLP). We call this model a neural IBM1 (or NIBM for short). Note that in an IBM model, the sequence x_1^m is never modelled, therefore we can condition on it without restrictions. For that reason, we also experiment with a bidirectional LSTM encoder and condition lexical distributions on its hidden states.

Table 4 shows AER for test predictions. First, observe that neural models outperform classic IBM1 by far; some of them even approach IBM2's performance. Next, observe that bidirectional encodings make NIBM much stronger at inducing good word-to-word correspondences. EMBEDALIGN cannot catch up with NIBM, but that is not necessarily surprising. Note that NIBM is a conditional model, thus it can use all of its capacity to better explain L2 data. EMBEDALIGN, on the other hand, has to find a compromise between generating both streams of the data. To make that point a bit more obvious, Table 5 (6) lists accuracy of word prediction for EN-FR (EN-DE). Note that, without sacrificing L2 accuracy, and sometimes even improving it, EMBEDALIGN achieves very high L1 accuracy. This still does not imply that induced representations have captured aspects of lexical semantics such as word senses. All this means is that we have induced features that are jointly good at reconstructing both streams of the data one word at a time. Of course, it is tempting to conclude that our models must be capturing some useful generalisations. To test that, the next sections investigate a range of semantic NLP tasks.

Lexical substitution task
The English lexical substitution task (LST) consists in selecting a substitute word for a target word in context (McCarthy and Navigli, 2009). In the most traditional variant of the task, systems are presented with a list of potential candidates and this list must be sorted by relatedness.
Dataset The LST dataset includes 201 target words present in 10 sentences/contexts each, along with a manually annotated list of potential replacements. The data are split into 300 instances for validation and 1,710 for test. Systems are evaluated by comparing the predicted ranking to the manual one in terms of generalised average precision (GAP) (Melamud et al., 2015).
Prediction We use EMBEDALIGN to encode each candidate (in context) as a posterior Gaussian density. Note that this task dispenses with inferences about L2. Each candidate is compared to the target word in context through a measure of overlap between their inferred densities, for which we take KL divergence. We then rank candidates using this measure.
Table 7 lists GAP scores for variants of EMBEDALIGN (bottom section) as well as some baselines and other established methods (top section). For comparison, we also compute GAP by sorting candidates in terms of cosine similarity, in which case we take the Gaussian mean as a summary of the density. The top section of the table contains systems reported by Melamud et al. (2015) (RANDOM and SKIPGRAM) and by Brazinskas et al. (2017) (BSG). Note that both SKIPGRAM (Mikolov et al., 2013) and BSG were trained on the very large ukWaC English corpus (Ferraresi et al., 2008). SKIPGRAM is known to perform remarkably well regardless of its apparent insensitivity to context (in terms of design). BSG is a close relative of our model which gives SKIPGRAM a Bayesian treatment (also by means of amortised variational inference) and is by design sensitive to context in a manner similar to EMBEDALIGN, that is, through its inferred posteriors.
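Ranking candidates by KL divergence between diagonal Gaussian posteriors can be sketched as follows, alongside the cosine alternative on posterior means. The function names are hypothetical, and the direction of the divergence, KL(target || candidate), is an assumption made for the example since the text does not fix it.

```python
import math

def kl_diag_gaussians(u1, s1, u2, s2):
    """Closed-form KL( N(u1, diag(s1^2)) || N(u2, diag(s2^2)) )."""
    return sum(
        math.log(b / a) + (a * a + (m1 - m2) ** 2) / (2.0 * b * b) - 0.5
        for m1, a, m2, b in zip(u1, s1, u2, s2)
    )

def cosine(u1, u2):
    """Cosine similarity between two mean vectors."""
    dot = sum(p * q for p, q in zip(u1, u2))
    return dot / (math.sqrt(sum(p * p for p in u1)) * math.sqrt(sum(q * q for q in u2)))

def rank_by_kl(target, candidates):
    """Candidate names sorted by increasing KL(target || candidate)."""
    u_t, s_t = target
    return sorted(candidates,
                  key=lambda name: kl_diag_gaussians(u_t, s_t, *candidates[name]))

# Toy usage: each entry is a (location, scale) pair inferred in context.
target = ([0.0, 0.0], [1.0, 1.0])
candidates = {
    "far": ([3.0, 3.0], [1.0, 1.0]),
    "near": ([0.1, 0.0], [1.0, 1.0]),
}
ranking = rank_by_kl(target, candidates)
```

Unlike cosine, the divergence is asymmetric and also takes the scales into account, so two candidates with the same mean can still be ranked differently.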
Our first observation is that cosine seems to outperform KL slightly. Others have shown that KL can be used to predict directional entailment (Vilnis and McCallum, 2014; Brazinskas et al., 2017); since LST is closer to paraphrasing than to entailment, directionality may be a distractor, but we leave it as a rather speculative point. One additional point worth highlighting concerns the middle section of Table 7: EN BoW and EN BiRNN show what happens when we do not give EMBEDALIGN L2 supervision at training. That is, imagine the model of Figure 1 without the bottom plate. In that case, the model representations overfit for L1 word-by-word prediction. Without the need to predict any notion of context (monolingual or otherwise), the representations drift away from semantic-driven generalisations and fail at lexical substitution.
Table 8: English sentence evaluation results: the last four rows correspond to the mean of 10 runs with EMBEDALIGN models. All models, but W2VEC, employ bidirectional encoders.

Sentence Evaluation
Conneau et al. (2017) developed a framework to evaluate unsupervised sentence level representations trained on large amounts of data on a range of supervised NLP tasks. We assess our induced representations using their framework on the following benchmarks evaluated on classification ↑accuracy (MRPC is further evaluated on ↑F1):
• MR: classification of positive or negative movie reviews;
• SST: fine-grained labelling of movie reviews from the Stanford sentiment treebank (SST);
• TREC: classification of questions into k classes;
• CR: classification of positive or negative product reviews;
• SUBJ: classification of a sentence into subjective or objective;
• MPQA: classification of opinion polarity;
• SICK-E: textual entailment classification;
• MRPC: paraphrase identification in the Microsoft paraphrase corpus;
as well as the following benchmarks evaluated on the indicated correlation metric(s):
• SICK-R: semantic relatedness between two sentences (↑Pearson);
• STS-14: semantic textual similarity (↑Pearson/Spearman).
Prediction We use EMBEDALIGN to annotate every word in the training set of the benchmarks above with the posterior mean embedding in context. We then average embeddings in a sentence and give that as features to a logistic regression classifier (scikit-learn, http://scikit-learn.org/stable/) trained with 5-fold cross validation. For comparison, we report a SKIPGRAM model (here indicated as W2VEC) as well as a model that uses the encoder of a neural machine translation system (NMT) trained on English-French Europarl data. In both cases, we report results by Conneau et al. (2017).
Table 8 shows the results for all benchmarks; in Appendix A we provide bar plots marked with error bars (2 standard deviations). We report EMBEDALIGN trained on either EN-FR or EN-DE. The last line (COMBO) shows what happens if we train logistic regression on the concatenation of embeddings inferred by both EMBEDALIGN models, that is, EN-FR and EN-DE. Note that these two systems perform sometimes better and sometimes worse depending on the benchmark. There is no clear pattern, but differences may well come from some qualitative difference in the induced latent space. Different languages are known to realise lexical ambiguities differently, thus representations induced towards different languages are likely to capture different generalisations. As COMBO results show, the representations induced from different corpora are somewhat complementary. That same observation has guided paraphrasing models based on pivoting (Bannard and Callison-Burch, 2005). Once more we report a monolingual variant of EMBEDALIGN (indicated by EN) in an attempt to illustrate how crucial the translation signal is.
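The feature construction for sentence evaluation amounts to averaging posterior means and, for COMBO, concatenating the averages obtained from the two models. A minimal sketch, with hypothetical function names and toy vectors standing in for inferred posterior means, is given below; the classifier itself is left out.

```python
def sentence_features(posterior_means):
    """Average the per-word posterior mean vectors of one sentence."""
    n = len(posterior_means)
    return [sum(column) / n for column in zip(*posterior_means)]

def combo_features(means_en_fr, means_en_de):
    """COMBO-style features: concatenate the averages from two models."""
    return sentence_features(means_en_fr) + sentence_features(means_en_de)

# Toy usage: a two-word sentence with 2-dimensional posterior means
# from two hypothetical models (EN-FR and EN-DE).
en_fr = [[1.0, 2.0], [3.0, 4.0]]
en_de = [[0.0, 1.0], [2.0, 1.0]]
features = combo_features(en_fr, en_de)
```

The resulting vectors would then be passed as input features to the logistic regression classifier.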

Word similarity
Word similarity benchmarks are composed of word pairs which are manually ranked out of context. For completeness, we also tried evaluating our embeddings in such benchmarks despite our work being focussed on applications where context matters.
Prediction To assign an embedding for a word type, we infer Gaussian posteriors for all training instances of that type in context and aggregate the posterior means through an average (effectively collapsing all instances).
To cover the vocabulary of the typical benchmark, we have to use a much larger bilingual collection than Europarl. Based on the results of §3.1, we decided to proceed with English-French only; recall that models based on that pair performed better in terms of AER. Results in this section are based on EMBEDALIGN (with bidirectional variational encoder) trained on the Giga web corpus (see Table 1 for statistics). Due to the scale of the experiment, we report on a single run.
We trained on Giga with the same hyperparameters that we used on Europarl, but for 3 epochs instead of 30 (with this dataset an epoch amounts to 183,000 updates). Again, we performed model selection on AER. Table 9 shows the results for several datasets using the framework of Faruqui and Dyer (2014a). Note that EMBEDALIGN was designed to make use of context information, thus this evaluation setup is a bit unnatural for our model. Still, it outperforms SKIPGRAM on 5 out of 13 benchmarks, in particular, on SIMLEX-999, whose relevance has been argued by Upadhyay et al. (2016). We also remark that this model achieves 0.25 test AER and 45.16 test GAP on lexical substitution, a considerable improvement compared to models trained on Europarl and reported in Tables 4 (AER) and 7 (GAP).

Related work
Our model is inspired by lexical alignment models such as IBM1 (Brown et al., 1993); however, we generate words y_1^n from a latent vector representation z_1^m of x_1^m, rather than directly from the observation x_1^m. IBM1 takes L1 sequences as conditioning context and does not model their distribution. Instead, we propose a joint model, where L1 sentences are generated from latent embeddings.
Table 9: Evaluation of English word embeddings out of context in terms of Spearman's rank correlation coefficient (↑). The first column is from Faruqui and Dyer (2014a).
There is a vast literature on exploiting multilingual context to strengthen the notion of synonymy captured by monolingual models. Roughly, the literature splits into two groups, namely, approaches that derive additional features and/or training objectives based on pre-trained alignments (Klementiev et al., 2012; Faruqui and Dyer, 2014b; Luong et al., 2015; Šuster et al., 2016), and approaches that promote a joint embedding space by working with sentence level representations that dispense with explicit alignments (Hermann and Blunsom, 2014; AP et al., 2014; Gouws et al., 2015; Hill et al., 2014).
The work of Kočiský et al. (2014) is closer to ours in that they also learn embeddings by marginalising alignments; however, their model is conditional, much like IBM models, and their embeddings are not part of the probabilistic model, but rather part of the architecture design. The joint formulation allows our latent embeddings to harvest learning signal from L2 while still being driven by the learning signal from L1. In a conditional model, the representations can become specific to alignment, deviating from the purpose of well representing the original language. In §3 we show substantial evidence that our model performs better when using both learning signals.
Vilnis and McCallum (2014) first propose to map words into Gaussian densities instead of point estimates for better word representation. For example, a distribution can capture asymmetric relations that a point estimate cannot. Brazinskas et al. (2017) recast the skip-gram model as a conditional variational auto-encoder. They induce a Gaussian density for each occurrence of a word in context, and for that their model is the closest to ours, but training is based on prediction of neighbouring words. Unlike our model, the Bayesian skip-gram still requires generation of negative samples for discriminative training. It is perhaps worth highlighting that, in principle, both strategies can be combined.

Discussion
We have presented a generative model of word representation that learns from positive correlations implicitly expressed in translation data. In order to make these correlations surface, we induce and marginalise latent lexical alignments.
Embedding models such as CBOW and skip-gram (Mikolov et al., 2013) are, essentially, supervised classifiers. This means they depend on somewhat artificial strategies to derive labelled data from monolingual corpora: words far from the central word still have co-occurred with it even though they are taken as negative evidence. Training our proposed model does not require a heuristic notion of negative training data. However, our model is also based on a somewhat artificial assumption, namely, that every L1 word has an L2 equivalent: in reality, L1 words do not necessarily have an L2 equivalent and, even when they do, this equivalent need not be realised as a single word.
We have shown with extensive experiments that our model can induce representations useful to several tasks including but not limited to alignment (the task it most obviously relates to). We observed interesting results on semantic natural language processing benchmarks such as natural language inference, lexical substitution, paraphrasing, and sentiment classification.
We are currently expanding the notion of distributional context to multiple auxiliary foreign languages at once. This seems to only require minor changes to the generative story and could increase the model's disambiguation power dramatically. Another direction worth exploring is to extend the model's hierarchy with respect to how parallel sentences are generated. For example, modelling sentence level latent variables may capture global constraints and expose additional correlations to the model.
Figure 2 shows multiple runs of our proposed model on sentence evaluation. The first figure reports the mean and two standard deviations (error bars) for benchmarks based on accuracy (ACC), the second figure reports benchmarks based on F1, and finally the third figure reports benchmarks based on correlation metrics Spearman (S) and Pearson (P).