Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders

Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at word, sentence or at least document level. We hypothesize, however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analyzing the monolingual distribution of words. In order to evaluate this hypothesis, we propose a scheme to map word vectors trained on a source language to vectors semantically compatible with word vectors trained on a target language using an adversarial autoencoder. We present preliminary qualitative results and discuss possible future developments of this technique, such as applications to cross-lingual sentence representations.


Introduction
Distributed representations that map words, sentences, paragraphs or documents to vectors of real numbers have proven extremely useful for a variety of natural language processing tasks (Bengio et al., 2006; Collobert and Weston, 2008; Turian et al., 2010; Maas et al., 2011; Mikolov et al., 2013b; Socher et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014; Le and Mikolov, 2014; Baroni et al., 2014; Levy et al., 2015), as they provide an effective way to inject into machine learning models general prior knowledge about language automatically obtained from inexpensive unannotated corpora. Based on the assumption that different languages share a similar semantic structure, various approaches have succeeded in obtaining distributed representations that are compatible across multiple languages, either by learning mappings between different embedding spaces (Mikolov et al., 2013a; Faruqui and Dyer, 2014) or by jointly training cross-lingual representations (Klementiev et al., 2012; Hermann and Blunsom, 2013; Chandar et al., 2014; Gouws et al., 2014). These approaches all require some amount of parallel text, aligned at word level, sentence level or at least document level, or some other kind of parallel resources such as dictionaries (Ammar et al., 2016).
In this work we explore whether the assumption of a shared semantic structure between languages is strong enough to allow compatible distributed representations to be induced without using any parallel resource. We only require monolingual corpora that are thematically similar between languages in a general sense.
We hypothesize that there exists a suitable vector space such that each language can be viewed as a random process that produces vectors at some level of granularity (words, sentences, paragraphs, documents), which are then encoded as discrete surface forms. We further hypothesize that, if languages are used to convey thematically similar information in similar contexts, these random processes should be approximately isomorphic between languages, and that this isomorphism can be learned from the statistics of the realizations of these processes (the monolingual corpora), in principle without any form of explicit alignment.
We motivate this hypothesis by observing that humans who acquire multiple languages, especially young children, can often do so with relatively little exposure to explicitly aligned parallel linguistic information; at best they may have access to distant and noisy alignment information in the form of multisensory environmental clues. Nevertheless, multilingual speakers are always automatically able to translate between all the languages that they can speak, which suggests that their brain either uses a shared conceptual representation for the different surface features of each language, or uses distinct but near-isomorphic representations that can be easily transformed into each other.

Learning word embedding cross-lingual mappings with adversarial autoencoders
The problem of learning transformations between probability distributions of real vectors has been studied in the context of generative neural network models, with approaches such as Generative Moment Matching Networks (GMMNs) (Li et al., 2015) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In this work we consider GANs, since their effectiveness has been demonstrated in the literature more thoroughly than that of GMMNs.
In a typical GAN, we wish to train a generator model, usually a neural network, to transform samples from a known, easy-to-sample, uninformative distribution (e.g. Gaussian or uniform) into samples distributed according to a target distribution defined implicitly by a training set. In order to do so, we iteratively alternate between training a differentiable discriminator model, also a neural network, to distinguish between training samples and artificial samples produced by the generator, and training the generator to fool the discriminator into misclassifying the artificial examples as training examples. This can be done with conventional gradient-based optimization because the discriminator is differentiable, so gradients can be backpropagated through it into the generator.
It can be proven that, with sufficient model capacity and optimization power, sufficient entropy (information dimension) of the generator input distribution, and in the limit of infinite training set size, the generator learns to produce samples from the correct distribution. Intuitively, if there is any computable test that distinguishes the artificial samples from the training samples with better than random guessing probability, then a sufficiently powerful discriminator will eventually learn to exploit it and a sufficiently powerful generator will eventually learn to counter it, until the generator output distribution becomes indistinguishable from the true training distribution. In practice, actual models have finite capacity, and gradient-based optimization algorithms can become unstable or stuck when applied to this multi-objective optimization problem, though they have been successfully used to generate fairly realistic-looking images (Denton et al., 2015; Radford et al., 2015).
In our preliminary experiments we attempted to adapt GANs to our problem by training the generator to learn a transformation between word embeddings trained on different languages (fig. 1). Let $d$ be the embedding dimensionality, $G_{\theta_G} : \mathbb{R}^d \to \mathbb{R}^d$ be the generator parametrized by $\theta_G$, and $D_{\theta_D} : \mathbb{R}^d \to [0, 1]$ be the discriminator parametrized by $\theta_D$. At each training step:
1. draw a sample $\{f\}_n$ of $n$ source embeddings, according to their (adjusted) word frequencies
2. transform them into target-like embeddings $\{\hat{e}\}_n = G_{\theta_G}(\{f\}_n)$
3. evaluate them with the discriminator, estimating their probability of having been sampled from the true target distribution, $\{p\}_n = D_{\theta_D}(\{\hat{e}\}_n)$
4. update the generator parameters $\theta_G$ to reduce the average adversarial loss $L_a = -\log(\{p\}_n)$
5. draw a sample $\{e\}_n$ of $n$ true target embeddings
6. update the discriminator parameters $\theta_D$ to reduce its binary cross-entropy loss on the classification between $\{e\}_n$ (positive class) and $\{\hat{e}\}_n$ (negative class)
These steps are repeated until convergence.
Unfortunately, we found that in this setup, even with different network architectures and hyperparameters, the model quickly converges to a pathological solution where the generator always emits constant or near-constant samples that nevertheless fool the discriminator. This appears to be an extreme case of the known mode-seeking issue of GANs (Radford et al., 2015; Theis et al., 2015; Salimans et al., 2016), which is probably exacerbated in our setting by the point-mass nature of our probability distributions, where each word embedding is a mode of its own.
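The two losses used in steps 4 and 6 can be sketched as follows. This is a minimal NumPy illustration, not the authors' Theano code: the function names, the averaging over the batch, and the clipping constant `eps` are assumptions made here for numerical convenience.

```python
import numpy as np

def adversarial_loss(p_fake, eps=1e-12):
    """Generator loss: average of -log(p) over discriminator outputs on generated samples."""
    p_fake = np.clip(p_fake, eps, 1.0)  # eps is an assumed guard against log(0)
    return float(-np.mean(np.log(p_fake)))

def discriminator_bce(p_real, p_fake, eps=1e-12):
    """Binary cross-entropy: true target embeddings are class 1, generated ones class 0."""
    p_real = np.clip(p_real, eps, 1.0 - eps)
    p_fake = np.clip(p_fake, eps, 1.0 - eps)
    loss_real = -np.log(p_real)        # penalize low probability on real samples
    loss_fake = -np.log(1.0 - p_fake)  # penalize high probability on generated samples
    return float(np.mean(np.concatenate([loss_real, loss_fake])))
```

When the discriminator outputs 0.5 for every sample, both losses equal log 2, which is the value expected once the generated and true distributions can no longer be told apart.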
In order to avoid these pathological solutions, we needed a way to penalize the generator for destroying too much information about its input. Therefore we turned our attention to Adversarial Autoencoders (AAEs) (Makhzani et al., 2015). In an AAE, the generator, now called the encoder, is paired with another model, the decoder $R_{\theta_R} : \mathbb{R}^d \to \mathbb{R}^d$ parametrized by $\theta_R$, which attempts to transform the artificial samples emitted by the encoder back into the input samples. The encoder and the decoder are jointly trained to minimize a combination of the average reconstruction loss $L_r(\{f\}_n, R_{\theta_R}(G_{\theta_G}(\{f\}_n)))$ and the adversarial loss defined as above. The discriminator is trained as above. In the original formulation of the AAE, the discriminator is used to enforce a known prior (e.g. Gaussian or Gaussian mixture) on the intermediate, latent representation. In our setting we instead use it to match the latent representation to the target embedding distribution, so that the encoder can be used to transform source embeddings into target ones (fig. 2).
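In our experiments, we use the cosine dissimilarity as the reconstruction loss, and as a further penalty we also include the pairwise cosine dissimilarity between the generated latent samples $\{\hat{e}\}_n$ and the true target samples $\{e\}_n$. Therefore, the total loss incurred by the encoder-decoder at each step is
$$L = \lambda_r L_r(\{f\}_n, R_{\theta_R}(\{\hat{e}\}_n)) + \lambda_a L_a(\{p\}_n) + \lambda_c L_c(\{e\}_n, \{\hat{e}\}_n)$$
where $L_c$ denotes the pairwise cosine dissimilarity penalty and $\lambda_r$, $\lambda_a$ and $\lambda_c$ are hyperparameters (all set equal to 1 in our experiments).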
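As an illustration of how the three terms combine, the following NumPy sketch computes the encoder-decoder loss for one batch. It rests on assumptions not stated in the text: cosine dissimilarity is taken to be one minus cosine similarity, the pairwise penalty is averaged over all (generated, target) pairs, and all function and argument names are hypothetical.

```python
import numpy as np

def _unit_rows(x, eps=1e-12):
    """Scale each row to unit Euclidean norm (eps is an assumed guard against zero rows)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cosine_dissimilarity(a, b):
    """Mean over rows of 1 - cos(a_i, b_i); assumed form of the reconstruction loss."""
    return float(np.mean(1.0 - np.sum(_unit_rows(a) * _unit_rows(b), axis=1)))

def pairwise_cosine_dissimilarity(a, b):
    """Mean over all pairs (i, j) of 1 - cos(a_i, b_j); assumed form of the extra penalty."""
    return float(np.mean(1.0 - _unit_rows(a) @ _unit_rows(b).T))

def encoder_decoder_loss(f, f_rec, e_hat, e, p, lam_r=1.0, lam_a=1.0, lam_c=1.0, eps=1e-12):
    """Weighted sum of reconstruction, adversarial and pairwise cosine terms."""
    l_r = cosine_dissimilarity(f, f_rec)                  # source vs. reconstructed source
    l_a = float(-np.mean(np.log(np.clip(p, eps, 1.0))))   # -log of discriminator outputs
    l_c = pairwise_cosine_dissimilarity(e_hat, e)         # generated vs. true target samples
    return lam_r * l_r + lam_a * l_a + lam_c * l_c
```

Here `f` and `f_rec` are the source embeddings and their reconstructions, `e_hat` and `e` are the generated and true target embeddings (one embedding per row), and `p` holds the discriminator outputs on `e_hat`.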

Experiments
We performed some preliminary exploratory experiments on our model. In this section we report salient results.
The first experiment is qualitative, to assess whether our model is able to learn any semantically sensible transformation at all. We consider English to Italian embedding mapping.
We train English and Italian word embeddings on randomly subsampled Wikipedia corpora consisting of about 1.5 million sentences per language. We use word2vec (Mikolov et al., 2013b) in skipgram mode to generate embeddings with dimension d = 100. Our encoder and decoder are linear models with tied matrices (one the transpose of the other), initialized as random orthogonal matrices (we also explored deep non-linear autoencoders but we found that they make the optimization more difficult without providing apparent benefits).
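The tied linear encoder-decoder described above can be sketched in a few lines of NumPy. The names below are hypothetical, and obtaining the random orthogonal matrix from the QR decomposition of a Gaussian matrix is one common construction assumed here, not necessarily the initialization used in the experiments.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Return a d x d orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def encode(f, w):
    """Map source embeddings (one per row) into the target embedding space."""
    return f @ w

def decode(e_hat, w):
    """Map back to the source space with the transposed (tied) matrix."""
    return e_hat @ w.T

rng = np.random.default_rng(0)
d = 100                             # embedding dimension used in the first experiment
w = random_orthogonal(d, rng)
f = rng.standard_normal((5, d))     # stand-in for a batch of source embeddings
f_rec = decode(encode(f, w), w)
```

Because the transpose of an orthogonal matrix is its inverse, the reconstruction is exact at initialization; it remains exact only as long as training keeps the matrix orthogonal.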
Our discriminator is a Residual Network (He et al., 2015) without convolutions, with one leaky ReLU non-linearity (Maas et al., 2013) per block, no non-linearities on the passthrough path, batch normalization (Ioffe and Szegedy, 2015) and dropout (Nitish et al., 2014). The block (layer) equation is
$$h_t = \phi(W_t h_{t-1}) + h_{t-1}$$
where $W_t$ is a weight matrix, $\phi$ is batch normalization (with its internal parameters) followed by leaky ReLU, and $h_t$ is a $k$-dimensional block state (in our experiments $k = 40$). The network has $T = 10$ blocks followed by a 1-dimensional output layer with logistic sigmoid activation. We found that using a Residual Network as discriminator rather than a standard multi-layer perceptron yields larger gradients being backpropagated to the generator, facilitating training.
We actually train two discriminators per experiment, with identical structure but different random initializations, and use one to train the generator and the other for monitoring, in order to help us determine whether overfitting or underfitting occurs. At each step, word embeddings are sampled according to their frequency in the original corpora, adjusted to subsample frequent words, as in word2vec. Updates are performed using the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 for the encoder-decoder and 0.01 for the discriminator.
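A forward pass through a discriminator of this shape might look like the NumPy sketch below. Several details are assumptions rather than facts from the experiments: the linear input projection from d to k dimensions, the leaky ReLU slope of 0.01, batch normalization without learned scale and shift, the omission of dropout, and the random weight scaling. All names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU; the slope value is an assumption."""
    return np.where(x > 0, x, slope * x)

def batch_norm(x, eps=1e-5):
    """Simplified batch normalization: batch statistics only, no learned parameters."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def discriminator_forward(x, w_in, blocks, w_out):
    """Residual blocks h <- phi(h W_t) + h, then a sigmoid output unit."""
    h = x @ w_in                                   # assumed linear projection d -> k
    for w_t in blocks:
        h = leaky_relu(batch_norm(h @ w_t)) + h    # passthrough path has no non-linearity
    logits = (h @ w_out)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))           # probability of "true target sample"

rng = np.random.default_rng(0)
d, k, T, n = 100, 40, 10, 32                       # d, k, T as in the text; n is arbitrary
w_in = rng.standard_normal((d, k)) / np.sqrt(d)
blocks = [rng.standard_normal((k, k)) / np.sqrt(k) for _ in range(T)]
w_out = rng.standard_normal((k, 1)) / np.sqrt(k)
p = discriminator_forward(rng.standard_normal((n, d)), w_in, blocks, w_out)
```

The identity term added in each block gives gradients a direct path from the output back to the input, which is consistent with the observation that the residual discriminator backpropagates larger gradients than a plain multi-layer perceptron.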
The code is implemented in Python, Theano (Theano Development Team, 2016) and Lasagne.
We further evaluate our model on German to English and English to German embedding transformations, using the same evaluation setup as Klementiev et al. (2012), with embeddings trained on the concatenation of the Reuters corpora and the News Commentary 2015 corpora, with embedding dimension d = 40 and discriminator depth T = 4. On qualitative analysis, we notice similar patterns of partial semantic similarity. However, on the cross-lingual document classification task we were able to improve over the baseline only for the smallest training set size.

Discussion and future work
From the qualitative analysis of the word embedding mappings it appears that the model does learn to transfer some semantic information, although it is not competitive with other cross-lingual representation approaches. This may be an issue of hyperparameter choice and architectural details, since, to our knowledge, this is the first work to apply adversarial training techniques to the point-mass distributions arising from NLP tasks.
Further experimentation is needed to determine whether the model can be improved or whether we have already hit a fundamental limit on how much semantic transfer can be performed by monolingual distribution matching alone. This additional experimentation may help to test how strongly our initial hypothesis of semantic isomorphism between languages holds, in particular across languages of different linguistic families.
Even if this hypothesis does not hold in a strong sense and semantic transfer by monolingual text alone turns out to be infeasible, our technique might help in conjunction with training on parallel data. For instance, in neural machine translation "sequence2sequence" transducers without attention, it could be useful to train as usual on parallel sentences and train in autoencoder mode on monolingual sentences, using an adversarial loss computed by a discriminator on the intermediate latent representations to push them to be isomorphic between languages. A modification of this technique that allows for the latent representation to be variable-sized could also be applied to the attentive "sequence2sequence" transducers, as an alternative or in addition to monolingual dataset augmentation by backtranslation (Sennrich et al., 2015).
Furthermore, it may be worth evaluating additional distribution learning approaches such as the aforementioned GMMNs, as well as the more recent BiGAN/ALI framework (Donahue et al., 2016; Dumoulin et al., 2016), which uses an adversarial discriminator loss both to match latent distributions and to enforce reconstruction, and also considering more recent GAN training techniques (Salimans et al., 2016).
In conclusion, we believe that this work initiates a potentially promising line of research in natural language processing, consisting of applying distribution matching techniques such as adversarial training to learn isomorphisms between languages.