Learning Unsupervised Word Translations Without Adversaries

Word translation, or bilingual dictionary induction, is an important capability that impacts many multilingual language processing tasks. Recent research has shown that word translation can be achieved in an unsupervised manner, without parallel seed dictionaries or aligned corpora. However, state-of-the-art methods for unsupervised bilingual dictionary induction are based on generative adversarial models, and as such suffer from their well-known problems of instability and hyper-parameter sensitivity. We present a statistical dependency-based approach to bilingual dictionary induction that is unsupervised (no seed dictionary or parallel corpora required) and introduces no adversary, and is therefore much easier to train. Our method performs comparably to adversarial alternatives and outperforms prior non-adversarial methods.


Introduction
Translating words between languages, or more generally inferring bilingual dictionaries, is a long-studied research direction with applications including machine translation (Lample et al., 2017), multilingual word embeddings (Klementiev et al., 2012), and knowledge transfer to low-resource languages (Guo et al., 2016). Research here has a long history under the guise of decipherment (Knight et al., 2006). Contemporary methods have achieved effective word translation through theme-aligned corpora (Gouws et al., 2015) or seed dictionaries (Mikolov et al., 2013). Mikolov et al. (2013) showed that monolingual word embeddings exhibit isomorphism across languages, and can be aligned with a simple linear transformation. Given two sets of word vectors learned independently from monolingual corpora, and a dictionary of seed pairs from which to learn a linear transformation for alignment, they were able to estimate a complete bilingual lexicon. Many studies have since followed this approach, proposing various improvements such as orthogonal mappings (Artetxe et al., 2016) and improved objectives (Lazaridou et al., 2015).
Obtaining aligned corpora or bilingual seed dictionaries is nevertheless not straightforward for all language pairs. This has motivated a wave of very recent research into unsupervised word translation: inducing bilingual dictionaries given only monolingual word embeddings (Conneau et al., 2018; Zhang et al., 2017b,a; Artetxe et al., 2017). The most successful methods have leveraged ideas from Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). In this approach the generator provides the cross-modal mapping, taking embeddings of dictionary words in one language and 'generating' their translation in another. The discriminator tries to distinguish between this 'fake' set of translations and the true dictionary of embeddings in the target language. The two play a competitive game, and if the generator learns to fool the discriminator, then its cross-modal mapping should be capable of inducing a complete dictionary, as per Mikolov et al. (2013).
Despite these successes, such adversarial methods have a number of well-known drawbacks (Arjovsky et al., 2017). Due to the nature of the min-max game, adversarial training is very unstable and prone to divergence. It is extremely hyper-parameter sensitive, requiring problem-specific tuning. Convergence is also hard to diagnose and does not correspond well to the efficacy of the generator in downstream tasks (Hoshen and Wolf, 2018).
In this paper, we propose an alternative statistical dependency-based approach to unsupervised word translation. Specifically, we propose to search for the cross-lingual word pairing that maximizes statistical dependency in terms of squared-loss mutual information (SMI) (Yamada et al., 2015; Suzuki and Sugiyama, 2010). Compared to prior statistical dependency-based approaches such as Kernelized Sorting (KS) (Quadrianto et al., 2009), we advance: (i) through use of SMI rather than their Hilbert-Schmidt Independence Criterion (HSIC) and (ii) through jointly optimising cross-modal pairing with representation learning within each view. In contrast to prior work that uses a fixed representation, we non-linearly project the monolingual word vectors before matching, and thereby learn a new embedding in which statistical dependency is easier to establish. Our method: (i) achieves similar unsupervised translation performance to recent adversarial methods, while being significantly easier to train and (ii) clearly outperforms prior non-adversarial methods.
Proposed model

Deep Distribution Matching
Let dataset $D = (\{x_i\}_{i=1}^n, \{y_j\}_{j=1}^n)$ contain two sets of unpaired monolingual word embeddings from two languages. Let $\pi$ be a permutation function over $\{1, 2, \ldots, n\}$, and $\Pi$ the corresponding permutation indicator matrix: $\Pi \in \{0, 1\}^{n \times n}$, $\Pi 1_n = 1_n$, and $\Pi^\top 1_n = 1_n$, where $1_n$ is the $n$-dimensional vector of all ones. We aim to optimize both the permutation $\Pi$ (the bilingual dictionary) and non-linear transformations $g_x(\cdot)$ and $g_y(\cdot)$ of the respective word vectors so as to maximize statistical dependency between the views, while regularising by requiring that the original word embedding information is preserved through reconstruction using decoders $f_x(\cdot)$ and $f_y(\cdot)$. In our overall loss function, Eq. (1), the $\Theta$s parameterize the encoding and reconstruction transformations, $R(\cdot)$ is a regularizer (e.g., the $\ell_2$-norm or $\ell_1$-norm), and $D_\Pi(\cdot, \cdot)$ is a statistical dependency measure. Crucially, compared to prior methods such as matching CCA (Haghighi et al., 2008), dependency measures such as SMI do not need comparable representations to get started, making the bootstrapping problem less severe.

Squared-Loss Mutual Information (SMI)
The squared-loss mutual information between two random variables $x$ and $y$ is defined as (Suzuki and Sugiyama, 2010):
$$\mathrm{SMI} = \frac{1}{2} \iint \left( \frac{p(x, y)}{p(x)\,p(y)} - 1 \right)^2 p(x)\,p(y)\, \mathrm{d}x\, \mathrm{d}y,$$
which is the Pearson divergence (Pearson, 1900) from $p(x, y)$ to $p(x)p(y)$. The SMI is an f-divergence (Ali and Silvey, 1966); that is, it is a non-negative measure and is zero only if the random variables are independent.
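As a worked illustration of this definition, the sketch below evaluates SMI for small discrete distributions, where sums replace the integrals: one half of the sum over (x, y) of p(x)p(y) times the squared deviation of the ratio p(x, y) / (p(x)p(y)) from one. The function name `smi_discrete` and the two toy distributions are assumptions made for illustration, and all marginals are assumed to be non-zero.

```python
import numpy as np

def smi_discrete(joint):
    """Squared-loss mutual information of a discrete joint distribution.

    joint[i, j] holds p(x=i, y=j); sums replace the integrals of the
    continuous definition. All marginals are assumed to be non-zero.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (a, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, b)
    product = px * py                       # p(x) p(y)
    ratio = joint / product                 # p(x, y) / (p(x) p(y))
    return 0.5 * float(np.sum(product * (ratio - 1.0) ** 2))

# Independent variables: the joint factorizes, so the ratio is 1 everywhere.
independent = np.outer([0.3, 0.7], [0.6, 0.4])
# Fully dependent binary variables: y is a copy of x.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

smi_independent = smi_discrete(independent)   # zero up to rounding
smi_dependent = smi_discrete(dependent)       # 0.5
```

In the dependent case the ratio is 2 on the diagonal and 0 elsewhere, so each of the four cells contributes 0.25 to the sum and the SMI is 0.5.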
To measure SMI from a set of samples we take a direct density ratio estimation approach (Suzuki and Sugiyama, 2010), which leads (Yamada et al., 2015) to an estimator expressed in terms of $K \in \mathbb{R}^{n \times n}$ and $L \in \mathbb{R}^{n \times n}$, the Gram matrices for $x$ and $y$ respectively, where $\lambda > 0$ is a regularization parameter and $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix.
SMI for Matching. SMI computes the dependency between two sets of variables, under an assumption of known correspondence. In our application this corresponds to a measure of dependency between two aligned sets of monolingual word vectors. To exploit SMI for matching, we introduce a permutation variable $\Pi$ by replacing $L \to \Pi^\top L \Pi$ in the estimator, which enables optimizing $\Pi$ to maximize SMI.
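To make the role of the Gram matrices, $\lambda$ and $I_n$ concrete, the following is a minimal sketch of a least-squares SMI estimator. The specific form used here, $\hat{\alpha} = (\hat{H} + \lambda I_n)^{-1} \hat{h}$ with $\hat{H} = (K K^\top) \circ (L L^\top) / n^2$ and $\hat{h} = (K \circ L) 1_n / n$, is an assumption based on the least-squares mutual information literature rather than a quotation from the text. The function names, the value of $\lambda$, the degree-2 polynomial kernel and the toy data are likewise hypothetical.

```python
import numpy as np

def lsmi(K, L, lam=0.1):
    """Least-squares SMI estimate from two symmetric n x n Gram matrices.

    Assumed form (not quoted from the text):
        H = (K K^T) o (L L^T) / n^2,   h = (K o L) 1_n / n,
        alpha = (H + lam * I_n)^{-1} h,   SMI ~ h^T alpha / 2 - 1/2,
    where "o" denotes the elementwise product.
    """
    n = K.shape[0]
    H = (K @ K.T) * (L @ L.T) / n**2
    h = (K * L).sum(axis=1) / n
    alpha = np.linalg.solve(H + lam * np.eye(n), h)
    return 0.5 * float(h @ alpha) - 0.5, alpha

def permute_gram(L, pi):
    """Return Pi^T L Pi for the permutation pi, i.e. entries L[pi[i], pi[j]]."""
    return L[np.ix_(pi, pi)]

def unit_rows(Z):
    """Scale every row to unit L2 norm."""
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Toy data: y is a noisy copy of x, presented in shuffled order.
rng = np.random.default_rng(0)
n = 30
x = unit_rows(rng.normal(size=(n, 3)))
shuffle = rng.permutation(n)
y = unit_rows(x + 0.05 * rng.normal(size=x.shape))[shuffle]

K = (x @ x.T + 1.0) ** 2            # degree-2 polynomial Gram matrices (assumed)
L = (y @ y.T + 1.0) ** 2

# SMI estimate for the pairing as given, and with the true pairing restored.
smi_unmatched, _ = lsmi(K, L)
L_matched = permute_gram(L, np.argsort(shuffle))
smi_matched, alpha = lsmi(K, L_matched)

# Equivalent trace form of the same estimate (valid for symmetric L).
trace_form = np.trace(np.diag(alpha) @ K @ L_matched) / (2 * n) - 0.5
```

Under this assumed form, substituting $L \to \Pi^\top L \Pi$ only reindexes the second Gram matrix, which is what `permute_gram` does. One would expect the restored pairing to score higher than the shuffled one, although the sketch does not verify this.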

Optimization of parameters
To initialize $\Theta_x$ and $\Theta_y$, we first estimate them independently using autoencoders. Then we apply alternating optimization to Eq. (1) over $(\Theta_x, \Theta_y)$ and $\Pi$ until convergence. We use 3-layer MLP neural networks for both $f$ and $g$. Algorithm 1 summarises the steps.
Optimization for $\Theta_x$ and $\Theta_y$. With the permutation matrix $\Pi$ (or $\pi$) fixed, minimizing the objective over $\Theta_x, \Theta_y$ is an autoencoder optimization with the regularizer $D_\Pi(\cdot)$, and can be solved with backpropagation.
Optimization for $\Pi$. To find the permutation (word matching) $\Pi$ that maximizes SMI given fixed encoding parameters $\Theta_x, \Theta_y$, we only need to optimize the dependency term $D_\Pi$ in Eq. (1). We employ the LSOM algorithm (Yamada et al., 2015), which maximizes over $\Pi$ the SMI estimator computed on the encoded samples $\{g_x(x_i), g_y(y_{\pi(i)})\}_{i=1}^n$. Since this optimization problem is NP-hard, we iteratively solve a relaxed problem (Yamada et al., 2015) with step size $0 < \eta \le 1$. Each relaxed step is a linear assignment problem (LAP), which we can solve efficiently using the Hungarian method (Kuhn, 1955). To obtain a discrete $\Pi$, we perform the last step with $\eta = 1$.
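The relaxed step can be illustrated with a short sketch. It assumes the update takes the form of a convex combination, $(1 - \eta)\,\Pi^{\text{old}} + \eta\,\Pi^{\text{LAP}}$, where $\Pi^{\text{LAP}}$ is the hard assignment that maximizes a score matrix. In LSOM that score would be derived from the gradient of the SMI estimate with respect to $\Pi$; here the score matrix is a hand-written stand-in, and `lap_step` is a hypothetical helper name. SciPy's `linear_sum_assignment` is used as the solver, since it solves the same linear assignment problem as the Hungarian method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lap_step(score, Pi_old, eta=0.75):
    """One relaxed update: move Pi_old toward the best hard assignment.

    score[i, j] is a hypothetical payoff for matching item i with item j.
    """
    # Solve the linear assignment problem max_Pi sum_ij Pi[i, j] * score[i, j].
    rows, cols = linear_sum_assignment(score, maximize=True)
    Pi_hard = np.zeros_like(Pi_old)
    Pi_hard[rows, cols] = 1.0
    # A convex combination of doubly stochastic matrices is doubly stochastic.
    return (1.0 - eta) * Pi_old + eta * Pi_hard

score = np.array([[1.0, 5.0, 2.0],
                  [4.0, 1.0, 1.0],
                  [0.0, 2.0, 6.0]])
Pi_init = np.full((3, 3), 1.0 / 3.0)                  # uninformative starting point

Pi_relaxed = lap_step(score, Pi_init, eta=0.75)       # soft intermediate iterate
Pi_final = lap_step(score, Pi_relaxed, eta=1.0)       # eta = 1 yields a discrete Pi
matching = Pi_final.argmax(axis=1)                    # item i is matched to matching[i]
```

For this score matrix the best assignment pairs 0 with 1, 1 with 0, and 2 with 2 (total score 15), and the final call with $\eta = 1$ returns exactly that permutation matrix.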
Intuitively, this can be seen as searching for the permutation $\Pi$ for which the data in the two (initially unsorted) views have matching within-view affinity (Gram) matrices, where matching is defined by maximum SMI.
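This intuition can be checked on toy data. In the sketch below the second view is a rotated and shuffled copy of the first, so the two within-view Gram matrices agree exactly once the correct permutation is applied. The linear kernel and the Frobenius-norm mismatch are simplifying stand-ins chosen for illustration (the method itself scores a pairing by SMI), and `permutation_matrix` is a hypothetical helper that follows the convention $(\Pi^\top L \Pi)_{ij} = L_{\pi(i)\pi(j)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))

# Second view: an orthogonal rotation of X with the rows shuffled.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
sigma = np.array([3, 0, 5, 1, 4, 2])        # hidden shuffle: Y[i] = X[sigma[i]] @ Q
Y = (X @ Q)[sigma]

K = X @ X.T                                  # within-view affinities (linear kernel)
L = Y @ Y.T

def permutation_matrix(pi):
    """Indicator matrix with Pi[pi[i], i] = 1, so (Pi.T @ L @ Pi)[i, j] = L[pi[i], pi[j]]."""
    n_items = len(pi)
    Pi = np.zeros((n_items, n_items))
    Pi[pi, np.arange(n_items)] = 1.0
    return Pi

pi_true = np.argsort(sigma)                  # inverse of the hidden shuffle
Pi = permutation_matrix(pi_true)

mismatch_unsorted = np.linalg.norm(K - L)               # pairing as given
mismatch_sorted = np.linalg.norm(K - Pi.T @ L @ Pi)     # correct pairing
```

Because the rotation leaves inner products unchanged, `mismatch_sorted` vanishes up to rounding error while `mismatch_unsorted` does not. `Pi` also satisfies the row-sum and column-sum constraints of a permutation indicator matrix.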

Experiments
In this section, we evaluate the efficacy of our proposed method against various state-of-the-art methods for word translation.

Algorithm 1: SMI-based unsupervised word translation
Input: Unpaired word embeddings $D = (\{x_i\}_{i=1}^n, \{y_j\}_{j=1}^n)$.
1: Init: weights $\Theta_x, \Theta_y$, permutation matrix $\Pi$.
2: while not converged do
3:   Update $\Theta_x, \Theta_y$ given $\Pi$: Backprop (2).
4:   Update $\Pi$ given $\Theta_x, \Theta_y$: LSOM (3).
5: end while
Output: Permutation matrix $\Pi$. Params $\Theta_x, \Theta_y$.

Implementation Details. Our autoencoder consists of two layers with dropout and a tanh nonlinearity. We use a polynomial kernel to compute the Gram matrices $K$ and $L$. For all pairs of languages, we fix the number of training epochs to 20. All the word vectors are normalized to unit $\ell_2$ norm. For CSLS we set the number of neighbors to 10. For optimizing $\Pi$ at each epoch, we set the step size $\eta = 0.75$ and use 20 iterations. For the regularization $R(\Theta)$, we use the sum of the Frobenius norms of the weight matrices. We train $\Theta$ using full-batch gradient descent, with learning rate 0.05.
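The Gram matrix computation mentioned in the implementation details can be sketched as follows. The degree and offset of the polynomial kernel are not stated in the text, so the values below (`degree=2`, `coef0=1.0`) and the helper names are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(Z, eps=1e-12):
    """Scale every row of Z to unit L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)

def polynomial_gram(Z, degree=2, coef0=1.0):
    """Gram matrix with entries (<z_i, z_j> + coef0) ** degree."""
    return (Z @ Z.T + coef0) ** degree

rng = np.random.default_rng(0)
X = l2_normalize(rng.normal(size=(5, 3)))    # stand-in for encoded word vectors
K = polynomial_gram(X)                        # symmetric 5 x 5 Gram matrix
```

With unit-norm inputs every inner product lies in [-1, 1], so the diagonal of `K` is constant, equal to (1 + coef0) ** degree, and the entries stay bounded regardless of the embedding dimension.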

Datasets
We performed experiments on the publicly available English-Italian, English-Spanish and English-Chinese datasets released by Zhang et al. (2017b) and Vulic and Moens (2013). We name this collective set of benchmarks BLI. We also conduct further experiments on a much larger recent public benchmark, MUSE (Conneau et al., 2018).

Setting and Metrics
We evaluate all methods in terms of Precision@1, following standard practice. We note that while various methods in the literature were initially presented as fully supervised (Mikolov et al., 2013), semi-supervised (using a seed dictionary) (Haghighi et al., 2008), or unsupervised (Zhang et al., 2017b), most of them can be straightforwardly adapted to run in any of these settings. Therefore we evaluate all methods both in the unsupervised setting in which we are primarily interested, and also in the commonly evaluated semi-supervised setting with 500 seed pairs.
Competitors: Non-Adversarial. Among competitors that, like us, do not make use of GANs, we evaluate: Translation Matrix (Mikolov et al., 2013), which alternates between estimating a linear transformation by least squares and matching by nearest neighbour (NN); Multilingual Correlation (Faruqui and Dyer, 2014); Matching CCA (Haghighi et al., 2008), which alternates between matching and estimating a joint linear subspace; Kernelized Sorting (Quadrianto et al., 2009), which directly uses HSIC-based statistical dependency to match heterogeneous data points; and Self Training (Artetxe et al., 2017), a recent state-of-the-art method that alternates between estimating an orthonormal transformation and NN matching.

[Table: results on the MUSE dataset and the BLI datasets, with columns es-en, en-es, it-en, en-it, zh-en, en-zh for each; rows are methods, beginning with TM (Mikolov et al., 2013).]
Competitors: Adversarial. Among competitors that do make use of adversarial training, we compare: W-GAN and EMDOT (Zhang et al., 2017b), which use adversarial learning with a Wasserstein GAN and Earth Mover's Distance respectively; and GAN-NN (Conneau et al., 2018), which uses adversarial learning to train an orthogonal transformation, along with some refinement steps and an improvement to the conventional NN matching procedure called 'cross-domain similarity local scaling' (CSLS). Since CSLS is a distinct step, we also evaluate our method with it. We use the provided code for GAN-NN and Self-Train, while re-implementing EMDOT/W-GAN to avoid a dependency on Theano.

Fully Unsupervised
Table 1 presents comparative results for unsupervised word translation on BLI and MUSE. From these we observe: (i) Our method (bottom) is consistently and significantly better than the non-adversarial alternatives (top). (ii) Deep-SMI performs comparably to the adversarial alternatives.
All methods generally perform better on the MUSE dataset than on BLI. These differences are due to a few factors: MUSE is a significantly larger dataset than BLI, benefitting methods that can exploit a large amount of training data; and in the ground-truth annotation, BLI contains 1-1 translations while MUSE contains more realistic 1-many translations (a success is counted if any correct translation is picked), making it easier to reach a higher score.
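The scoring rule described above, together with CSLS retrieval, can be sketched as follows. The CSLS score follows the definition of Conneau et al. (2018): twice the cosine similarity minus the mean similarity of each word to its k nearest neighbours in the other language. The function names, the toy embeddings and the gold dictionary are hypothetical, and k is set to 2 only because the toy vocabulary is tiny (the experiments use 10 neighbours).

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS similarity matrix between mapped source and target embeddings.

    Rows of src and tgt are assumed to be L2-normalized, so dot products
    are cosine similarities.
    """
    cos = src @ tgt.T                                   # shape (n_src, n_tgt)
    k_tgt = min(k, cos.shape[1])
    k_src = min(k, cos.shape[0])
    # Mean similarity of each source word to its k nearest target neighbours.
    r_src = np.sort(cos, axis=1)[:, -k_tgt:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source neighbours.
    r_tgt = np.sort(cos, axis=0)[-k_src:, :].mean(axis=0)
    return 2.0 * cos - r_src[:, None] - r_tgt[None, :]

def precision_at_1(scores, gold):
    """Fraction of source words whose top-scoring target is an accepted translation.

    gold[i] is the set of target indices accepted for source word i, which
    covers both 1-1 and 1-many dictionaries.
    """
    predictions = scores.argmax(axis=1)
    hits = [int(p) in gold[i] for i, p in enumerate(predictions)]
    return float(np.mean(hits))

# Toy vocabulary: four target words and three source words lying near
# targets 2, 0 and 3 respectively.
tgt = np.eye(4)
src = np.array([[0.1, 0.0, 1.0, 0.0],
                [1.0, 0.1, 0.0, 0.0],
                [0.0, 0.0, 0.1, 1.0]])
src = src / np.linalg.norm(src, axis=1, keepdims=True)

scores = csls_scores(src, tgt, k=2)
predicted = scores.argmax(axis=1)

# Source word 1 has two accepted translations (1-many); source word 2 is
# deliberately given a gold translation that the retrieval misses.
gold = [{2}, {0, 1}, {1}]
p_at_1 = precision_at_1(scores, gold)
```

Here the retrieved targets are 2, 0 and 3, so two of the three source words count as successes and Precision@1 is 2/3. Allowing a set of accepted translations per source word is what makes the 1-many setting easier than the 1-1 setting.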

Semi-supervised
Results using a 500-word bilingual seed dictionary are presented in Table 2. From these we observe: (i) The performance of the conventional methods (top) jumps, showing that they are more competitive if at least a small amount of paired data is available. (ii) Deep-SMI performance also improves, and still outperforms the classic methods significantly overall. (iii) Again, we perform comparably to the GAN methods.

Discussion
Figure 1 shows the convergence process of Deep-SMI. From this we see that: (i) Unlike the adversarial methods, our objective (Eq. (1)) improves smoothly over time, making convergence much easier to assess. (ii) Unlike the adversarial methods, our accuracy generally mirrors the model's loss. In contrast, the various losses of the adversarial approaches do not reflect translation accuracy well, making model selection or early stopping a challenge in itself.
There are two steps in our optimization: updating the matching permutation $\Pi$ and updating the representation weights $\Theta$. Although this is an alternating optimization, it is analogous to an EM-type algorithm optimizing latent variables ($\Pi$) and parameters ($\Theta$). While local minima are a risk, every optimisation step for either variable reduces our objective Eq. (1). There is no min-max game, so there is no risk of divergence as in the case of adversarial GAN-type methods.
Our method can also be understood as providing an unsupervised Deep-CCA type model for relating heterogeneous data across two views. This is in contrast to the recently proposed unsupervised shallow CCA (Hoshen and Wolf, 2018) and to conventional supervised Deep-CCA (Chang et al., 2018), which requires paired data for training. Our method also uses SMI rather than correlation as the optimisation objective.