Learning Neural Representation for CLIR with Adversarial Framework

Existing studies in cross-language information retrieval (CLIR) mostly rely on general text representation models (e.g., the vector space model or latent semantic analysis), which are not optimized for the target retrieval task. In this paper, we follow the success of neural representation in natural language processing (NLP) and develop a novel text representation model based on adversarial learning, which seeks a task-specific embedding space for CLIR. Adversarial learning is implemented as an interplay between a generator process and a discriminator process. To adapt adversarial learning to CLIR, we design three constraints to direct representation learning: (1) a matching constraint capturing essential characteristics of cross-language ranking, (2) a translation constraint bridging language gaps, and (3) an adversarial constraint forcing both language and source invariance to be reached more efficiently and effectively. Through the joint exploitation of these constraints in an adversarial manner, the underlying cross-language semantics relevant to retrieval tasks are better preserved in the embedding space. Standard CLIR experiments show that our model significantly outperforms state-of-the-art continuous space models and approaches the strong machine translation baseline.


Introduction
Text representation is a crucial problem in most natural language processing (NLP) and information retrieval (IR) tasks. In monolingual IR, early research works mostly use vector space models for query-document semantic matching (Salton et al., 1975), which suffer from the problem of synonymy and polysemy. In order to bridge the lexical gaps, latent semantic models such as latent semantic analysis (LSA) (Deerwester et al., 1990) have been proposed to abstract away from surface text forms to approximate semantics. More recently, text representation learned with neural networks is attracting increasing attention of the IR community (Mitra and Craswell, 2017) and positive results have been reported on various evaluation data sets (Fan et al., 2018).
Compared to the prosperity in monolingual IR, there have been fewer advances in CLIR, where documents are written in a language different from that of the queries. In addition to document ranking, CLIR models need to cross the language barrier, which makes the task intuitively more difficult than monolingual IR. Traditional approaches reduce CLIR to its monolingual counterpart by translating queries and/or documents. The translation is typically performed with either off-the-shelf machine translation (MT) systems or multilingual dictionaries (Nie, 2010). However, MT-based approaches are far from being a suitable solution to CLIR problems (see the detailed analysis in (Zhou et al., 2012)), and dictionary-based approaches suffer from the same problem of lexical gaps as in the monolingual case (Gupta et al., 2017). An effective cross-language representation that can cross both the language and lexical gaps is therefore needed for CLIR.
The most intuitive way to represent text in cross-language settings is to extend models developed for the monolingual environment. Examples include the extension of LSA in (Littman et al., 1998), of principal component analysis (PCA) in (Platt et al., 2010), of the autoencoder model in (Chandar et al., 2014), and of word2vec (Mikolov et al., 2013) in (Vulić and Moens, 2015). These approaches construct cross-language and semantic-rich representations of text, which can be applied to CLIR directly. However, all the models listed here aim to learn general text representation, where the objective is to capture term proximity rather than the relevance that is essential for the retrieval task. A recent work (Gupta et al., 2017) tries to learn task-specific representation for CLIR. However, their model only captures ranking signals in monolingual settings, which does not necessarily generalize well to CLIR.
In this paper, we propose to learn task-specific text representation for CLIR via a novel adversarial learning framework. Following the convention in generative adversarial networks (GAN) (Goodfellow et al., 2014), our representation learning model is realized as an interplay between two processes, an embedding generator (G) and an adversarial discriminator (D), conducted as a minmax game. Within the GAN framework, we design three constraints to direct the representation learning process. CLIR is essentially a ranking problem, so we develop a matching constraint to make sure that documents can be ranked in the right order given a query in another language. The matching constraint considers both cross-language and monolingual pairwise ranking signals, whereas previous studies (e.g., (Gupta et al., 2017)) only consider monolingual matching signals. Meanwhile, a translation constraint is imposed on the latent representation to bridge the language gaps. These two constraints direct the encoding networks to generate a language-invariant and task-specific representation in the embedding space. Lastly, an adversarial constraint is proposed to force both language and source invariance to be reached more efficiently and effectively. Through the joint exploitation of these constraints in an adversarial manner, the process converges to an embedding space optimized for CLIR. Comprehensive CLIR experiments reveal that our model is superior to state-of-the-art continuous space models and approaches the machine translation and monolingual baselines.

Related work
Text representation has been a long-standing research question in IR. Classic methods such as the vector space model are not able to deal with lexical gaps between queries and documents, resulting in inferior retrieval performance. Latent semantic approaches such as LSA (Deerwester et al., 1990) and latent Dirichlet allocation (LDA) (Blei et al., 2003) abstract away from surface text forms to alleviate sparsity and approximate semantics. More recently, learning-based approaches with neural networks have gained great success in NLP (Baroni et al., 2014) and started to attract increasing interest from the IR community. In terms of word-level embedding, word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are two models that have been cited frequently in recent literature. These two models provide semantic-rich representations to bridge lexical gaps between queries and documents, and have been used broadly in neural IR studies (Ganguly et al., 2015; Zheng and Callan, 2015; Zamani and Croft, 2016).
The above studies deal with monolingual text representation and are related to the cross-language models presented below. As for CLIR, typical approaches reduce CLIR to its monolingual counterpart by performing some form of translation. Machine translation systems such as Google translator have been widely used to translate queries or documents, serving as a default and convenient translation option in CLIR. They are however far from being a suitable solution to CLIR problems (a detailed analysis can be found in (Zhou et al., 2012)). An alternative solution is to rely on multilingual dictionaries to perform lexicon-level translation, mostly in combination with either a language modeling strategy (Kraaij et al., 2003) or a query structuring framework (Pirkola, 1998). However, dictionary-based methods still suffer from the lexical gap problem, which reduces their performance in CLIR.
In fact, researchers have extended monolingual models and developed continuous space models for cross-language tasks to capture rich semantics. These cross-language extensions can be applied to CLIR directly. For instance, Littman et al. (1998) extend LSA to its cross-language version CL-LSA by concatenating the document-term matrices of parallel data, so that the concatenated documents act as dual-language documents to be learned by LSA. Such a methodology leads to a dual-language semantic space in which terms from both languages are represented. Vinokourov et al. (2002) use parallel data to find the most likely correlations between projected vectors based on canonical correlation analysis. The OPCA model (Platt et al., 2010) starts with the basic PCA model, which is then made discriminative by encouraging comparable document pairs to have similar vector representations. Compared to CL-LSA, OPCA avoids the use of artificially concatenated documents. More recently, neural models have been employed to learn cross-language representations. For instance, the autoencoder is extended to a bilingual version, BAE, in (Chandar et al., 2014), which learns vector representations of words from aligned sentences. Yih et al. (2011) develop S2Net, implemented with a Siamese neural network framework, to learn a projection matrix that maps the corresponding term vectors into a latent space where similar documents are close. Vulić and Moens (2015) first merge the two documents of each aligned document pair in a comparable corpus and then train word2vec on the resulting pseudo-bilingual documents to obtain cross-language embeddings. The above approaches learn general text representation that captures term proximity rather than relevance, which is important for the retrieval task (Zamani and Croft, 2017). A recent work (Gupta et al., 2017) tries to learn task-specific embeddings for CLIR. However, it learns ranking signals by preserving pairwise ranking in monolingual settings prior to a transfer learning process to another language, which does not necessarily generalize well to CLIR.
The above analysis shows that most existing approaches, whether based on neural networks or not, learn general embeddings that are not tailored to CLIR. We argue that task-specific embeddings are superior, a claim that is inspired by monolingual IR studies and validated by the CLIR experiments in this paper. To this end, we learn cross-language and task-specific embeddings for CLIR via a novel text representation model based on adversarial learning (Goodfellow et al., 2014).

Representation learning framework
In this section, we present a neural representation learning framework for CLIR. As discussed before, the framework is based on adversarial learning, realized as an interplay between a generator process and a discriminator process. We develop three constraints, namely a matching constraint, a translation constraint and an adversarial constraint, to direct the learning of cross-language and task-specific text embeddings. For ease of presentation, let us assume that in CLIR we have a source language query $q_s$ and a target language document $d_t$. The translation of $q_s$ in the target language is $q_t$. The learning framework is illustrated in figure 1. It consists of an adversarial network $NN_{adv}$, three dimension adaptation networks $NN_{dim}$ and three encoding networks for $q_t$, $d_t$ and $q_s$ respectively.

Text representation networks
Various approaches can be used to encode sentences or documents into dense vectors. For instance, models based on convolutional neural networks (Kalchbrenner et al., 2014) and models based on recurrent neural networks (Liu et al., 2016) have been popular choices. In order to map queries and documents into the embedding space, we make use of a recurrent neural network with the long short-term memory (LSTM) architecture, which can deal with the vanishing and exploding gradient problems (Hochreiter and Schmidhuber, 1997). We present here the derivation details of LSTM for the sake of clarity. The LSTM framework consists of several gates that control the cell state in the network. Firstly, a forget gate $f$ (a sigmoid layer) functions according to:
$$f_\tau = \sigma(W_f \cdot [h_{\tau-1}, x_\tau] + b_f)$$
Then, an input gate $i$ (a sigmoid layer) and a tanh layer work together as follows:
$$i_\tau = \sigma(W_i \cdot [h_{\tau-1}, x_\tau] + b_i), \qquad \tilde{C}_\tau = \tanh(W_C \cdot [h_{\tau-1}, x_\tau] + b_C)$$
With the forget gate $f$, the input gate $i$ and the new value $\tilde{C}$, one can update the cell state $C$ as:
$$C_\tau = f_\tau \odot C_{\tau-1} + i_\tau \odot \tilde{C}_\tau$$
Lastly, an output gate $o$ (a sigmoid layer) outputs:
$$o_\tau = \sigma(W_o \cdot [h_{\tau-1}, x_\tau] + b_o), \qquad h_\tau = o_\tau \odot \tanh(C_\tau)$$
In the above equations, $x_\tau$ is the input at time step $\tau$, and $h_\tau$ and $h_{\tau-1}$ denote the hidden states at time steps $\tau$ and $\tau-1$. All $W$ and $b$ are parameters. For brevity, we can write the update process as:
$$h_\tau = LSTM(h_{\tau-1}, x_\tau)$$
Given a text sequence $x = (x_1, x_2, \ldots, x_l)$, typical methods take the output $h_l$ of LSTM at the last time step $l$ as the condensed representation of the whole sequence $x$ (Sutskever et al., 2014). Since queries in IR tasks tend to be short and noisy, we make use of bidirectional LSTM with pooling (Tan et al., 2015) to obtain a more effective text representation from all the hidden states $h_{1:l}$. The sequence $x$ is fed from left to right into $LSTM_a$ and from right to left into $LSTM_b$. The new hidden state $h_\tau^{ab}$ at time step $\tau$ is obtained by concatenating the hidden states of $LSTM_a$ and $LSTM_b$ at their respective time step $\tau$. Since max-pooling has been proven to be efficient in similar tasks (Tan et al., 2015), the latent representation $z_x$ of $x$ can be formulated as:
$$z_x = NN_{dim}\big(\text{max-pool}(h^{ab}_{1:l})\big)$$
where $x$ can be $q_s$, $q_t$ or $d_t$. $NN_{dim}$ is designed to adapt the output dimension and to allow further flexibility for representation learning.
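As an illustration of the pooling step, the following minimal Python sketch concatenates the forward and backward hidden states at each time step and takes the element-wise maximum over time. The hidden states are hard-coded toy values rather than outputs of a trained LSTM, the function names `concat_directions` and `max_pool` are hypothetical, and the dimension adaptation network is omitted.

```python
from typing import List, Sequence

Vector = List[float]


def concat_directions(forward: Sequence[Vector], backward: Sequence[Vector]) -> List[Vector]:
    """Concatenate forward and backward hidden states position by position.

    backward[t] is assumed to be already aligned with time step t.
    """
    if len(forward) != len(backward):
        raise ValueError("both directions must cover the same time steps")
    return [list(f) + list(b) for f, b in zip(forward, backward)]


def max_pool(states: Sequence[Vector]) -> Vector:
    """Element-wise maximum over all time steps."""
    if not states:
        raise ValueError("at least one hidden state is required")
    return [max(column) for column in zip(*states)]


# Toy hidden states for a sequence of length 3 (assumed values).
forward = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.6]]
backward = [[0.7, 0.0], [-0.1, 0.9], [0.3, -0.2]]

pooled = max_pool(concat_directions(forward, backward))
print(pooled)  # [0.4, 0.6, 0.7, 0.9]
```

The pooled vector has twice the hidden size of a single direction and does not depend on the sequence length, which is what allows short queries and long documents to share one embedding space.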

Matching constraint and Translation constraint
Document ranking is the central problem in both monolingual IR and CLIR tasks. CLIR differs from its monolingual counterpart in that the language gap needs to be crossed prior to the retrieval process. Since the choice of translation strategy (query, document or both) affects the design of other components in our model, we will discuss the translation constraint in section 3.2.1 prior to the matching constraints in sections 3.2.2 and 3.2.3.

Translation constraint
The translation constraint is developed to minimize the differences between a pair of parallel texts, which serves as a basic requirement in the translation scenario. Such a constraint directs the learning of language-invariant text representation for CLIR. We follow the arguments in previous studies (Vilares et al., 2016) and choose to translate queries in our model, since it is computationally expensive to translate large-scale document collections in practice. In this paper, we directly employ Google translator to translate queries, which is a popular choice for machine translation that leads to state-of-the-art translation performance. The translation constraint is then imposed on the embedding vectors $z_{q_s}$ and $z_{q_t}$ of the queries $q_s$ and $q_t$. The translation loss $L_{tra}$ on a set $QP$ of query pairs can be defined with the squared L2 norm, which is:
$$L_{tra} = \sum_{(q_s, q_t) \in QP} \left\| z_{q_s} - z_{q_t} \right\|_2^2$$
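A minimal Python sketch of a squared-L2 translation loss over a set of query pairs is given below. It assumes a plain sum over pairs with no normalization, which is one reading of the definition above, and the function name `translation_loss` is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def translation_loss(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Sum of squared L2 distances between paired query embeddings."""
    total = 0.0
    for z_qs, z_qt in pairs:
        if len(z_qs) != len(z_qt):
            raise ValueError("paired embeddings must have the same dimension")
        total += sum((a - b) ** 2 for a, b in zip(z_qs, z_qt))
    return total


# Two toy pairs: squared distances 4.0 and 25.0.
toy_pairs: List[Tuple[Vector, Vector]] = [
    ([1.0, 2.0], [1.0, 0.0]),
    ([0.0, 0.0], [3.0, 4.0]),
]
print(translation_loss(toy_pairs))  # 29.0
```

The loss is zero exactly when every source-language query and its translation are mapped to the same point, which is the language-invariance property the constraint aims for.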

Cross-language matching constraint
The matching constraint captures essential characteristics of cross-language ranking. Following the practice in learning to rank (Liu, 2009), we model document ranking in the pairwise style, where the relevance information is in the form of preferences between pairs of documents with respect to individual queries. In the model for CLIR, since we have matching signals from both monolingual text pairs and cross-language text pairs, the model can benefit from the complementary knowledge of the two resources. The monolingual pairwise matching constraint will be introduced in section 3.2.3. Similar to neural models in monolingual settings (Huang et al., 2013), the cross-language pairwise matching constraint is placed on top of the embedding vectors of the source language query and the target language documents. In figure 1, let us assume $x_{q_s}$ has a relevant document $x_{d_{t+}}$ and an irrelevant document $x_{d_{t-}}$ according to annotated text pairs. In training, the positive sample $x_{d_{t+}}$ for $x_{q_s}$ can be chosen as the most relevant text according to annotation, and the negative sample $x_{d_{t-}}$ is picked randomly from the data collection. The cross-language matching constraint encourages the hidden representation of $x_{d_{t+}}$ to be near the hidden representation of $x_{q_s}$ in the semantic-rich embedding space. Meanwhile, it asks the hidden representation of $x_{d_{t-}}$ to be far from that of $x_{q_s}$. We follow typical neural IR models and use cosine as the similarity measure between hidden vectors. The probability that $d_{t+}$ is ranked higher than $d_{t-}$ given $q_s$ can be derived as:
$$\hat{P}(q_s) = \sigma\big(\beta_c \, (\cos(z_{q_s}, z_{d_{t+}}) - \cos(z_{q_s}, z_{d_{t-}}))\big)$$
where $\sigma$ is the sigmoid function with a hyperparameter $\beta_c$ controlling its shape. The cross-language matching loss $L_{matc}$ on the cross-language triplet set $QD_c$ can be defined with the cross-entropy loss as:
$$L_{matc} = \sum_{(q_s, d_{t+}, d_{t-}) \in QD_c} CE\big(P(q_s), \hat{P}(q_s)\big)$$
where $CE$ denotes the cross-entropy operator between two distributions and $P(q_s)$ is the actual counterpart of $\hat{P}(q_s)$, estimated from annotation with a strategy similar to that in (Dehghani et al., 2017).
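The pairwise matching signal can be sketched in a few lines of Python. The sketch assumes that the preference probability is a sigmoid of the scaled difference between two cosine similarities and that the cross-entropy is taken over the two outcomes (the relevant document ranked first or not); the helper names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def preference_probability(z_q: Vector, z_pos: Vector, z_neg: Vector, beta: float = 1.0) -> float:
    """Estimated probability that the positive document outranks the negative one."""
    margin = cosine(z_q, z_pos) - cosine(z_q, z_neg)
    return 1.0 / (1.0 + math.exp(-beta * margin))


def binary_cross_entropy(p_true: float, p_pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy between the target and the estimated preference."""
    p_pred = min(max(p_pred, eps), 1.0 - eps)
    return -(p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred))


# Toy embeddings: the positive document is aligned with the query,
# the negative one is orthogonal to it.
p_hat = preference_probability([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], beta=1.0)
loss = binary_cross_entropy(1.0, p_hat)
```

A larger `beta` sharpens the sigmoid, so the same cosine margin yields a more confident preference and a smaller loss for correctly ordered pairs.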

Monolingual matching constraint
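The monolingual matching constraint $L_{matm}$ can be built in a way similar to that of $L_{matc}$. $L_{matm}$ is imposed on a set $QD_m$ of monolingual triplets $(q_t, d_{t+}, d_{t-})$ as:
$$L_{matm} = \sum_{(q_t, d_{t+}, d_{t-}) \in QD_m} CE\big(P(q_t), \hat{P}(q_t)\big)$$
where $P(q_t)$ is the actual counterpart of $\hat{P}(q_t)$ estimated from annotation. $\hat{P}(q_t)$ denotes the probability that $d_{t+}$ is ranked higher than $d_{t-}$ given $q_t$. It can be computed with the sigmoid function as:
$$\hat{P}(q_t) = \sigma\big(\beta_m \, (\cos(z_{q_t}, z_{d_{t+}}) - \cos(z_{q_t}, z_{d_{t-}}))\big)$$
where $\beta_m$ is a hyper-parameter.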

Embedding generator constraint
Since our model is implemented within an adversarial framework, we propose to model the representation generator G, which embodies the process of language-invariant and task-specific embedding of queries and documents into a latent subspace, under a combination of the three constraints introduced above. The translation constraint aims to guarantee language invariance when translating queries. The cross-language matching constraint explicitly captures cross-language ranking signals from cross-language text pairs. The monolingual matching constraint takes monolingual ranking into account so as to complement the cross-language ranking signals.
Combining the three constraints above, we obtain a comprehensive constraint that should be obeyed by the embedding generator process. With the regularization term $L_{reg}$ equal to the sum of the Frobenius norms of all weight matrices in the text embedding phase, we can write the embedding generator constraint $L_G$ as: where $\theta_G$ denotes the set of parameters in the generator networks, and $\gamma_1$, $\gamma_2$, $\gamma_3$ are hyperparameters.
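One plausible way to combine the terms is sketched below in Python. The assignment of the three hyperparameters to the translation, cross-language matching and monolingual matching losses, with the regularizer left unweighted, is an assumption made for illustration and is not confirmed by the text; the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def frobenius_norm(matrix: Matrix) -> float:
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def generator_loss(
    l_tra: float,
    l_matc: float,
    l_matm: float,
    weight_matrices: Sequence[Matrix],
    gammas: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the three constraints plus a Frobenius regularizer.

    Assumption: gammas weight (translation, cross-language matching,
    monolingual matching) in that order; the regularizer is unweighted.
    """
    gamma_1, gamma_2, gamma_3 = gammas
    l_reg = sum(frobenius_norm(w) for w in weight_matrices)
    return gamma_1 * l_tra + gamma_2 * l_matc + gamma_3 * l_matm + l_reg


# Toy values: a single 1x2 weight matrix with Frobenius norm 5.0.
total = generator_loss(1.0, 2.0, 3.0, [[[3.0, 4.0]]], gammas=(0.5, 1.0, 2.0))
print(total)  # 13.5
```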

Adversarial constraint
We will introduce the adversarial constraint in this part. GAN (Goodfellow et al., 2014) simultaneously trains a generative model G and a discriminative model D in a competing way. G generates samples from a source of noise $w$ that satisfies $w \sim P_n(w)$ and tries to capture the real data distribution $P_r$. D learns to distinguish between the generated samples from G and the true data sampled from $P_r$ (in practice, from training data). The training procedure for G is to try its best to fool D. Let us assume that G generates samples satisfying the distribution $P_g$ that is implicitly decided by $G(w)$. The GAN value function $V(G, D)$ on which D and G play the minmax game can be written as:
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{w \sim P_n(w)}[\log(1 - D(G(w)))] \qquad (1)$$
Theoretical analysis has indicated that playing the minmax game as above amounts to minimizing the Jensen-Shannon divergence between $P_g$ and $P_r$. We follow the general idea of GAN and develop an adversarial component on top of the embedding space in figure 1. We note that GAN has been used in representation learning in a similar way in (Bousmalis et al., 2016; Liu et al., 2017). In our model in figure 1, the adversarial component $NN_{adv}$ acts as the discriminator D, which tries its best to detect whether the embedding vector $z$ is encoded from $x_{q_t}$, $x_{d_t}$ or $x_{q_s}$. In this paper, $NN_{adv}$ is implemented as a neural network with a softmax output layer. The output of $NN_{adv}$ then corresponds to a probability distribution vector over the input sources. Let us denote the ground truth label of the current input $z$ to $NN_{adv}$ as $l_z$, which indicates the source that $z$ is encoded from. We can adjust equation 1 to our settings and obtain the adversarial loss $L_{adv}$ on a query set $Q_t$ and a document set $D_t$ in the target language, as well as a query set $Q_s$ in the source language, which can be written as: where • is the inner product operator.
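To make the role of the softmax discriminator concrete, the following Python sketch computes a source-classification loss as the inner product between a one-hot label vector and the log of the softmax output, averaged over a batch. The sign convention and the averaging are assumptions for illustration only, and the names `softmax` and `adversarial_loss` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def adversarial_loss(logits_batch: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy between one-hot source labels and softmax outputs.

    labels[i] is the index of the source (for example 0 for the translated
    query, 1 for the document, 2 for the source-language query; assumed order).
    """
    if len(logits_batch) != len(labels) or not labels:
        raise ValueError("need one label per non-empty batch entry")
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits)
        one_hot = [1.0 if index == label else 0.0 for index in range(len(probs))]
        # Inner product between the label vector and the log-probabilities.
        total += -sum(l * math.log(p) for l, p in zip(one_hot, probs))
    return total / len(labels)


# A discriminator that cannot tell the three sources apart.
confused = adversarial_loss([[0.0, 0.0, 0.0]], [1])
# A discriminator that confidently identifies the right source.
confident = adversarial_loss([[10.0, 0.0, 0.0]], [0])
```

Under this convention, a discriminator that cannot separate the three sources yields a loss of log 3, which is the situation the generator tries to bring about.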

Training procedure
Following the training convention of GAN (Goodfellow et al., 2014), the process of learning the language-invariant and task-specific text representation for CLIR should be conducted by jointly minimizing the generator constraint $L_G$ and the adversarial loss $L_{adv}$, which leads us to the combined objective function $L$ as: According to the rule of playing the minmax game in GAN, G tries its best to maximize the probability that D makes a mistake and D tries its best to distinguish between real data and generated data (in our case, various input sources). The theoretical requirement behind GAN that D is maintained near its optimal solution as long as G changes slowly enough motivates us to update the discriminator part $k$ steps per update of the generator part in the iterative optimization process. Based on these discussions, the minmax optimization process can be derived as:
1. Optimize D when fixing G through:
2. Optimize G when fixing D through:
The optimization can be implemented with mini-batch gradient ascent (for $\theta_D$) and descent (for $\theta_G$).
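The alternating schedule can be sketched as follows. The Python code only illustrates the order of updates, with k discriminator steps before each generator step; the two update callbacks are hypothetical placeholders standing in for the actual gradient ascent and descent steps.

```python
from typing import Callable, Dict, List


def alternating_schedule(generator_steps: int, k: int) -> List[str]:
    """Return the update order: k discriminator steps, then one generator step."""
    if generator_steps < 0 or k < 1:
        raise ValueError("generator_steps must be >= 0 and k must be >= 1")
    schedule: List[str] = []
    for _ in range(generator_steps):
        schedule.extend(["D"] * k)
        schedule.append("G")
    return schedule


def train(
    generator_steps: int,
    k: int,
    update_discriminator: Callable[[], None],
    update_generator: Callable[[], None],
) -> None:
    """Run the schedule, calling the matching placeholder for each step."""
    for step in alternating_schedule(generator_steps, k):
        if step == "D":
            update_discriminator()  # generator parameters held fixed
        else:
            update_generator()  # discriminator parameters held fixed


counts: Dict[str, int] = {"D": 0, "G": 0}


def count_d() -> None:
    counts["D"] += 1


def count_g() -> None:
    counts["G"] += 1


train(2, 3, count_d, count_g)
print(counts)  # {'D': 6, 'G': 2}
```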

Experiments and results
In this section, we conduct CLIR experiments so as to compare our text representation model with several other models.

CLIR evaluation sets
To perform CLIR experiments, we rely on the broadly used data sets released in the bilingual tasks of the Cross-Language Evaluation Forum (CLEF). We use the data from the years 2000 to 2004. Table 1 lists the characteristics of the data sets, which include the number of documents ($N_d$), the number of distinct words ($N_w$), the average document length ($DL_{avg}$) and the number of queries ($N_q$) in each task. We use source language queries in French (Fr), German (De) and Italian (It) to retrieve target language documents in English (En). Queries from the years 2000 to 2002 are combined into a single task in table 1 since they have the same target set.

Training set
In order to train the representation learning model, we need to construct a data set consisting of annotated text pairs. We combine AOL queries (Pass et al., 2006) and a set of news titles downloaded from news sites to constitute a diverse training query set. Following previous work (Gupta et al., 2017), we sample a balanced subset (1M) from this query set and use these queries to retrieve from the data collection with BM25. For each training query, we take the top retrieved texts as positive samples, and the negative samples are selected randomly from the data collection. In addition to these pseudo-labeled text pairs of low quality, we use the LETOR4.0 dataset (Qin and Liu, 2013), which was developed for evaluating learning to rank models. The LETOR4.0 dataset consists of relevance judgments of higher quality compared to the pseudo-labeled data. The two data resources can complement each other in the training process.
In our experiments, the pseudo-labeled data is used to train the whole model and the LETOR dataset is employed to fine-tune the parameters relevant to the source queries and target documents, which are more important for the cross-language retrieval task. The hyper-parameters are tuned on a validation set consisting of 20% of the training queries, selected randomly.
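The pseudo-labeling step can be illustrated with a small Python sketch: documents are scored against a query with BM25, the top-scoring document is taken as the positive sample and a negative sample is drawn at random from the rest. The particular BM25 variant (a smoothed idf and the defaults k1 = 1.2, b = 0.75), the use of a single positive per query and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple


def bm25_scores(query: Sequence[str], docs: Sequence[Sequence[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query with BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


def make_triplet(query: Sequence[str], docs: Sequence[Sequence[str]],
                 rng: random.Random) -> Tuple[int, int]:
    """Return (positive index, negative index) for one training query."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to form a triplet")
    scores = bm25_scores(query, docs)
    positive = max(range(len(docs)), key=lambda index: scores[index])
    negative = rng.choice([index for index in range(len(docs)) if index != positive])
    return positive, negative


toy_docs = [
    ["neural", "retrieval", "model"],
    ["cooking", "pasta", "recipe"],
    ["football", "match", "report"],
]
toy_scores = bm25_scores(["neural", "retrieval"], toy_docs)
pos, neg = make_triplet(["neural", "retrieval"], toy_docs, random.Random(0))
```

Because the negative is sampled uniformly, it can occasionally be a relevant document, which is one reason such pseudo-labels are noisier than human relevance judgments.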
For evaluation, we present results in terms of mean average precision (MAP). Statistically significant differences between various models are determined using the paired t-test with p < 0.05.
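For reference, mean average precision can be computed as in the following Python sketch. It uses binary relevance and divides by the total number of relevant documents of each query; the document identifiers and function names are illustrative.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Average of precision values at the ranks of the relevant documents."""
    relevant: Set[str] = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Sequence[Tuple[Sequence[str], Iterable[str]]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


toy_runs: List[Tuple[Sequence[str], Iterable[str]]] = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(toy_runs)
```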

Baseline approaches
We make use of three categories of baselines for CLIR experiments.
1. Monolingual run (MON): a baseline with target language queries that are strictly parallel to source language queries.
2. Machine translation (MT): a baseline with target-language queries translated from source-language queries by a machine translation system.
3. Cross-language text representation models: baselines that rely on continuous space models for cross-language text representation.

Results and analysis
Comparisons to state-of-the-art
Table 2 lists the experimental results on the CLEF dataset for our model (the column OURS) and all baseline models. There are three data collections and three language pairs, amounting to nine cross-language retrieval tasks. Apart from the strong baselines MON and MT, our model shows the best overall performance among all CLIR strategies. Indeed, our model outperforms all continuous space baselines (i.e., S2Net, BAE and XCNN) with statistical significance in almost all cases. Our model is slightly below the strong MT baseline in most retrieval tasks, with only one degradation being significant, on 03(De-En). Furthermore, our model comes close to the monolingual baseline in all retrieval tasks, with all MAP ratios around or over 90%. In our experiments, we have not performed comparisons to CL-LSA (Littman et al., 1998) and its variant OPCA (Platt et al., 2010), because they have been consistently outperformed by other CLIR strategies by a large margin (Schauble and Sheridan, 1997; Nie, 2010; Vulić et al., 2011). Among all continuous space baselines, the most recent model XCNN shows the best performance. XCNN always outperforms the linear projection method S2Net with significance. It also significantly outperforms the non-linear model BAE in all cases. This is consistent with previous conclusions in (Gupta et al., 2017), due to the fact that XCNN learns task-specific representation for CLIR but the other models do not. Our model also tries to learn task-specific representation for CLIR, and it significantly outperforms XCNN in most cases according to the results in table 2. The reasons might be that (1) our method is modeled in a more effective adversarial learning framework, (2) we explicitly capture cross-language ranking signals in the embedding generator in addition to the monolingual ranking signals used in XCNN, and (3) our model can jointly capture the translation knowledge and the document ranking knowledge in a unified framework.

Variant of our model
Our model can be customized easily by altering the constraints that direct the representation learning process. Since the specificity of our model comes from the adversarial learning framework, which has never been investigated in CLIR, we remove the constraint $L_{adv}$ from the original model $M$ and obtain the variant $M_{adv}$. In this case, $M_{adv}$ can be optimized with the standard mini-batch gradient descent approach, without playing the minmax game. We repeat the above CLIR experiments with the same settings and report the retrieval results of $M_{adv}$ in table 3. The results show that, once the adversarial component is removed, $M_{adv}$ performs worse than the original model $M$ in all retrieval tasks, and the differences are significant in 5 out of 9 retrieval tasks. This demonstrates that learning the generator and the discriminator in a competing style within the adversarial learning framework leads to representation of higher quality, which eventually supports effective CLIR. If we compare the variant $M_{adv}$ with the XCNN model in table 2, we find that $M_{adv}$ still performs better than XCNN in most cases. Such a comparison implicitly indicates that the joint exploitation of the monolingual matching constraint, the cross-language matching constraint and the translation constraint in a single model is more effective than using them separately as in the XCNN model.

Conclusions
In this paper, we propose a novel text representation approach for CLIR based on the adversarial learning framework. The learning framework is implemented as an interplay between an embedding generator process and an adversarial discriminator process, which leads to a representation that is both language invariant and task specific. The embedding generator is learned such that it explicitly considers both cross-language and monolingual pairwise ranking signals. In this way, the learned embeddings benefit from both sources and are directly optimized for CLIR. To the best of our knowledge, this is the first time adversarial learning has been applied to CLIR. Experiments on various language pairs in the CLEF data collection show that our model is significantly better than other latent semantic models for CLIR. Indeed, our model approaches the performance of the machine translation and monolingual baselines.