Cross-language Learning with Adversarial Neural Networks: Application to Community Question Answering

We address the problem of cross-language adaptation for question-question similarity reranking in community question answering, with the objective to port a system trained on one input language to another input language given labeled training data for the first language and only unlabeled data for the second language. In particular, we propose to use adversarial training of neural networks to learn high-level features that are discriminative for the main learning task, and at the same time are invariant across the input languages. The evaluation results show sizable improvements for our cross-language adversarial neural network (CLANN) model over a strong non-adversarial system.


Introduction
Developing natural language processing (NLP) systems that can handle different input languages interchangeably is a challenging task; yet, such a setup is useful for many real-world applications. One expensive solution is to annotate data for each input language and then to train a separate system for each one. Another option, which can also be costly, is to translate the input, e.g., using machine translation (MT), and then to work monolingually in the target language (Hartrumpf et al., 2008; Lin and Kuo, 2010; Ture and Boschee, 2016). However, the machine-translated text can be of low quality, might lose some input signal, e.g., it can alter sentiment (Mohammad et al., 2016), or may not be really needed (Bouma et al., 2008; Pouran Ben Veyseh, 2016). Using a unified cross-language representation of the input is a third, less costly option, which allows any combination of input languages during both training and testing.
In this paper, we take this last approach, i.e., combining languages during both training and testing, and we study the problem of question-question similarity reranking in community Question Answering (cQA), when the input question can be either in English or in Arabic, and the questions it is compared to are always in English. We start with a simple language-independent representation based on cross-language word embeddings, which we input into a feed-forward multilayer neural network to classify pairs of questions, (English, English) or (Arabic, English), regarding their similarity.
Furthermore, we explore the question of whether adversarial training can be used to improve the performance of the network when we have some unlabeled examples in the target language. In particular, we adapt the Domain Adversarial Neural Network model of Ganin et al. (2016), which was originally used for domain adaptation, to our cross-language setting. To the best of our knowledge, this is novel for cross-language question-question similarity reranking, as well as for natural language processing (NLP) in general; moreover, we are not aware of any previous work on cross-language question reranking for community Question Answering.
In our setup, the basic task-solving network is paired with another network that shares the internal representation of the input and tries to decide whether the input example comes from the source (English) or from the target (Arabic) language. The training of this language discriminator network is adversarial with respect to the shared layers: gradient reversal during backpropagation makes the shared layers maximize the loss of the discriminator rather than minimize it. The main idea is to learn a high-level abstract representation that is discriminative for the main classification task, but is invariant across the input languages.
We apply this method to an extension of the SemEval-2016 Task 3, subtask B benchmark dataset for question-question similarity reranking (Nakov et al., 2016b). In particular, we hired professional translators to translate the original English questions to Arabic, and we further collected additional unlabeled questions in English, which we also got translated into Arabic. We show that using the unlabeled data for adversarial training allows us to improve the results by a sizable margin in both directions, i.e., when training on English and adapting the system with the Arabic unlabeled data, and vice versa. Moreover, the resulting performance is comparable to the best monolingual English systems at SemEval. We also compare our unsupervised model to a semi-supervised model, where we have some labeled data for the target language.
The remainder of this paper is organized as follows: Section 2 discusses some related work. Section 3 introduces our model for adversarial training for cross-language problems. Section 4 describes the experimental setup. Section 5 presents the evaluation results. Finally, Section 6 concludes and points to possible directions for future work.

Related Work
Below we discuss three relevant research lines: (a) adversarial training, (b) question-question similarity, and (c) cross-language learning.
Adversarial training of neural networks has shown a big impact recently, especially in areas such as computer vision, where generative unsupervised models have proved capable of synthesizing new images (Goodfellow et al., 2014; Radford et al., 2016; Makhzani et al., 2016). One crucial challenge in adversarial training is to find the right balance between the two components: the generator and the adversarial discriminator. Thus, several methods have been proposed recently to stabilize training (Metz et al., 2017; Arjovsky et al., 2017). Adversarial training has also been successful in training predictive models. More relevant to our work is the work of Ganin et al. (2016), who proposed domain adversarial neural networks (DANN) to learn discriminative but at the same time domain-invariant representations, with domain adaptation as a target. Here, we use adversarial training to learn task-specific representations in a cross-language setting, which is novel for this task, to the best of our knowledge.
Question-question similarity was part of Task 3 on cQA at SemEval-2016/2017 (Nakov et al., 2016b); there was also a similar subtask as part of SemEval-2016 Task 1 on Semantic Textual Similarity (Agirre et al., 2016). Question-question similarity is an important problem with application to question recommendation, question duplicate detection, community question answering, and question answering in general. Typically, it has been addressed using a variety of textual similarity measures. Some work has paid attention to modeling the question topic, which can be done explicitly, e.g., using a graph of topic terms (Cao et al., 2008), or implicitly, e.g., using an LDA-based topic language model that matches the questions not only at the term level but also at the topic level (Zhang et al., 2014). Another important aspect is syntactic structure, e.g., Wang et al. (2009) (Jeon et al., 2005; Zhou et al., 2011). Unlike that work, here we are interested in cross-language adaptation for question-question similarity reranking. The problem was studied by Martino et al. (2017) using cross-language kernels and deep neural networks; however, they used no adversarial training.

Adversarial Training for Cross-Language Problems
We demonstrate our approach for cross-language representation learning with adversarial training on a cross-lingual extension of the question-question similarity reranking subtask of SemEval-2016 Task 3 on community Question Answering. An example for the monolingual task is shown in Figure 1. We can see an original English input question $q$ and a list of several potentially similar questions $q_i$ from the Qatar Living forum, retrieved by a search engine. The original question (also referred to as a new question) asks about how to tip in Qatar. Question $q_1$ is relevant with respect to it as it asks the same thing, and so is $q_2$, which asks how much one should tip in a specific situation. However, $q_9$ and $q_{10}$ are irrelevant: the former asks about what to wear at business meetings, and the latter asks about how to tip a kind of person who does not normally receive tips.
In our case, the input question $q$ is in a different language (Arabic) than the language of the retrieved questions (English). The goal is to rerank a set of $K$ retrieved questions $\{q_k\}_{k=1}^{K}$ written in a source language (e.g., English) according to their similarity with respect to an input user question $q$ that comes in another (target) language, e.g., Arabic. For simplicity, henceforth we will use Arabic as target and English as source. However, in principle, our method generalizes to any source-target language pair.

Unsupervised Language Adaptation
We approach the problem as a classification task, where given a question pair $(q, q')$, the goal is to decide whether the retrieved question $q'$ is similar (i.e., relevant) to $q$ or not. Let $c \in \{0, 1\}$ denote the class label: 1 for similar, and 0 for not similar. We use the posterior probability $p(c = 1 \mid q, q', \theta)$ as a score for ranking all retrieved questions by similarity, where $\theta$ are the model parameters.
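To make the use of the posterior as a ranking score concrete, here is a minimal sketch. The question identifiers and probability values are made-up placeholders, not outputs of the model described here.

```python
def rerank(candidates, scores):
    """Return the candidates sorted by descending p(c = 1 | q, q')."""
    order = sorted(range(len(candidates)), key=lambda k: scores[k], reverse=True)
    return [candidates[k] for k in order]

# Hypothetical posterior scores for four retrieved questions.
retrieved = ["q1", "q2", "q9", "q10"]
posteriors = [0.91, 0.78, 0.12, 0.30]
ranking = rerank(retrieved, posteriors)
```

The classifier is only ever asked for a per-pair probability; the ranking is obtained by sorting, so any calibrated or uncalibrated monotone score would yield the same order.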
More formally, let $R_n = \{q_{n,k}\}_{k=1}^{K}$ denote the set of $K$ retrieved questions for a new question $q_n$. Note that the questions in $R_n$ are always in English. We consider a training scenario where we have labeled examples $D_S = \{q_n, q_{n,k}, c_{n,k}\}_{n=1}^{N}$ for English $q_n$, but we only have unlabeled examples $D_T = \{q_n, q_{n,k}\}_{n=N+1}^{M}$ for Arabic $q_n$, with $c_{n,k}$ denoting the class label for the pair $(q_n, q_{n,k})$. We want to train a cross-language model that can classify any test example $\{q_n, q_{n,k}\}$, where $q_n$ is in Arabic. This scenario is of practical importance, e.g., when an Arabic speaker wants to query the system in Arabic, and the database of related information is only in English.
Here, we adapt the idea for adversarial training for domain adaptation as proposed by Ganin et al. (2016). Figure 2 shows the architecture of our cross-language adversarial neural network (CLANN) model. The input to the network is a pair $(q, q')$, which is first mapped to fixed-length vectors $(z_q, z_{q'})$. To generate the underlying word embeddings, one can use existing tools such as word2vec (Mikolov et al., 2013) and monolingual data from the respective languages. Alternatively, one can use cross-language word embeddings, e.g., trained using the bivec model (Luong et al., 2015). The latter can yield better initialization, which could be crucial when the labeled data is too small to train the input representations with the end-to-end system.
The network then models the interactions between the input embeddings by passing them through two non-linear hidden layers, $h$ and $f$. Additionally, the network considers pairwise features $\phi(q, q')$ that go directly to the output layer, and also through the second hidden layer.
The following equations describe the transformations through the hidden layers:

$$h = g(U [z_q; z_{q'}]) \tag{1}$$
$$f = g(V [h; \phi(q, q')]) \tag{2}$$

where $[\cdot; \cdot]$ denotes concatenation of two column vectors, $U$ and $V$ are the weight matrices in the first and in the second hidden layer, and $g$ is a nonlinear activation function; we use rectified linear units or ReLU (Nair and Hinton, 2010). The pairwise features $\phi(q, q')$ encode different types of similarity between $q$ and $q'$, and task-specific properties that we describe later in Section 4. In our earlier work (Martino et al., 2017), we found it beneficial to feed them directly to the output layer as well as through a hidden-layer transformation. The non-linear transformation allows us to learn high-level abstract features from the raw similarity measures, while the adversarial training, as we describe below, will make these abstract features language-invariant.
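The two hidden-layer transformations can be sketched in a few lines of plain Python. This is an illustrative sketch under the assumption, taken from the definitions in the text, that the first layer applies $U$ to the concatenated question vectors and the second applies $V$ to $h$ concatenated with the pairwise features; the toy weights and dimensions are invented for the example.

```python
def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def hidden_layers(U, V, z_q, z_qp, phi):
    """Compute h from [z_q; z_q'] and f from [h; phi] (list concatenation)."""
    h = relu(matvec(U, z_q + z_qp))
    f = relu(matvec(V, h + phi))
    return h, f

# Toy example: 2-dimensional embeddings and one pairwise feature (all hypothetical).
U = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
V = [[1.0, 1.0, 1.0],
     [-1.0, 0.0, 0.0]]
h, f = hidden_layers(U, V, z_q=[1.0, -1.0], z_qp=[0.5, 0.5], phi=[0.2])
```

Because $\phi(q, q')$ enters the second layer alongside $h$, the representation $f$ mixes embedding-based and feature-based evidence before any language discrimination is applied.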
The output layer computes a sigmoid:

$$p(c = 1 \mid q, q', \theta) = \mathrm{sigm}(w^\top [f; \phi(q, q')]) \tag{3}$$

where $w$ are the output layer weights. We train the network by minimizing the negative log-probability of the gold labels:

$$L_c(\theta) = -\log p(c \mid q, q', \theta) \tag{4}$$

The network described so far learns abstract features through multiple hidden layers that are discriminative for the classification task, i.e., similar vs. non-similar. However, our goal is also to make these features invariant across languages. To this end, we add a language discriminator, another neural network that takes the internal representation of the network $f$ (see Equation 2) as input, and tries to discriminate between English and Arabic inputs, i.e., in our case, whether the input comes from $D_S$ or from $D_T$.
The language discriminator is again defined by a sigmoid function:

$$p(l = 1 \mid f, \omega) = \mathrm{sigm}(w_l^\top h_l) \tag{5}$$

where $l \in \{0, 1\}$ denotes the language of $q$ (1 for English, and 0 for Arabic), $w_l$ are the final layer weights of the discriminator, and $h_l = g(U_l f)$ defines the hidden layer of the discriminator with $U_l$ being the layer weights and $g$ being the ReLU activations.
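We use the negative log-probability as the discrimination loss:

$$L_l(\omega) = -\log p(l \mid f, \omega) \tag{6}$$

The overall training objective of the composite model can be written as follows:

$$L(\theta, \omega) = \sum_{n=1}^{N} L_c^n(\theta) - \lambda \left[ \sum_{n=1}^{N} L_l^n(\omega) + \sum_{n=N+1}^{M} L_l^n(\omega) \right] \tag{7}$$

where $\theta = \{U, V, w\}$, $\omega = \{U, V, w, U_l, w_l\}$, and the hyper-parameter $\lambda$ controls the relative strength of the two networks.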
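The following sketch shows how the two negative log-probability losses might be combined into a single scalar. It assumes the composite objective is the summed classification loss minus $\lambda$ times the summed discriminator loss, which is what the min-max description in the text implies; the pre-activation values are toy numbers.

```python
import math

def sigmoid(x):
    """Logistic function used by both output layers."""
    return 1.0 / (1.0 + math.exp(-x))

def nll(p_one, label):
    """Negative log-probability of a binary label, given p(label = 1)."""
    return -math.log(p_one if label == 1 else 1.0 - p_one)

def composite_objective(task_losses, language_losses, lam):
    """Summed task loss minus lam times the summed discriminator loss."""
    return sum(task_losses) - lam * sum(language_losses)

# Toy values (hypothetical): both networks output probability 0.5.
p_similar = sigmoid(0.0)
p_english = sigmoid(0.0)
L_c = nll(p_similar, 1)            # labeled English pair, gold label "similar"
L_l_source = nll(p_english, 1)     # the same pair, language label English
L_l_target = nll(p_english, 0)     # an unlabeled Arabic pair, language label Arabic
objective = composite_objective([L_c], [L_l_source, L_l_target], lam=0.5)
```

Note that the unlabeled Arabic pair contributes only a discriminator term, since no class label is available for it.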
In training, we look for parameter values that satisfy a min-max optimization criterion as follows:

$$\theta^* = \operatorname*{argmin}_{U, V, w} \; \max_{U_l, w_l} \; L(U, V, w, U_l, w_l) \tag{8}$$

which involves a maximization (gradient ascent) with respect to $\{U_l, w_l\}$ and a minimization (gradient descent) with respect to $\{U, V, w\}$. Note that maximizing $L(U, V, w, U_l, w_l)$ with respect to $\{U_l, w_l\}$ is equivalent to minimizing the discriminator loss $L_l(\omega)$ in Equation (6), which aims to improve the discrimination accuracy. In other words, when put together, the updates of the shared parameters $\{U, V, w\}$ for the two classifiers work adversarially with respect to each other. In our gradient descent training, the above min-max optimization is performed by reversing the gradients of the language discrimination loss $L_l(\omega)$ when they are backpropagated to the shared layers. As shown in Figure 2, the gradient reversal is applied to layer $f$ and also to the layers that come before it.
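Gradient reversal can be pictured as a layer that is the identity in the forward pass and multiplies incoming gradients by $-\lambda$ in the backward pass. The class below is a hand-written illustration of that idea, not the implementation used in the experiments; the class name, the helper `sgd_step`, and the numeric values are all hypothetical.

```python
class GradientReversal:
    """Identity going forward; scales gradients by -lam going backward."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        # The discriminator sees the shared representation unchanged.
        return list(features)

    def backward(self, grad_from_discriminator):
        # Flip the sign so that descent on the shared layers is ascent on L_l.
        return [-self.lam * g for g in grad_from_discriminator]

def sgd_step(params, grads, lr):
    """One plain gradient-descent update."""
    return [p - lr * g for p, g in zip(params, grads)]

grl = GradientReversal(lam=0.5)
f = [0.3, -1.2]
seen_by_discriminator = grl.forward(f)
grad_wrt_f = [1.0, -2.0]                      # hypothetical dL_l/df
reversed_grad = grl.backward(grad_wrt_f)      # what the shared layers receive
shared = sgd_step([0.0, 0.0], reversed_grad, lr=0.1)
```

With this construction a single backward pass suffices: the discriminator parameters receive the ordinary gradient of $L_l$, while everything below the reversal point receives its negation.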
Our optimization setup is related to the training method of Generative Adversarial Networks or GANs (Goodfellow et al., 2014), where the goal is to build deep generative models that can generate realistic images. The discriminator in GANs tries to distinguish real images from model-generated images, and thus the training attempts to minimize the discrepancy between the two image distributions, i.e., empirical as in the training data vs. model-based as produced by the generator. When backpropagating to the generator network, Goodfellow et al. (2014) consider a slight variation of the reverse gradients with respect to the discriminator loss. In particular, if $\rho$ is the discriminator probability, instead of reversing the gradients of $\log(1 - \rho)$, they use the gradients of $\log \rho$. Reversing the gradient is a different way to achieve the same goal.
Training. Algorithm 1 shows pseudocode for the algorithm we use to train our model, which is based on stochastic gradient descent (SGD). We first initialize the model parameters by using samples from the Glorot-uniform distribution (Glorot and Bengio, 2010). We then form minibatches of size $b$ by randomly sampling $b/2$ labeled examples from $D_S$ and $b/2$ unlabeled examples from $D_T$. For the labeled instances, both the $L_c(\theta)$ and the $L_l(\omega)$ losses are active, while only the $L_l(\omega)$ loss is active for the unlabeled instances.

Algorithm 1:
1. Initialize the model parameters;
2. repeat
   (a) Randomly sample $b/2$ labeled examples from $D_S$
   (b) Randomly sample $b/2$ unlabeled examples from $D_T$
   (c) Compute $L_c(\theta)$ and $L_l(\omega)$
   (d) Take a gradient step for $\frac{2}{b} \nabla_\theta L_c(\theta)$
   (e) Take a gradient step for $\frac{2\lambda}{b} \nabla_{U_l, w_l} L_l(\omega)$
   // Gradient reversal
   (f) Take a gradient step for $-\frac{2\lambda}{b} \nabla_\theta L_l(\omega)$
   until convergence;

As mentioned above, the main challenge in adversarial training is to balance the two components of the network. If one component becomes smarter, its loss to the shared layer becomes useless, and the training fails to converge (Arjovsky et al., 2017). Likewise, if one component gets weaker, its loss overwhelms that of the other, causing training to fail. In our experiments, the language discriminator was weaker. This could be due to the use of cross-language word embeddings to generate input embedding representations for $q$ and $q'$. To balance the two components, we would want the error signals from the discriminator to be fairly weak initially, with full power unleashed only as the classification errors start to dominate. We follow the weighting schedule proposed by Ganin et al. (2016, p. 21), who initialize $\lambda$ to 0, and then change it gradually to 1 as training progresses. That is, we start training the task classifier first, and we gradually add the discriminator's loss.
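As a rough illustration of the two mechanical pieces of training, the sketch below implements balanced minibatch sampling and a $\lambda$ schedule that rises from 0 towards 1. The schedule uses the form $\lambda_p = 2 / (1 + \exp(-\gamma p)) - 1$ with $\gamma = 10$ described by Ganin et al. (2016); that the experiments here used exactly these constants is an assumption, and the function names and example data are hypothetical.

```python
import math
import random

def lambda_schedule(progress, gamma=10.0):
    """Adversarial weight for training progress in [0, 1]; gamma=10 is assumed."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

def sample_minibatch(labeled, unlabeled, b, rng):
    """Draw b/2 labeled source examples and b/2 unlabeled target examples."""
    half = b // 2
    return rng.sample(labeled, half), rng.sample(unlabeled, half)

rng = random.Random(0)
labeled_pool = [("en_q%d" % i, "en_rel%d" % i, i % 2) for i in range(10)]
unlabeled_pool = [("ar_q%d" % i, "en_rel%d" % i) for i in range(10)]
source_batch, target_batch = sample_minibatch(labeled_pool, unlabeled_pool, b=8, rng=rng)

lam_start = lambda_schedule(0.0)   # the discriminator signal is switched off
lam_mid = lambda_schedule(0.5)
lam_end = lambda_schedule(1.0)     # close to full strength
```

Starting at zero means the first updates train only the task classifier, which matches the stated intention of adding the discriminator's loss gradually.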

Semi-supervised Extension
Above we considered an unsupervised adaptation scenario, where we did not have any labeled instance for the target language, i.e., when the new question $q_n$ is in Arabic. However, our method can be easily generalized to a semi-supervised setting, where we have access to some labeled instances in the target language, $D_{T^*} = \{q_n, R_n, c_n\}_{n=M+1}^{L}$. In this case, each minibatch during training is formed by labeled instances from both $D_S$ and $D_{T^*}$, and unlabeled instances from $D_T$.

Table 1: Performance on the test set for our cross-language systems, with and without adversarial adaptation (CLANN and FNN, respectively), and for both language directions (en-ar and ar-en). The prime notation under the Discrim. column represents using a counterpart from the unlabeled data.

Experimental Setting
In this section, we describe the datasets we used, the generation of the input embeddings, the nature of the pairwise features, and the general training setup of our model.

Datasets
SemEval-2016 Task 3 (Nakov et al., 2016b) provides 267 input questions for training, 50 for development, and 70 for testing, and ten times as many potentially related questions retrieved by an IR engine for each input question: 2,670, 500, and 700, respectively. Based on this data, we simulated a cross-language setup for question-question similarity reranking. We first got the 387 original train+dev+test questions translated into Arabic by professional translators. Then, we used these Arabic questions as an input with the goal to rerank the ten related English questions. As an example, this is the Arabic translation of the original English question from Figure 1:
We further collected 221 additional original questions and 1,863 related questions as unlabeled data, and we got the 221 English questions translated to Arabic.

Cross-language Embeddings
We used the TED (Abdelali et al., 2014) and the OPUS parallel Arabic-English bi-texts (Tiedemann, 2012) to extract a bilingual dictionary, and to learn cross-language embeddings. We chose these bi-texts as they are conversational (TED talks and movie subtitles, respectively), and thus informal, which is close to the style of our community question answering forum.
We trained Arabic-English cross-language word embeddings from the concatenation of these bi-texts using bivec (Luong et al., 2015), a bilingual extension of word2vec, which has achieved excellent results on semantic tasks close to ours (Upadhyay et al., 2016). In particular, we trained 200-dimensional vectors using the parameters described by Upadhyay et al. (2016), with a context window of size 5 and iterating for 5 epochs. We then computed the representation for a question by averaging the embedding vectors of the words it contains. Using these cross-language embeddings allows us to compare directly representations of an Arabic or an English input question $q$ to English potentially related questions $q_i$.
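Averaging word vectors into a question representation is simple enough to sketch directly. The lookup table below is a toy stand-in for the bivec embeddings, and skipping out-of-vocabulary words (with a zero vector as the fallback) is an assumption, since the text does not say how unknown words are handled.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumed fallback: all-zero vector when no token is known.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy 2-dimensional "cross-language" table (hypothetical values).
toy_embeddings = {
    "tip": [1.0, 0.0],
    "qatar": [0.0, 1.0],
}
z_q = average_embedding(["tip", "in", "qatar"], toy_embeddings, dim=2)
z_empty = average_embedding(["unknown"], toy_embeddings, dim=2)
```

Because Arabic and English words share one embedding space, the same function can be applied to questions in either language.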

Pairwise Features
In addition to the embeddings, we also used some pairwise features that model the similarity or some other relation between the input question and the potentially related questions. These features were proposed in the previous literature for the question-question similarity problem, and they are necessary to obtain state-of-the-art results.
We further used as features the cosine similarity between question embeddings. In particular, we used (i) 300-dimensional pre-trained Google News embeddings from Mikolov et al. (2013), (ii) 100-dimensional embeddings trained on the entire Qatar Living forum (Mihaylov and Nakov, 2016), and (iii) 25-dimensional Stanford neural parser embeddings (Socher et al., 2013). The latter are produced by the parser internally, as a by-product.
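A cosine-similarity feature between two question vectors can be computed as in the sketch below. Returning 0.0 when either vector has zero norm is an assumed convention for the degenerate case, and the example vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention when a question has no known words
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])
degenerate = cosine([0.0, 0.0], [1.0, 1.0])
```

One such value per embedding type would then be appended to the pairwise feature vector.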
Furthermore, we computed various task-specific features, most of them introduced in the 2015 edition of the SemEval task (Nicosia et al., 2015). This includes some question-level features: (1) number of URLs/images/emails/phone numbers; (2) number of tokens/sentences; (3) average number of tokens; (4) type/token ratio; (5) number of nouns/verbs/adjectives/adverbs/pronouns; (6) number of positive/negative smileys; (7) number of single/double/triple exclamation/interrogation symbols; (8) number of interrogative sentences (based on parsing); (9) number of words that are not in WORD2VEC's Google News vocabulary. Also, some question-question pair features: (10) count ratio in terms of sentences/tokens/nouns/verbs/adjectives/adverbs/pronouns; (11) count ratio of words that are not in WORD2VEC's Google News vocabulary. Finally, we also have one meta feature: (12) reciprocal rank of the related question in the list of related questions.
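Two of the listed features, the type/token ratio (4) and the reciprocal rank meta feature (12), are simple enough to illustrate. The sketch assumes whitespace tokenization with lowercasing and 1-based ranks; the actual preprocessing is not specified in the text, so these choices are hypothetical.

```python
def type_token_ratio(text):
    """Distinct tokens divided by total tokens (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def reciprocal_rank_feature(rank):
    """Reciprocal of the 1-based position in the retrieved list."""
    if rank < 1:
        raise ValueError("rank must be 1-based")
    return 1.0 / rank

ttr = type_token_ratio("how much to tip and when to tip")
rr_top = reciprocal_rank_feature(1)
rr_fourth = reciprocal_rank_feature(4)
```

The reciprocal rank carries the search engine's original ordering into the classifier, so the reranker can fall back on it when the other features are uninformative.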

Model settings
We trained our CLANN model by optimizing the objective in Equation (7) using ADAM (Kingma and Ba, 2015) with default parameters. For this, we used up to 200 epochs. In order to avoid overfitting, we used dropout (Srivastava et al., 2014) of hidden units, $\ell_2$ regularization on weights, and early stopping by observing MAP on the development dataset: if MAP did not increase for 15 consecutive epochs, we exited with the best model recorded so far. We optimized the values of the hyper-parameters using grid search: for minibatch ($b$) size in {8, 12, 16}, for dropout ($d$) rate in {0.2, 0.3, 0.4, 0.5}, for $h$ layer size in {10, 15, 20}, for $f$ layer size in {75, 100, 125}, and for $\ell_2$ strength in {0.01, 0.02, 0.03}. The fifth column in Table 1 shows the optimal hyper-parameter setting for the different models. Finally, we used the best model as found on the development dataset for the final evaluation on the test dataset.
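The grid and the early-stopping rule can be sketched as follows. The grid values are the ones listed in the text; the function `early_stopping` and the sequence of development MAP scores are hypothetical stand-ins for an actual training loop.

```python
from itertools import product

GRID = {
    "b": [8, 12, 16],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "h": [10, 15, 20],
    "f": [75, 100, 125],
    "l2": [0.01, 0.02, 0.03],
}
# Every combination of the five hyper-parameters.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]

def early_stopping(dev_map_per_epoch, patience=15, max_epochs=200):
    """Return (best_epoch, best_map, epochs_run) for a stream of dev MAP scores."""
    best_map, best_epoch, stale, epochs_run = float("-inf"), 0, 0, 0
    for epoch, score in enumerate(dev_map_per_epoch, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if score > best_map:
            best_map, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_map, epochs_run

# Made-up development MAP curve: improves twice, then plateaus.
curve = [70.0, 71.0] + [70.5] * 20
best_epoch, best_map, epochs_run = early_stopping(curve)
```

The full grid contains 3 x 4 x 3 x 3 x 3 = 324 configurations, each of which would be trained with this stopping rule and compared by its best development MAP.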

Evaluation Results
Below we present the experimental results for the unsupervised and semi-supervised language adaptation settings. We compare our cross-language adversarial network (CLANN) to a feed-forward neural network (FNN) that has no adversarial part.

Unsupervised Adaptation Experiments
Table 1 shows the main results for our cross-language adaptation experiments. Rows 1-2 present the results when the target language is Arabic and the system is trained with English input. Rows 3-4 show the reverse case, i.e., adaptation into English when training on Arabic. FNN stands for feed-forward neural network, and it corresponds to the upper part of the network in Figure 2, excluding the language discriminator. CLANN is the full cross-language adversarial neural network, training the discriminator with English inputs paired with random Arabic related questions from the unlabeled dataset. We show three ranking-oriented evaluation measures that are standard in the field of Information Retrieval: mean average precision (MAP), mean reciprocal rank (MRR), and average recall (AvgRec). We computed them using the official scorer from SemEval-2016 Task 3. Similarly to that task, we consider MAP as the main evaluation metric. The table also presents, for reproducibility, the values of the neural network hyper-parameters after tuning (in the fifth column).
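For readers unfamiliar with the ranking measures, the sketch below computes textbook versions of MAP and MRR from ranked lists of binary relevance labels. It is not the official SemEval scorer, which may differ in details such as cut-offs or the treatment of queries without relevant items; the two example rankings are invented.

```python
def average_precision(labels):
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def first_hit_reciprocal_rank(labels):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Two hypothetical queries; 1 marks a relevant retrieved question.
rankings = [[1, 0, 1], [0, 1, 0]]
MAP = mean(average_precision(r) for r in rankings)
MRR = mean(first_hit_reciprocal_rank(r) for r in rankings)
```

Both measures reward placing relevant questions early, but MAP accounts for all relevant items in the list while MRR looks only at the first one.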
We can see that the MAP score for FNN with Arabic target is 75.28. When doing the adversarial adaptation with the unlabeled Arabic examples (CLANN), the MAP score is boosted to 76.64 (+1.36 points). Going in the reverse direction, with English as the target, yields very comparable results: MAP goes from 75.32 to 76.70 (+1.38). To put these results into perspective, Table 2 shows the results for the top-2 best-performing systems from SemEval-2016 Task 3, which used a monolingual English setting. We can see that our FNN approach based on cross-language input embeddings is already not far from the best systems. Yet, when we consider the full adversarial network, in either direction, we get performance that is on par with the best, in all metrics.
We conclude that the adversarial component in the network does the expected job, and improves the performance by focusing on the language-independent features in the representation layer. The scatter plots in Figure 3 are computed by projecting the representation layer vectors of the first 500 test examples into two dimensions using t-SNE visualization (van der Maaten and Hinton, 2008). The first 250 are taken with Arabic input (blue), the second 250 are taken with English input (red). 0-1 are the class labels (similar vs. non-similar). The top plot corresponds to CLANN training with English and adapting with Arabic examples, while the second one covers the opposite direction. The plots look as expected. CLANN really mixes the blue and the red examples, as the adversarial part of the network pushes for learning shared abstract features that are language-insensitive. At the same time, the points form clusters with clear majorities of 0s or 1s, as the supervised part of the network learns how to classify them in these classes.

Semi-supervised Experiments
We now study the semi-supervised scenario when we also have some labeled data from the target language, i.e., where the original question q is in the target language. This can be relevant in practical situations, as sometimes we might be able to annotate some data in the target language. It is also an exploration of training with data in multiple languages all together.
To simulate this scenario, we split the training set in two halves. We train with one half as the source language, and we use the other half with the target language as extra supervised data. At the same time, we also use the unlabeled examples as before. We introduced the semi-supervised model in Section 3.2, which is a straightforward adaptation of the CLANN model.
Table 3 shows the main results of our cross-language semi-supervised experiments. The table is split into two blocks by source and target language (en-ar or ar-en). We also use the same notation as in Table 1. The suffixes -unsup and -semisup indicate whether CLANN is trained in unsupervised mode (same as in Table 1) or in semi-supervised mode. The language discriminator in this setting is trained to discriminate between labeled source and labeled target examples, and labeled source and unlabeled target examples. This is indicated in the Discrim. column using asterisk and prime symbols, respectively.

Table 3: Semi-supervised experiments, when training on half of the training dataset, and evaluating on the full testing dataset. Shown is the performance of our cross-language models, with and without adversarial adaptation (i.e., using CLANN and FNN, respectively), using the unsupervised and the semi-supervised settings, and for both language directions: English-Arabic and Arabic-English. The prime notation in the Discrim. column represents choosing a counterpart for the discriminator from the unlabeled data. The asterisks stand for choosing an unpaired labeled example from the other half of the training dataset.
There are several interesting observations that we can make about Table 3. First, since here we are training with only 50% of the original training data, both FNN and CLANN-unsup yield lower results compared to before, i.e., compared to Table 1; this is to be expected. However, the unsupervised adaptation, i.e., using the CLANN-unsup model, still yields improvements over the FNN model by a sizable margin, according to all three evaluation measures. When we also train using the additional labeled examples in the target language, i.e., using the CLANN-semisup model, the results are boosted again to a final MAP score that is very similar to what we had obtained before with the full source-language training dataset. In the English into Arabic adaptation, the MAP score jumps from 74.69 to 76.65 (+1.96 points) when going from the FNN to the CLANN-semisup model, the MRR score goes from 83.79 to 84.52 (+0.73), and the AvgRec score is boosted from 88.16 to 90.84 (+2.68). The results in the opposite adaptation direction, i.e., from Arabic into English, follow a very similar pattern.
These results demonstrate the effectiveness and the flexibility of our general adversarial training framework within our CLANN architecture when applied to a cross-language setting for question-question similarity, taking advantage of the unlabeled examples in the target language (i.e., when using unsupervised adaptation) and also taking advantage of any labeled examples in the target language that we may have at our disposal (i.e., when using semi-supervised training with input examples in the two languages simultaneously).

Conclusion
We have studied the problem of cross-language adaptation for the task of question-question similarity reranking in community question answering, when the input question can be either in English or in Arabic, with the objective to port a system trained on one input language to another input language given labeled data for the source language and only unlabeled data for the target language. We used a discriminative adversarial neural network, which we trained to learn task-specific representations directly. This is novel in a cross-language setting, and we have shown that it works quite well. The evaluation results have shown sizable improvements over a strong neural network model that uses simple projection with cross-language word embeddings.
In future work, we want to extend the present research in several directions. For example, we would like to start with monolingual word embeddings and to try to learn the shared cross-language representation directly as part of the end-to-end training of our neural network. We further plan to try LSTM and CNN for generating the initial representation of the input text (instead of simple averaging of word embeddings). We also want to experiment with more than two languages at a time. Another interesting research direction we want to explore is to try to adapt our general CLANN framework to other tasks, e.g., to answer ranking in community Question Answering (Joty et al., 2016; Nakov et al., 2016a) in a cross-language setting, as well as to cross-language representation learning for words and sentences.