Contrastive Language Adaptation for Cross-Lingual Stance Detection

We study cross-lingual stance detection, which aims to leverage labeled data in one language to identify the relative perspective (or stance) of a given document with respect to a claim in a different target language. In particular, we introduce a novel contrastive language adaptation approach applied to memory networks, which ensures accurate alignment of stances in the source and target languages, and can effectively deal with the challenge of limited labeled data in the target language. The evaluation results on public benchmark datasets and comparison against current state-of-the-art approaches demonstrate the effectiveness of our approach.


Introduction
The rise of social media has enabled the phenomenon of "fake news," which can target specific individuals and be used for deceptive purposes (Lazer et al., 2018; Vosoughi et al., 2018). As manual fact-checking is a time-consuming and tedious process, computational approaches have been proposed as a possible alternative (Popat et al., 2017; Wang, 2017; Mihaylova et al., 2018), based on information sources such as social media (Ma et al., 2017), Wikipedia (Thorne et al., 2018), and knowledge bases (Huynh and Papotti, 2018). Fact-checking is a multi-step process (Vlachos and Riedel, 2014): (i) checking the reliability of media sources, (ii) retrieving potentially relevant documents from reliable sources as evidence for each target claim, (iii) predicting the stance of each document with respect to the target claim, and finally (iv) making a decision based on the stances from (iii) for all documents from (ii).
Here, we focus on stance detection, which aims to identify the relative perspective of a document with respect to a claim, typically modeled using labels such as agree, disagree, discuss, and unrelated.
Current approaches to stance detection (Bar-Haim et al., 2017; Dungs et al., 2018; Kochkina et al., 2018; Sobhani et al., 2017) are well-studied in mono-lingual settings, in particular for English, but less attention has been paid to other languages and cross-lingual settings. This is partially due to domain differences and to the lack of training data in other languages. We aim to bridge this gap by proposing a cross-lingual model for stance detection. Our model leverages resources of a source language (e.g., English) to train a model for a target language (e.g., Arabic). Furthermore, we propose a novel contrastive language adaptation approach that effectively aligns samples with similar or dissimilar stances across source and target languages using task-specific loss functions. We apply our language adaptation approach to memory networks (Sukhbaatar et al., 2015), which have been found effective for mono-lingual stance detection.
Our model can explain its predictions about stances of documents against claims in a different/target language by extracting relevant text snippets from the documents of the target language as evidence. We use evidence extraction as a measure to evaluate the transferability of our model. This is because more accurate evidence extraction indicates that the model can better learn semantic relations between claims and pieces of evidence, and consequently can better transfer knowledge to the target language.
The contributions of this paper are summarized as follows:
• We propose a novel language adaptation approach based on contrastive stance alignment that aligns the class labels between source and target languages for effective cross-lingual stance detection.
• Our model is able to extract accurate text snippets as evidence to explain its predictions in the target language (results are in Section 4.2).
• To the best of our knowledge, this is the first work on cross-lingual stance detection.
We conducted our experiments on English (as source language) and Arabic (as target language). In particular, we used the Fake News Challenge dataset (Hanselowski et al., 2018) as source data and an Arabic benchmark dataset as target data. The evaluation results show absolute improvements of 2.7 and 4.0 in terms of macro-F1 and weighted accuracy for stance detection over the current state-of-the-art mono-lingual baseline, and absolute improvements of 11.4, 14.9, 16.1, 12.9, and 13.1 points in terms of precision at ranks 1-5, respectively, for extracting evidence snippets. Furthermore, a key finding in our investigation is that, in contrast to other tasks (Devlin et al., 2019; Peters et al., 2018), pre-training with large amounts of source data is less effective for cross-lingual stance detection. We show that this is because pre-training can considerably bias the model toward the source language.

Method
Assume that we are given a training dataset for a source language, $D^s$, which contains a set of triplets as follows: $D^s = \{((c_i^s, d_i^s), y_i^s)\}_{i=1}^{N}$, where $N$ is the number of source samples, $(c_i^s, d_i^s)$ is a pair of claim $c_i^s$ and document $d_i^s$, and $y_i^s \in Y$, $Y = \{\text{agree}, \text{disagree}, \text{discuss}, \text{unrelated}\}$, is the corresponding label indicating the stance of the document with respect to the claim. In addition, we are given a very small training dataset for the target language, $D^t = \{((c_i^t, d_i^t), y_i^t)\}_{i=1}^{M}$, where $M$ ($M \ll N$) is the number of target samples and $(c_i^t, d_i^t)$ is a pair of claim and document in the target language with stance label $y_i^t \in Y$. In reality, (i) the size of the target dataset is very small, (ii) claims and documents in the source and target languages are from different domains, and (iii) the only commonality between the source and target datasets is in their stance labels, i.e., $y_i^s, y_i^t \in Y$. We develop a language adaptation approach to effectively use the commonality between the source and the target datasets in their label space and to deal with the limited size of the target training data. We apply our language adaptation approach to end-to-end memory networks (Sukhbaatar et al., 2015) for cross-lingual stance detection.
We use memory networks as they have achieved state-of-the-art performance for mono-lingual stance detection. However, our language adaptation approach can be applied to any other type of neural network. The architecture of our cross-lingual stance detection model is shown in Figure 1. It has two main components: (i) Memory Networks, indicated with two dashed boxes for the source and the target languages, and (ii) the Contrastive Language Adaptation component.
In what follows, we first explain our memory network model for cross-lingual stance detection (Section 2.1) and then present our contrastive language adaptation approach (Section 2.2).

Memory Networks
Memory networks are designed to remember past information (Sukhbaatar et al., 2015) and have been successfully applied to NLP tasks ranging from dialog (Bordes et al., 2017) to question answering (Xiong et al., 2016) and mono-lingual stance detection. They include components that can potentially use different learning models and inference strategies. Our source and target memory networks follow the same architecture, as depicted in Figure 1. A memory network consists of six components. The network takes as input a document d and a claim c and encodes them into the input space I. These representations are stored in the memory component M for future processing. The relevant parts of the input are identified in the inference component F, and used by the generalization component G to update the memory M. Finally, the output component O generates an output from the updated memory, and encodes it to a desired format in the response component R using a prediction function, e.g., softmax for classification tasks. We elaborate on these components below.
Input representation component I: It encodes documents and claims into corresponding representations. Each document $d$ is divided into a sequence of paragraphs $X = (x_1, \ldots, x_l)$, where each $x_j$ is encoded as $m_j$ using an LSTM network, and as $n_j$ using a CNN; these representations are stored in the memory component M. Note that while LSTMs are designed to capture and memorize their inputs (Tan et al., 2016), CNNs emphasize the local interaction between individual words in sequences, which is important for obtaining good representations (Kim, 2014). Thus, our I component uses both LSTM and CNN representations. It also uses a separate LSTM and CNN, with their own parameters, to represent each input claim $c$ as $c_{lstm}$ and $c_{cnn}$, respectively.
We consider each paragraph as a single piece of evidence because a paragraph usually represents a coherent argument, unified under one or more interrelated topics. We thus use the terms paragraph and evidence interchangeably.
Inference component F: Our inference component computes LSTM- and CNN-based similarity between each claim $c$ and evidence $x_j$ as follows:
$$P_j^{lstm} = c_{lstm}^{\top} \times M \times m_j, \quad P_j^{cnn} = c_{cnn}^{\top} \times M' \times n_j, \quad \forall j,$$
where $P_j^{lstm}$ and $P_j^{cnn}$ indicate claim-evidence similarity based on LSTM and CNN respectively, $c_{lstm} \in \mathbb{R}^{q}$ and $m_j \in \mathbb{R}^{d}$ are LSTM representations of $c$ and $x_j$ respectively, $c_{cnn} \in \mathbb{R}^{q'}$ and $n_j \in \mathbb{R}^{d'}$ are the corresponding CNN representations, and $M \in \mathbb{R}^{q \times d}$ and $M' \in \mathbb{R}^{q' \times d'}$ are similarity matrices trained to map claims and paragraphs into the same space with respect to their LSTM and CNN representations. The rationale behind using these similarity matrices is that, in the memory network, we seek a transformation of the input claim, i.e., $c^{\top} \times M$, in order to obtain the closest evidence to the claim.
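The bilinear claim-evidence similarity described above can be illustrated with a minimal pure-Python sketch. The function names (`bilinear_similarity`, `rank_evidence`) and the toy vectors are hypothetical and chosen only for illustration; in the model, the similarity matrix is learned rather than fixed.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def bilinear_similarity(claim: Vector, sim_matrix: Matrix, evidence: Vector) -> float:
    """Return claim^T x sim_matrix x evidence for a q-dim claim,
    a q x d similarity matrix, and a d-dim evidence vector."""
    # Transform the claim into the evidence space: (claim^T M) has d entries.
    projected = [
        sum(claim[i] * sim_matrix[i][k] for i in range(len(claim)))
        for k in range(len(evidence))
    ]
    # A dot product with the evidence representation gives a scalar similarity.
    return sum(p * e for p, e in zip(projected, evidence))


def rank_evidence(claim: Vector, sim_matrix: Matrix,
                  paragraphs: List[Vector]) -> Tuple[List[float], List[int]]:
    """Score every paragraph against the claim and return (scores, ranking)."""
    scores = [bilinear_similarity(claim, sim_matrix, m_j) for m_j in paragraphs]
    order = sorted(range(len(paragraphs)), key=lambda j: scores[j], reverse=True)
    return scores, order


# Toy example with q = 2 and d = 3 (illustrative values, not learned parameters).
claim = [1.0, 0.0]
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
paragraphs = [[0.2, 0.5, 0.1], [0.9, 0.1, 0.3]]
scores, order = rank_evidence(claim, M, paragraphs)
```

Here the second paragraph receives the higher score, so it is ranked first as the closest evidence to the claim.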
Additionally, we compute another semantic similarity vector, $P_j^{tfidf}$, by applying cosine similarity between the TF.IDF (Spärck Jones, 2004) representations of $x_j$ and $c$. This is particularly useful for stance detection as it can help filter out unrelated pieces of evidence.
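As a rough illustration of this TF.IDF similarity, the sketch below computes cosine similarity between a claim and each paragraph using only the standard library. The smoothed IDF formula, the whitespace tokenization, and the example sentences are assumptions made for the sketch; the exact weighting scheme used in the experiments is not specified in the text.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(texts: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (token -> weight) for tokenized texts."""
    n = len(texts)
    doc_freq = Counter(tok for toks in texts for tok in set(toks))
    # Assumed smoothed IDF variant: log((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()} for toks in texts]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical claim and paragraphs, tokenized on whitespace.
claim = "vaccines cause autism".split()
paragraphs = [
    "studies show vaccines do not cause autism".split(),
    "the football match ended in a draw".split(),
]
vectors = tfidf_vectors([claim] + paragraphs)
p_tfidf = [cosine(vectors[0], vec) for vec in vectors[1:]]
```

The second paragraph shares no tokens with the claim, so its similarity is exactly zero, which is the behavior that lets this signal filter out unrelated evidence.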
Memory M and Generalization G components: Our memory component stores representations and the generalization component improves their quality by filtering out unrelated evidence. For example, the LSTM representations of paragraphs, $m_j$, $\forall j$, are updated using the claim-evidence similarity $P_j^{tfidf}$ as follows: $m_j = m_j \odot P_j^{tfidf}$, $\forall j$. This transformation helps filter out evidence that is unrelated to the claims. The updated $m_j$, in conjunction with $c_{lstm}$, are used by the inference component F to compute $P_j^{lstm}$, $\forall j$, as explained above. Then, $P_j^{lstm}$ are in turn used to update the CNN representations in memory as follows: $n_j = n_j \odot P_j^{lstm}$, $\forall j$. Finally, the updated $n_j$ and $c_{cnn}$ are used to compute $P_j^{cnn}$.
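The chain of updates in this paragraph can be traced with a small sketch. To stay self-contained, it assumes identity similarity matrices, so each similarity reduces to a plain dot product; the function name `generalize` and all values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def scale(vec: Vector, weight: float) -> Vector:
    """Multiply every entry of a representation by a scalar similarity."""
    return [weight * v for v in vec]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def generalize(c_lstm: Vector, c_cnn: Vector, m: List[Vector], n: List[Vector],
               p_tfidf: List[float]) -> Tuple[List[Vector], List[Vector], List[float], List[float]]:
    """Apply the update chain: tf-idf -> LSTM memory -> CNN memory."""
    # Step 1: damp LSTM memories of paragraphs with low tf-idf similarity.
    m_upd = [scale(m_j, p) for m_j, p in zip(m, p_tfidf)]
    # Step 2: LSTM-based similarity from the updated memories
    # (identity similarity matrix assumed for this sketch).
    p_lstm = [dot(c_lstm, m_j) for m_j in m_upd]
    # Step 3: use the LSTM-based similarity to update the CNN memories.
    n_upd = [scale(n_j, p) for n_j, p in zip(n, p_lstm)]
    # Step 4: CNN-based similarity from the updated CNN memories.
    p_cnn = [dot(c_cnn, n_j) for n_j in n_upd]
    return m_upd, n_upd, p_lstm, p_cnn


# Two paragraphs; the second has zero tf-idf similarity to the claim.
m_upd, n_upd, p_lstm, p_cnn = generalize(
    c_lstm=[1.0, 0.0],
    c_cnn=[0.0, 1.0],
    m=[[1.0, 1.0], [2.0, 2.0]],
    n=[[1.0, 2.0], [3.0, 4.0]],
    p_tfidf=[0.5, 0.0],
)
```

In this toy run, the paragraph with zero TF.IDF similarity is zeroed out in both memories, so it contributes nothing to the later similarity scores.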
Output representation component O: This component computes the output of the memory M by concatenating the average vector of the updated $n_j$ with the maximum and average of the claim-evidence similarity vectors $P_j^{tfidf}$, $P_j^{lstm}$ and $P_j^{cnn}$. The maximum helps to identify parts of documents that are most similar to claims, while the average estimates the overall document-claim similarity.
Response generation component R: This component computes the final stance of a document with respect to a claim. For this, the output of component O is concatenated with $c_{lstm}$ and $c_{cnn}$ and fed into a softmax to predict the stance of the document with respect to the claim.
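The output and response components can be sketched together as follows. The single linear layer before the softmax, the feature ordering, and the toy weights are assumptions for illustration; the text only states that the concatenated vector is fed into a softmax.

```python
import math
from typing import List

Vector = List[float]
LABELS = ["agree", "disagree", "discuss", "unrelated"]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def output_features(n_upd: List[Vector], p_tfidf: List[float],
                    p_lstm: List[float], p_cnn: List[float]) -> Vector:
    """Component O: mean of updated CNN memories plus max/mean of each similarity."""
    feats = mean_vector(n_upd)
    for sims in (p_tfidf, p_lstm, p_cnn):
        feats += [max(sims), sum(sims) / len(sims)]
    return feats


def softmax(logits: Vector) -> Vector:
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def respond(o_feats: Vector, c_lstm: Vector, c_cnn: Vector,
            weights: List[Vector], bias: Vector) -> Vector:
    """Component R: concatenate O's output with the claim vectors, then softmax.
    The linear layer (weights, bias) is an assumed classifier form."""
    x = o_feats + c_lstm + c_cnn
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


feats = output_features(
    n_upd=[[0.5, 1.0], [0.0, 0.0]],
    p_tfidf=[0.5, 0.0], p_lstm=[0.5, 0.0], p_cnn=[1.0, 0.0],
)
c_lstm, c_cnn = [1.0, 0.0], [0.0, 1.0]
dim = len(feats) + len(c_lstm) + len(c_cnn)
weights = [[0.0] * dim for _ in LABELS]  # toy weights, not trained values
bias = [1.0, 0.0, 0.0, 0.0]
probs = respond(feats, c_lstm, c_cnn, weights, bias)
predicted = LABELS[probs.index(max(probs))]
```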
All the memory network parameters, including those of the CNN and LSTM in the I component, the similarity matrices $M$ and $M'$ in F, and the classifier parameters in R, are jointly learned during the training process with our language adaptation.

Contrastive Language Adaptation
Memory networks are effective for stance detection in mono-lingual settings when there is sufficient training data. However, we show that these models have limited transferability to target languages with limited data. This could be due to discrepancy between the underlying data distributions in the source and target languages. We show that the performance of these networks can be trivially increased when the model, pre-trained on source data, is fine-tuned using small amounts of labeled target data. We further develop contrastive language adaptation that can exploit the labeled source data to perform well on target data. Our contrastive adaptation approach:
• encourages pairs $(d_i^s, c_i^s)$ from the source language and $(d_i^t, c_i^t)$ from the target language with the same stance labels (i.e., $y_i^s = y_i^t$) to be nearby in the embedding space. We call this mapping Stance Equal Alignment (SEA), illustrated with dotted lines in Figure 2. Note that documents and claims in the two languages are often semantically different and are not corresponding translations of each other.
• encourages pairs $(d_i^s, c_i^s)$ from the source language and $(d_i^t, c_i^t)$ from the target language with different stance labels (i.e., $y_i^s \neq y_i^t$) to be far apart in the embedding space. We call this mapping Stance Separation Alignment (SSA).

Table 1: Summary of the cross-lingual stance detection algorithm.
Input: [...] and claims $c_i^t$ in the target language with $y_i^t \in Y$
Output: assign stance labels $y_i^t \in Y$ to given unlabeled target pairs
Cross-lingual model:
1. [...] where $y' = 1$ if the source and the target have the same label, and $y' = 0$ otherwise.
2. Loop for e epochs:
3. pass $(d_i^s, c_i^s)$ to the source memory network to create its representation $r_i^s$.
4. pass $(d_j^t, c_j^t)$ to the target memory network to create its representation $r_j^t$.
5. pass $(r_i^s, y_i^s)$ to the classifier to compute its classification loss $L_{CA,i}^{s}$.
6. pass $(r_i^s, r_i^t, y'_i)$ to the language adaptation component to compute the stance alignment loss $L_{CSA,i}$.
7. compute the total loss $L_i^s = (1 - \alpha) L_{CA,i}^{s} + \alpha L_{CSA,i}$.
8. repeat steps 2-7 with a change in step 5 by passing the target sample $(r_j^t, y_j^t)$ to the classifier instead of the source sample, and compute its $L_j^t$ in step 7.
9. jointly optimize all parameters of the model using the average loss $L = \mathrm{mean}(\{L_i^s\} + \{L_j^t\})$.

We make complete use of the stance labels in the cross-lingual setting by parameterizing our model according to the distance between the source and the target samples in the embedding space. For the stance equal alignment (SEA) constraint, the objective is to minimize the distance between pairs of source and target data with the same stance labels. We achieve this using the following loss:
$$L_{SEA} = D\big(g(d_i^s, c_i^s),\, g(d_i^t, c_i^t)\big)^2, \quad (1)$$
where $g$ maps its input pair to an embedding space using our memory network or any mono-lingual model, and $D$ computes Euclidean distance. For stance separation alignment (SSA), the goal is to maximize the distance between pairs with different stance labels. We use the following loss:
$$L_{SSA} = \max\big(0,\, m - D\big(g(d_i^s, c_i^s),\, g(d_i^t, c_i^t)\big)\big)^2, \quad (2)$$
where we maximize the distance between pairs with different stance labels up to the margin $m$.
The margin parameter $m$ specifies the extent of separability in the embedding space. We can further use any classification loss to enforce classification alignment (CA). We use categorical cross-entropy and call it the Classification Alignment loss, $L_{CA}$.
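We develop our overall language adaptation loss, named Contrastive Stance Alignment loss, $L_{CSA}$, by combining $L_{SEA}$ and $L_{SSA}$ as follows:
$$L_{CSA} = y' L_{SEA} + (1 - y') L_{SSA}, \quad (3)$$
where $y' = 1$ if the source and the target samples have the same stance label, and $y' = 0$ otherwise. Finally, the total loss of our cross-lingual stance detection model is defined as follows:
$$L = (1 - \alpha) L_{CA} + \alpha L_{CSA}, \quad (4)$$
where the $\alpha$ parameter controls the balance between classification and language adaptation losses, which we optimize on the validation dataset.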
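A minimal sketch of these losses on toy embeddings is given below. It assumes the standard contrastive form: squared Euclidean distance for same-label pairs, a squared hinge on the margin for different-label pairs, a binary same-label switch between the two, and the (1 − α)/α weighting of the total loss. The function names and example values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sea_loss(r_s: Vector, r_t: Vector) -> float:
    """Stance equal alignment: pull same-label pairs together
    (squared distance is an assumption of this sketch)."""
    return euclidean(r_s, r_t) ** 2


def ssa_loss(r_s: Vector, r_t: Vector, margin: float) -> float:
    """Stance separation alignment: push different-label pairs apart,
    but only until they are at least `margin` away from each other."""
    return max(0.0, margin - euclidean(r_s, r_t)) ** 2


def csa_loss(r_s: Vector, r_t: Vector, same_label: bool, margin: float) -> float:
    """Contrastive stance alignment: SEA for same-label pairs, SSA otherwise."""
    y = 1.0 if same_label else 0.0
    return y * sea_loss(r_s, r_t) + (1.0 - y) * ssa_loss(r_s, r_t, margin)


def total_loss(l_ca: float, l_csa: float, alpha: float) -> float:
    """Weighted combination of classification and alignment losses."""
    return (1.0 - alpha) * l_ca + alpha * l_csa


r_source, r_target = [0.0, 0.0], [3.0, 4.0]           # Euclidean distance is 5
same = csa_loss(r_source, r_target, True, margin=6.0)
different_near = csa_loss(r_source, r_target, False, margin=6.0)
different_far = csa_loss(r_source, r_target, False, margin=4.0)
combined = total_loss(l_ca=0.5, l_csa=same, alpha=0.7)
```

With a margin of 4, the different-label pair is already farther apart than the margin and incurs no penalty, whereas a margin of 6 still pushes the pair apart.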
Information Flow: Our overall cross-lingual model for stance detection is shown in Figure 1, and a summary of the algorithm is presented in Table 1. As Figure 1 shows, each source and target pair is passed to the source and to the target memory network, respectively, to obtain its corresponding representation (Lines 3-4 in Table 1). The source representation and its gold stance label are passed to the classifier to compute the classification loss (Line 5). In addition, the source and the target representations, in conjunction with a binary parameter ($y'$, which is 1 if the source and the target have the same stance label, and 0 otherwise), are passed to the language adaptation component to compute the contrastive stance alignment loss $L_{CSA}$ (Line 6). Finally, the total loss is computed based on Equation (4) (Line 7). The classifier also uses labeled target samples to create a shared embedding space and to fine-tune itself with respect to the target language. For this purpose, we repeat the above steps by switching the target and the source pipelines (Line 8). Finally, we compute the average of all losses and we use it to optimize the parameters of our model (Line 9).
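One possible reading of Lines 5-9 is sketched below: each source-target pair yields a source-side and a target-side total loss that share the same alignment term, and the model is optimized on the mean of all of them. The pairing scheme, the function names, and the loss values are assumptions for illustration; the per-sample losses are taken as given numbers rather than computed by a network.

```python
from statistics import mean
from typing import List


def pair_label(y_source: str, y_target: str) -> int:
    """Binary alignment label: 1 if both samples share a stance, else 0."""
    return 1 if y_source == y_target else 0


def batch_loss(ca_source: List[float], ca_target: List[float],
               csa: List[float], alpha: float) -> float:
    """Average the per-pair total losses of both pipelines.

    ca_source[i], ca_target[i]: classification losses of the i-th pair's
    source and target samples; csa[i]: their shared alignment loss.
    """
    source_side = [(1 - alpha) * ca + alpha * c for ca, c in zip(ca_source, csa)]
    target_side = [(1 - alpha) * ca + alpha * c for ca, c in zip(ca_target, csa)]
    return mean(source_side + target_side)


labels = [pair_label("agree", "agree"), pair_label("agree", "discuss")]
loss = batch_loss(
    ca_source=[1.0, 2.0],
    ca_target=[3.0, 4.0],
    csa=[0.0, 2.0],
    alpha=0.5,
)
```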
Pre-training for Language Adaptation: Pretraining has been found effective in many language adaptation settings (Tzeng et al., 2017). To investigate the effect of pre-training, we first pre-train the source memory network and the classifier using D s (only the top pipeline in Figure 1), and then we apply language adaptation with the full model.

Experiments
Data and Settings. As source data, we use the Fake News Challenge dataset, which contains 75.4K claim-document pairs in English with {agree, disagree, discuss, unrelated} as stance labels. As target data, we use the Arabic benchmark dataset, which contains 3K claim-document pairs. We perform 5-fold cross-validation on the Arabic dataset, using each fold in turn for testing, and keeping 80% of the remaining data for training and 20% for development. We use 300-dimensional pre-trained cross-lingual Wikipedia word embeddings from MUSE (Lample et al., 2018). We use 300-dimensional units for the LSTM and 100 feature maps with filter width of 5 for the CNN. We consider the first 9 paragraphs per document, which is the median number of paragraphs in source documents. We train with Adam (Kingma and Ba, 2014) and optimize all hyper-parameters on validation data.
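The splitting protocol described above can be sketched as follows. The dataset size of 3,000 is a round stand-in for "3K", and the shuffling seed and the way folds are formed are assumptions; only the 5 folds and the 80%/20% train/development split of the remaining data come from the text.

```python
import random
from typing import List, Tuple

Split = Tuple[List[int], List[int], List[int]]


def cross_validation_splits(n_items: int, n_folds: int = 5,
                            dev_fraction: float = 0.2, seed: int = 0) -> List[Split]:
    """Return (train, dev, test) index lists, one triple per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into n_folds roughly equal folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        # Everything outside the test fold is split into dev and train.
        rest = [i for f, fold in enumerate(folds) if f != k for i in fold]
        n_dev = int(len(rest) * dev_fraction)
        dev, train = rest[:n_dev], rest[n_dev:]
        splits.append((train, dev, test))
    return splits


splits = cross_validation_splits(3000)
train, dev, test = splits[0]
```

With 3,000 items, each fold has 600 test examples, leaving 1,920 for training and 480 for development.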
Baselines. We consider the following baselines:
• Heuristic: Given the imbalanced nature of our data, we use two heuristic baselines where all test examples are labeled as unrelated or agree. The former is a majority class baseline favoring accuracy and macro-F1, while the latter is better for weighted accuracy.
• Gradient Boosting: This is a Gradient Boosting classifier with n-gram features as well as indicators for refutation and polarity.
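To see why the majority-class heuristic favors accuracy and macro-F1, the sketch below scores both constant predictors on a small, made-up label distribution. The toy gold labels are hypothetical and do not reproduce the actual class counts; weighted accuracy is omitted because its definition is not given here.

```python
from typing import List

LABELS = ["agree", "disagree", "discuss", "unrelated"]


def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1; classes never predicted score 0."""
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Toy imbalanced test set: 7 unrelated, 2 agree, 1 discuss.
gold = ["unrelated"] * 7 + ["agree"] * 2 + ["discuss"]
all_unrelated = ["unrelated"] * len(gold)
all_agree = ["agree"] * len(gold)

acc_unrelated = accuracy(gold, all_unrelated)
acc_agree = accuracy(gold, all_agree)
f1_unrelated = macro_f1(gold, all_unrelated, LABELS)
f1_agree = macro_f1(gold, all_agree, LABELS)
```

On this toy data, the all-unrelated predictor scores higher than the all-agree predictor on both accuracy and macro-F1, simply because unrelated is the dominant class.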
Results. Table 2 shows the performance of all models on the target Arabic test set. The All-unrelated and All-agree baselines perform poorly across evaluation measures; All-unrelated performs better than All-agree because unrelated is the dominant class (∼68% of examples). Rows 3-6 show that Gradient Boosting and EnrichedMLP yield similar results, while TFMLP performs the worst. We attribute this to the advanced features used in the two former models. Gradient Boosting has better accuracy due to its better performance on the dominant class. Note that Ensemble performs poorly because of the limited labeled data, which is insufficient to train a good CNN model. Rows 7-9 show the results for the mono-lingual memory network (MN). The performance of this model when trained on Arabic data only (row 7) is comparable to previous baselines (rows 3-6). However, it performs poorly if trained on source English data and tested on Arabic test data (row 8). The model performs best (in terms of weighted accuracy and F1) if first pre-trained on source data and then fine-tuned on target training data (row 9).
Row 10 in Table 2 shows the results for adversarial memory network (ADMN). It improves the performance of mono-lingual MN on weighted accuracy and F1, but its accuracy significantly drops. This is because adversarial approaches give higher weights to samples of the majority class (i.e., unrelated) which makes classification more challenging for the discriminator (Montahaei et al., 2018).
Row 11 shows the results for our cross-lingual memory network (CLMN) with (α = .7); α controls the balance between classification and language adaptation losses (tuned using validation data). CLMN outperforms other baselines in terms of weighted accuracy and F1 while showing comparable accuracy. We show that the improvement is due to language adaptation being able to effectively transfer knowledge from the source to target language (see Section 4.2).
The last column in Table 2 shows that unrelated examples are the easiest ones. Also, although the agree and the discuss classes have roughly the same size, i.e., 474 and 409 examples, respectively, the results for agree are notably higher. This is mainly because the documents that discuss a claim often share the same topic with the claim, but they do not take a stance. In addition, the disagree examples are the most difficult ones; this class is by far the smallest one, with only 87 examples.

Table 3 shows that CLMN without pretraining (α = .7) performs better on target test data than CLMN with pretraining (α = .3); recall that α controls the balance between classification and language adaptation losses. Our further analysis shows that pretraining biases the model toward the source language. Figure 3 shows the impact of pretraining on macro-average F1 score for CLMN across different values of α on validation data. While the model without pretraining achieves its best performance with a large α (α = 0.7), the model with pretraining performs well with a smaller α (α = 0.3). This suggests that our model can capture the characteristics of the source dataset via pretraining when using small supervision from language adaptation (i.e., small α). However, pretraining introduces bias to the source space and the performance drops when larger weights are given to language adaptation; see the results with pretraining in Figure 3.

Assessment of Model Transferability
The improvements of the CLMN model over the mono-lingual MN models that use the target only, the source only, or both the target and the source (rows 7-9 in Table 2, respectively) indicate its transferability. We further estimate transferability by measuring the accuracy of the models in extracting evidence that supports their predictions. A more accurate model should better transfer knowledge to the target language by accurately learning the relations between claims and pieces of evidence. Our target data has annotations (in terms of binary labels) for each piece of evidence (here, a paragraph) that indicate whether it is a rationale for the agree or for the disagree class. Moreover, our inference component (F) has a claim-evidence similarity vector, $P_j^{cnn}$, which can be used to rank pieces of evidence from the target document against the target claim.
We use the gold data and the rankings produced by our model in order to measure its precision in extracting evidence that supports its predictions. Figure 4 shows that our CLMN model achieves precision of 40.2, 55.9, 66.0, 72.7, and 79.2 at ranks 1-5 respectively, and outperforms mono-lingual MN models. This indicates that CLMN can better generalize and transfer knowledge to the target language through learning relations between pieces of evidence and claims.
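The exact definition of precision at rank k is not spelled out here. One reading that is consistent with scores growing from rank 1 to rank 5 is to count an example as correct at rank k if any of its top-k ranked paragraphs is a gold rationale. The sketch below follows that assumed reading; the function names, scores, and gold labels are made up for illustration.

```python
from typing import List, Tuple

Example = Tuple[List[float], List[int]]  # (similarity scores, binary gold labels)


def hit_at_k(scores: List[float], gold: List[int], k: int) -> bool:
    """True if any of the k highest-scored paragraphs is a gold rationale."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return any(gold[j] for j in ranked[:k])


def precision_at_ranks(examples: List[Example], max_k: int = 5) -> List[float]:
    """Fraction of examples with a gold rationale in the top k, for k = 1..max_k."""
    return [
        sum(hit_at_k(scores, gold, k) for scores, gold in examples) / len(examples)
        for k in range(1, max_k + 1)
    ]


examples = [
    ([0.9, 0.1, 0.5], [0, 0, 1]),  # gold rationale is ranked second
    ([0.2, 0.8, 0.1], [0, 1, 0]),  # gold rationale is ranked first
]
precisions = precision_at_ranks(examples, max_k=3)
```

Under this reading the scores can only stay the same or grow as k increases, which matches the trend of the reported numbers.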

Effect of Language Adaptation
Figures 5a and 5b show the classification ($L_{CA}$) versus contrastive stance ($L_{CSA}$) losses obtained from our best language adaptation model (i.e., without pretraining) across training epochs and α values. The results are averaged on validation data when performing 5-fold cross-validation. As Figure 5a shows, there is greater reduction in the classification loss for smaller values of α, i.e., when classification loss contributes more to the overall loss; see Equation (4). On the other hand, Figure 5b shows that the CSA loss decreases with larger values of α as the model pays more attention to the CSA loss; see the red and green lines in Figure 5b. These results indicate that our language adaptation model can find a good balance between the classification loss and the CSA loss, with the value of α = .7 yielding the best performance.

Figure 6 compares the classification $L_{CA}$, contrastive stance $L_{CSA}$, and total $L$ losses obtained by our CLMN model on the validation dataset across training epochs when the loss weight parameter (α) is set to its best value. Figures 6a and 6b show the results without and with pretraining for α = .7 and α = .3 respectively. Without pretraining (Figure 6a), the classification (light-blue line) and CSA (dark-blue line) losses both decrease up to epoch 20, after which the classification loss keeps decreasing, but the CSA loss starts increasing. With pretraining (Figure 6b), the CSA loss rapidly decreases for the first 10 epochs (even though it has a small effect as α = .3), and then continues with a smooth trend. This is because, during the initial training epochs, the model is biased to the source embedding space due to pretraining, and therefore the source and the target examples are far from each other. Then, our language adaptation model aligns the source and the target examples to form a much better shared embedding space and this alignment strategy yields a rapid decrease of the CSA loss in the first few epochs.
Yet, in contrast to the CSA loss, the classification loss increases in the first few epochs. This is because the model enforces alignment between the source and the target samples due to the large distances. Finally, the total loss (orange line) indicates a good balance between the classification and the language adaptation losses, and it consistently decreases during training.

Related Work
Domain Adaptation. Previous work has presented several domain adaptation techniques. Unsupervised domain adaptation approaches (Ganin and Lempitsky, 2015;Long et al., 2016;Muandet et al., 2013;Gong et al., 2012) attempt to align the distribution of features in the embedding space mapped from the source and the target domains.
A limitation of such approaches is that, even with perfect alignment, there is no guarantee that the same-label examples from different domains would map nearby in the embedding space. Supervised domain adaptation (Daumé and Marcu, 2006;Becker et al., 2013;Bergamo and Torresani, 2010) attempts to encourage same-label examples from different domains to map nearby in the embedding space. While supervised approaches perform better than unsupervised ones, recent work (Motiian et al., 2017) has demonstrated superior performance by additionally encouraging class separation, meaning that examples from different domains and with different labels should be projected as far apart as possible in the embedding space. Here, we combined both types of alignments for cross-lingual stance detection.
Stance Detection. This is an important component for automatic fact-checking systems and veracity inference (Zhang et al., 2019). There have been some nuances in the way researchers have defined the stance detection task. Mohammad et al. (2016) and Zarrella and Marsh (2016) worked on stances regarding target propositions, e.g., entities, concepts or events, as in-favor, against, or neither. Most commonly, stance detection has been defined with respect to a claim as agree, disagree, discuss or unrelated (Hanselowski et al., 2018; Xu et al., 2018). Previous work mostly developed models with rich hand-crafted features such as words, word embeddings, and sentiment lexicons (Riedel et al., 2017; Baird et al., 2017; Hanselowski et al., 2018). More recently, a mono-lingual and feature-light memory network was presented for stance detection.
In this paper, we built on this work to extend previous efforts in stance detection to a cross-lingual setting, achieving state-of-the-art results on the target language.

Conclusion and Future Work
We proposed an effective language adaptation approach to align class labels in source and target languages for accurate cross-lingual stance detection. Moreover, we investigated the behavior of our model in detail and showed that it offers sizable performance gains over a number of competing approaches. In future work, we will extend our language adaptation model to document retrieval and check-worthy claim detection tasks.