Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher

Cross-lingual text classification alleviates the need for manually labeled documents in a target language by leveraging labeled documents from other languages. Existing approaches for transferring supervision across languages require expensive cross-lingual resources, such as parallel corpora, while less expensive cross-lingual representation learning approaches train classifiers without target labeled documents. In this work, we propose a cross-lingual teacher-student method, CLTS, that generates “weak” supervision in the target language using minimal cross-lingual resources, in the form of a small number of word translations. Given a limited translation budget, CLTS extracts and transfers only the most important task-specific seed words across languages and initializes a teacher classifier based on the translated seed words. Then, CLTS iteratively trains a more powerful student that also exploits the context of the seed words in unlabeled target documents and outperforms the teacher. CLTS is simple and surprisingly effective in 18 diverse languages: by transferring just 20 seed words, even a bag-of-words logistic regression student outperforms state-of-the-art cross-lingual methods (e.g., based on multilingual BERT). Moreover, CLTS can accommodate any type of student classifier: leveraging a monolingual BERT student leads to further improvements and outperforms even more expensive approaches by up to 12% in accuracy. Finally, CLTS addresses emerging tasks in low-resource languages using just a small number of word translations.


Introduction
The main bottleneck in using supervised learning for multilingual document classification is the high cost of obtaining labeled documents for all of the target languages. To address this issue in a target language L_T, we consider a cross-lingual text classification approach that requires labeled documents only in a source language L_S and not in L_T. Existing approaches for transferring supervision across languages rely on large parallel corpora or machine translation systems, which are expensive to obtain and are not available for many languages. To scale beyond high-resource languages, multilingual systems have to reduce the cross-lingual requirements and operate under a limited budget of cross-lingual resources. Such systems typically ignore target-language supervision, and rely on feature representations that bridge languages, such as cross-lingual word embeddings (Ruder et al., 2019) or multilingual transformer models (Wu and Dredze, 2019; Pires et al., 2019). This general approach is less expensive but has a key limitation: by not considering labeled documents in L_T, it may fail to capture predictive patterns that are specific to L_T. Its performance is thus sensitive to the quality of pre-aligned features (Glavaš et al., 2019).
In this work, we show how to obtain weak supervision for training accurate classifiers in L T without using manually labeled documents in L T or expensive document translations. We propose a novel approach for cross-lingual text classification that transfers weak supervision from L S to L T using minimal cross-lingual resources: we only require a small number of task-specific keywords, or seed words, to be translated from L S to L T . Our core idea is that the most indicative seed words in L S often translate to words that are also indicative in L T . For instance, the word "wonderful" in English indicates positive sentiment, and so does its translation "magnifique" in French. Thus, given a limited budget for word translations (e.g., from a bilingual speaker), only the most important seed words should be prioritized to transfer task-specific information from L S to L T .
Having access only to limited cross-lingual resources creates important challenges, which we address with a novel cross-lingual teacher-student method, CLTS (see Figure 1).
Efficient transfer of supervision across languages: As a first contribution, we present a method for cross-lingual transfer in low-resource settings with a limited word translation budget. CLTS extracts the most important seed words using the translation budget as a sparsity-inducing regularizer when training a classifier in L S . Then, it transfers seed words and the classifier's weights across languages, and initializes a teacher classifier in L T that uses the translated seed words.
Effective training of classifiers without using any labeled target documents: The teacher, as described above, predicts meaningful probabilities only for documents that contain translated seed words. Because translations can induce errors and the translation budget is limited, the translated seed words may be noisy and not comprehensive for the task at hand. As a second contribution, we extend the "weakly-supervised co-training" method of Karamanolakis et al. (2019) to our cross-lingual setting for training a stronger student classifier using the teacher and unlabeled-only target documents. The student outperforms the teacher across all languages by 59.6%.
Robust performance across languages and tasks: As a third contribution, we empirically show the benefits of generating weak supervision in 18 diverse languages and 4 document classification tasks. With as few as 20 seed-word translations and a bag-of-words logistic regression student, CLTS outperforms state-of-the-art methods relying on more complex multilingual models, such as multilingual BERT, across most languages. Using a monolingual BERT student leads to further improvements and outperforms even more expensive approaches ( Figure 2). CLTS does not require cross-lingual resources such as parallel corpora, machine translation systems, or pre-trained multilingual language models, which makes it applicable in low-resource settings. As a preliminary exploration, we address medical emergency situation detection in Uyghur and Sinhalese with just 50 translated seed words per language, which could be easily obtained from bilingual speakers.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 defines our problem of focus. Section 4 presents our cross-lingual teacher-student approach. Section 5 describes our experimental setup and results. Finally, Section 6 concludes and suggests future work.

Related Work
Relevant work spans cross-lingual text classification and knowledge distillation.

Cross-Lingual Text Classification
We focus on a cross-lingual text classification scenario with labeled data in the source language L S and unlabeled data in the target language L T . We review the different types of required cross-lingual resources, starting with the most expensive types.
Annotation Projection and Machine Translation. With parallel corpora (i.e., corpora where each document is written in both L S and L T ), a classifier trained in L S predicts labels for documents in L S and its predictions are projected to documents in L T to train a classifier in L T (Mihalcea et al., 2007;Rasooli et al., 2018). Unfortunately, parallel corpora are hard to find, especially in low-resource domains and languages.
Without parallel corpora, documents can be translated using machine translation (MT) systems (Wan, 2008, 2009; Salameh et al., 2015; Mohammad et al., 2016). However, high-quality MT systems are limited to high-resource languages. Even when an MT system is available, translations may change document semantics and degrade classification accuracy (Salameh et al., 2015).

Cross-Lingual Representation Learning
Other approaches rely on less expensive resources to align feature representations across languages, typically in a shared feature space to enable cross-lingual model transfer.
Recently, multilingual transformer models were pre-trained in multiple languages in parallel using language modeling objectives (Devlin et al., 2019; Conneau and Lample, 2019). Multilingual BERT, a version of BERT (Devlin et al., 2019) that was trained on 104 languages in parallel without using any cross-lingual resources, has received significant attention (Karthikeyan et al., 2019; Singh et al., 2019; Rogers et al., 2020). Multilingual BERT performs well on zero-shot cross-lingual transfer (Wu and Dredze, 2019; Pires et al., 2019) and its performance can be further improved by considering target-language documents through self-training (Dong and de Melo, 2019). In contrast, our approach does not require multilingual language models and sometimes outperforms multilingual BERT using a monolingual BERT student.

Knowledge Distillation
Our teacher-student approach is related to "knowledge distillation" (Buciluǎ et al., 2006;Ba and Caruana, 2014;Hinton et al., 2015), where a student classifier is trained using the predictions of a teacher classifier. Xu and Yang (2017) apply knowledge distillation for cross-lingual text classification but require expensive parallel corpora. MultiFiT (Eisenschlos et al., 2019) trains a classifier in L T using the predictions of a cross-lingual model, namely, LASER (Artetxe and Schwenk, 2019), that also requires large parallel corpora. Vyas and Carpuat (2019) classify the semantic relation (e.g., synonymy) between two words from different languages by transferring all training examples across languages. Our approach addresses a different problem, where training examples are full documents (not words), and transferring source training documents would require MT. Related to distillation is the semi-supervised approach of Shi et al. (2010) that trains a target classifier by transferring a source classifier using high-coverage dictionaries. Our approach is similar, but trains a classifier using sparsity regularization, and translates only the most important seed words.

Problem Definition
Consider a source language L_S, a target language L_T, and a classification task with K predefined classes of interest Y = {1, . . . , K} (e.g., sentiment categories). Labeled documents D_S are available in L_S, while only unlabeled documents D_T are available in L_T; each target document x_i^T is a sequence of words from the target vocabulary V_T. We assume that there is no significant shift in the conditional distribution of labels given documents across languages. Furthermore, we assume a limited translation budget, so that up to B words can be translated from L_S to L_T.
Our goal is to use the labeled source documents D S , the unlabeled target documents D T , and the translations of no more than B source words to train a classifier that, given an unseen test document x T i in the target language L T , predicts the corresponding label y i ∈ Y.

Cross-Lingual Teacher-Student
We now describe our cross-lingual teacher-student method, CLTS, for cross-lingual text classification. Given a limited budget of B translations, CLTS extracts only the B most important seed words in L S (Section 4.1). Then, CLTS transfers the seed words and their weights from L S to L T , to initialize a classifier in L T (Section 4.2). Using this classifier as a teacher, CLTS trains a student that predicts labels using both seed words and their context in target documents (Section 4.3).

Seed-Word Extraction in L S
CLTS starts by automatically extracting a set G_k^S of indicative seed words per class k in L_S. Previous extraction approaches, such as tf-idf variants (Angelidis and Lapata, 2018), have been effective in monolingual settings with limited labeled data. In our scenario, with sufficiently many labeled source documents and a limited translation budget B, we propose a different approach based on a supervised classifier trained with sparsity regularization.
Specifically, CLTS extracts seed words from the weights W ∈ R^{K×|V_S|} of a classifier trained using D_S. Given a source document x_i^S with a bag-of-words encoding h_i^S ∈ R^{|V_S|}, the classifier predicts class probabilities p_i = (p_i^1, . . . , p_i^K) = softmax(W h_i^S). CLTS includes the word v_c ∈ V_S in G_k^S if the classifier considers it to increase the probability p_i^k through a positive weight W_kc:

G_k^S = {v_c ∈ V_S : Ŵ_kc > 0}.   (1)

The set of all source seed words G^S = G_1^S ∪ · · · ∪ G_K^S may be much larger than the translation budget B. We encourage the classifier to capture only the most important seed words during training through sparsity regularization:

Ŵ = argmin_W Σ_i L(p_i, y_i) + λ_B R_sparse(W),   (2)

where L is the training loss function (logistic loss), R_sparse(·) is a sparsity regularizer (L1 norm), and λ_B ∈ R is a hyperparameter controlling the relative power of R_sparse. Higher λ_B values lead to sparser matrices Ŵ and thus to fewer seed words. Therefore, we tune λ_B to be as high as possible while at the same time leading to the extraction of at least B seed words. After training, G^S consists of the B seed words with the highest weights.

Cross-Lingual Seed Weight Transfer
We now describe our cross-lingual transfer method. CLTS transfers both translated seed words and their learned weights to initialize a "weak" classifier in L_T that considers translated seed words and their relative importance for the target task. Specifically, CLTS first translates the B seed words in G^S into a set G^T of seed words in L_T. Then, for each translation pair (v^S, v^T), CLTS transfers the column for v^S in Ŵ to the corresponding column for v^T in a K × |V_T| matrix Ẑ:

Ẑ_{·,v^T} = Ŵ_{·,v^S}.   (3)

Importantly, for each word, we transfer the weights for all classes (instead of just a single weight Ŵ_kc) across languages. Therefore, without using any labeled documents in L_T, CLTS constructs a classifier that, given a test document x_j^T in L_T, predicts class probabilities q_j = (q_j^1, . . . , q_j^K):

q_j = softmax(Ẑ h_j^T),   (4)
q_j^k = exp(ẑ_k h_j^T) / Σ_{k'=1}^K exp(ẑ_{k'} h_j^T),   (5)

where h_j^T ∈ R^{|V_T|} is a bag-of-words encoding for x_j^T and ẑ_k is the k-th row of Ẑ. Note that the columns of Ẑ for non-seed words in V_T are all zeros, and thus this classifier predicts meaningful probabilities only for documents with seed words in G^T.
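The weight transfer and the resulting teacher can be sketched in a few lines of NumPy; the vocabularies, weight values, and dictionary below are toy assumptions, not the paper's data:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

src_vocab = {"wonderful": 0, "terrible": 1, "plot": 2}
tgt_vocab = {"magnifique": 0, "horrible": 1, "intrigue": 2, "aime": 3}
translations = {"wonderful": "magnifique", "terrible": "horrible"}  # B = 2

K = 2
W_hat = np.array([[0.0, 1.4, 0.0],    # class 0 (negative) weights over V_S
                  [2.1, 0.0, 0.0]])   # class 1 (positive) weights over V_S

# Transfer the whole column of W_hat (weights for ALL classes) for each
# translated seed word into the target matrix Z_hat; non-seed columns stay 0.
Z_hat = np.zeros((K, len(tgt_vocab)))
for v_s, v_t in translations.items():
    Z_hat[:, tgt_vocab[v_t]] = W_hat[:, src_vocab[v_s]]

# Teacher prediction for a target document containing "magnifique":
h_T = np.zeros(len(tgt_vocab))
h_T[tgt_vocab["magnifique"]] = 1.0
q = softmax(Z_hat @ h_T)
print(q)
```

Because non-seed columns of Z_hat are zero, a document with no seed word produces uniform logits, which is why the teacher is only trusted on documents that contain seed words.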

Teacher-Student Co-Training in L T
We now describe how CLTS trains a classifier in L_T that leverages indicative features that may not be captured by the small set of translated seed words. As illustrated in Figure 3, translated seed words (e.g., "parfait") often co-occur with other words (e.g., "aime," meaning "love") that have zero weight in Ẑ but are also helpful for the task at hand. To exploit such words in the absence of labeled target documents, we extend the monolingual weakly-supervised co-training method of Karamanolakis et al. (2019) to our cross-lingual setting, and use our classifier based on translated seed words as a teacher to train a student, as we describe next. First, CLTS uses our classifier from Equation 5 as a teacher to predict soft labels q_j for the unlabeled documents x_j^T ∈ D_T that contain seed words. Note that our teacher with weights transferred across languages is different than that of Karamanolakis et al. (2019), which simply "counts" seed words.
Next, CLTS trains a student f^T that also exploits the context of the seed words. Given a document x_j^T in L_T, the student predicts class probabilities:

r_j = f^T(x_j^T; θ),   (6)

where the predictor function f^T with weight parameters θ can be of any type, such as a pre-trained transformer-based classifier that captures language-specific word composition. The student is trained via the "distillation" objective:

min_θ Σ_j H(q_j, r_j) + λ R(θ),   (7)

where H(q, r) = −Σ_k q^k log r^k is the cross entropy between the teacher's and the student's predictions, R(·) is a regularizer (L2 norm), and λ ∈ R is a hyperparameter controlling the relative power of R. Importantly, through extra regularization (R, dropout) the student also associates non-seed words with target classes, and generalizes better than the teacher by making predictions even for documents that do not contain any seed words.
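The distillation objective above (cross entropy between teacher and student predictions, plus an L2 penalty) can be written as a small NumPy function; the probability values below are illustrative, not from the paper:

```python
import numpy as np

def distillation_loss(q, r, theta, lam=0.01):
    """Mean cross entropy H(q, r) between teacher predictions q and student
    predictions r, plus an L2 penalty on the student weights theta."""
    h = -(q * np.log(r + 1e-12)).sum(axis=1)  # H(q, r) per document
    return h.mean() + lam * (theta ** 2).sum()

q = np.array([[0.9, 0.1], [0.2, 0.8]])  # teacher soft labels
r = np.array([[0.8, 0.2], [0.3, 0.7]])  # student predictions
theta = np.zeros(4)                      # toy student weights
loss = distillation_loss(q, r, theta)
print(round(loss, 4))
```

The loss is minimized when the student matches the teacher's soft labels; the L2 term (and dropout, for neural students) keeps the student from merely memorizing the seed words.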
Then, CLTS uses the student in place of the teacher to annotate all M unlabeled examples in D_T. While in the first iteration the training set contains only documents with seed words, in the second iteration CLTS adds all unlabeled documents to create a larger training set for the student. This also differs from Karamanolakis et al. (2019), which updates the weights of the initial seed words but does not provide pseudo-labels for documents with no seed words. This change is important in our cross-lingual setting with a limited translation budget, where the translated seed words G^T may only cover a very small subset of D_T. Algorithm 1 summarizes the CLTS method for cross-lingual classification by translating B seed words. Iterative co-training converges when the disagreement between the student's and teacher's hard predictions on unlabeled data stops decreasing. In our experiments, just two rounds of co-training are generally sufficient for the student to outperform the teacher and achieve competitive performance even with a tight translation budget B.
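Putting the pieces together, the two co-training rounds can be sketched end-to-end with a toy French-like corpus; hard teacher labels stand in for the cross-entropy objective here (scikit-learn's LogisticRegression does not accept soft labels directly), and all data and weights are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy unlabeled target corpus D_T; real corpora are much larger.
unlabeled = ["magnifique intrigue", "aime magnifique film",
             "horrible film", "horrible intrigue", "aime film", "film nul"]

vec = CountVectorizer()
H = vec.fit_transform(unlabeled).toarray()  # bag-of-words encodings h_T
idx = vec.vocabulary_

# Teacher weights Z_hat: zero everywhere except the translated seed words.
Z = np.zeros((2, len(idx)))
Z[1, idx["magnifique"]] = 2.0  # seed for the positive class
Z[0, idx["horrible"]] = 2.0    # seed for the negative class

def soft_labels(weights, H):
    logits = H @ weights.T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Round 1: train the student only on documents containing a seed word.
has_seed = H[:, [idx["magnifique"], idx["horrible"]]].sum(axis=1) > 0
q = soft_labels(Z, H[has_seed])
student = LogisticRegression()
student.fit(H[has_seed], q.argmax(axis=1))

# Round 2: the student pseudo-labels ALL unlabeled documents, including
# those with no seed word, and is retrained on this larger set.
student.fit(H, student.predict(H))

print(student.predict_proba(vec.transform(["aime film"]).toarray()))
```

After round 1 the student has already picked up context words such as "aime" (which co-occurs with the seed "magnifique"), so in round 2 it can pseudo-label documents the teacher cannot score.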

Experiments
We now evaluate CLTS for several cross-lingual text classification tasks in various languages.
Experimental Procedure: We use English as the source language, where we train a source classifier and extract B seed words using labeled documents (Section 4.1). Then, we obtain translations for B ≤ 500 English seed words using the MUSE bilingual dictionaries (Lample et al., 2018). For Uyghur and Sinhalese, which have no entries in MUSE, we translate seed words through Google Translate. The appendix reports additional seed-word translation details. We do not use labeled documents in the target language for training (Section 3). We report both the teacher's and the student's performance in L_T, averaged over 5 different runs. We consider any test document that contains no seed words as a "mistake" for the teacher.
Model Configuration: For the student, we experiment with a bag-of-ngrams (n = 1, 2) logistic regression classifier (LogReg) and a linear classifier using pre-trained monolingual BERT embeddings as features (MonoBERT; Devlin et al. (2019)). The appendix lists the implementation details. We do not optimize any hyperparameters in the target language, except for B, which we vary between 6 and 500 to understand the impact of the translation budget on performance. CLS does not contain validation sets, so we fix B = 20 and translate 10 words for each of the two sentiment classes. More generally, to cover all classes we extract and translate B/K seed words per class. We perform two rounds of teacher-student co-training, which provided most of the improvement in Karamanolakis et al. (2019).
Model Comparison: For a robust evaluation of CLTS, we compare models with different types of cross-lingual resources. Project-* uses the parallel LDC or EuroParl (EP) corpora for annotation projection (Rasooli et al., 2018). LASER uses millions of parallel sentences to train a multilingual sentence encoder (Artetxe and Schwenk, 2019). (Footnotes: the MUSE bilingual dictionaries are available at https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries; Google Translate started supporting Uyghur on February 26, 2020, and Sinhalese at an earlier, unknown, time.)
MultiFiT uses LASER to create pseudo-labels in L_T (Eisenschlos et al., 2019) and trains a classifier in L_T based on a pre-trained language model (Howard and Ruder, 2018). CLWE-par uses parallel corpora to train CLWE (Rasooli et al., 2018). MT-BOW uses Google Translate to translate test documents from L_T to L_S and applies a bag-of-words classifier in L_S (Prettenhofer and Stein, 2010). BiDRL uses Google Translate to translate documents from L_S to L_T and L_T to L_S (Zhou et al., 2016). CLDFA uses task-specific parallel corpora for cross-lingual distillation (Xu and Yang, 2017). SentiWordNet uses bilingual dictionaries with over 20K entries to transfer SentiWordNet 3.0 (Baccianella et al., 2010) to the target language and applies a rule-based heuristic (Rasooli et al., 2018). CLWE-Wikt uses bilingual dictionaries with over 20K entries extracted from Wiktionary to create CLWE for training a bidirectional LSTM classifier (Rasooli et al., 2018). MultiCCA uses bilingual dictionaries with around 20K entries to train CLWE (Ammar et al., 2016), trains a convolutional neural network (CNN) in L_S, and applies it in L_T (Schwenk and Li, 2018). CL-SCL obtains 450 word translations as "pivots" for cross-lingual domain adaptation (Prettenhofer and Stein, 2010). Our CLTS approach uses B word translations not for domain adaptation but to create weak supervision in L_T through the teacher (Teacher) for training the student (Student-LogReg or Student-MonoBERT). VECMAP uses identical strings across languages as a weak signal to train CLWE (Artetxe et al., 2017). MultiBERT uses multilingual BERT to train a classifier in L_S and applies it in L_T (Wu and Dredze, 2019) without considering documents in L_T (zero-shot setting). ST-MultiBERT further considers unlabeled documents in L_T for fine-tuning multilingual BERT through self-training (Dong and de Melo, 2019). The appendix discusses more comparisons. Figure 4 shows results for each classification task and language.
The rightmost column of each table reports the average performance across all languages (and domains for CLS). For brevity, we report the average performance across the three review domains (Books, DVD, Music) for each language. Surprisingly, in several languages CLTS outperforms even more expensive approaches that rely on parallel corpora or machine translation systems (LASER, MultiFiT, MT-BOW, BiDRL, CLDFA, CLWE-BW, Project-LDC).

Experimental Results
CLTS is effective under a minimal translation budget. Figure 5 shows CLTS's performance as a function of the number of seed words per class (B/K). Even with just 3 seed words per class, Student-MonoBERT performs remarkably well. The Student's and Teacher's performance significantly increases with B/K, and most performance gains are obtained for lower values of B/K. This is explained by the fact that CLTS prioritizes the most indicative seed words for translation. Therefore, as B/K increases, the additional seed words that are translated are less indicative than the already-translated seed words and, as a result, have lower chances of translating to important seed words in the target language. The gap between the Teacher and Student performance has a maximum value of 40 absolute accuracy points and decreases as the Teacher considers more seed words, but does not get lower than 10, highlighting that the Student learns predictive patterns in L_T that may never be considered by the Teacher.

(Figure 6 noise types: "unif" replaces a seed word with a different word sampled uniformly at random from V_T; "freq" replaces a seed word with a word randomly sampled from V_T with probability proportional to its frequency in D_T; "adv" assigns a seed word to a different random class k' ≠ k by swapping its class weights in Ẑ.)
CLTS is robust to noisy translated seed words.
In practice, an indicative seed word in L_S may not translate to an indicative word in L_T. Our results above show that the Student in CLTS performs well even when seed words are automatically translated across languages. To further understand our method's behavior with noisy translated seed words, we introduce additional simulated noise of different types and severities. According to Figure 6, "unif" and "freq" noise, which replace translated seed words with random words, affect CLTS less than "adv" noise, which introduces many erroneous teacher labels. The Student is less sensitive than the Teacher to noisy seed words: their performance gap (*-Diff) increases with the magnitude of translation noise (up to 0.7) for both "unif" and "freq" noise. The Student's accuracy is relatively high for noise rates up to 0.3, even with "adv" noise: CLTS is effective even when 30% of the translated seed words are assumed indicative for the wrong class.

(Figure 7: Top 10 extracted seed words for the "medical emergency" class and their translations to Uyghur and Sinhalese. Google Translate erroneously returns "medical" as a Uyghur translation of the word "medical.")
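The three seed-word noise types used in this robustness analysis can be simulated in a few lines; this is an illustrative sketch with a toy vocabulary and seed set, not the paper's evaluation code:

```python
import random

random.seed(0)
tgt_vocab = ["magnifique", "horrible", "intrigue", "aime", "film"]
freqs = [5, 4, 20, 15, 40]                # toy corpus frequencies in D_T
seeds = {"magnifique": 1, "horrible": 0}  # seed word -> class

def add_noise(seeds, kind, rate):
    noisy = dict(seeds)
    for w in list(noisy):
        if random.random() >= rate:
            continue
        if kind == "unif":    # replace with a uniformly random word
            cls = noisy.pop(w)
            noisy[random.choice(tgt_vocab)] = cls
        elif kind == "freq":  # replace with a frequency-weighted random word
            cls = noisy.pop(w)
            noisy[random.choices(tgt_vocab, weights=freqs)[0]] = cls
        elif kind == "adv":   # keep the word but flip its class
            # for K = 2, "a different random class" is simply the other class
            noisy[w] = 1 - noisy[w]
    return noisy

print(add_noise(seeds, "adv", 0.3))
```

"adv" is the harshest setting because the corrupted seed word still fires on real documents, but now votes for the wrong class.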

Addressing Emerging Classification Tasks in Low-Resource Languages
We now show a preliminary exploration of CLTS for detecting medical emergency situations in the low-resource Uyghur and Sinhalese languages by just translating B=50 English seed words across languages. Figure 7 shows the top 10 seed words transferred by CLTS for the medical emergency class. We train Student-LogReg because BERT is not available for Uyghur or Sinhalese. End-to-end training and evaluation of CLTS takes just 160 seconds for Uyghur and 174 seconds for Sinhalese. The accuracy in Uyghur is 23.9% for the teacher and 66.8% for the student. The accuracy in Sinhalese is 30.4% for the teacher and 73.2% for the student. The appendix has more details. These preliminary results indicate that CLTS could be easily applied for emerging tasks in low-resource languages, for example by asking a bilingual speaker to translate a small number of seed words. We expect such correct translations to lead to further improvements over automatic translations.

Conclusions and Future Work
We presented a cross-lingual text classification method, CLTS, that efficiently transfers weak supervision across languages using minimal cross-lingual resources. CLTS extracts and transfers just a small number of task-specific seed words, and creates a teacher that provides weak supervision for training a more powerful student in the target language. We presented extensive experiments on 4 classification tasks and 18 diverse languages, including low-resource languages. Our results show that even a simple student outperforms the teacher and previous state-of-the-art approaches with more complex models and more expensive resources, highlighting the promise of generating weak supervision in the target language. In future work, we plan to extend CLTS to handle cross-domain distribution shift (Ziser and Reichart, 2018) and multiple source languages (Chen et al., 2019). It would also be interesting to combine CLTS with available cross-lingual models, and to extend CLTS to more tasks, such as cross-lingual named entity recognition (Xie et al., 2018), by considering teacher architectures beyond bag-of-seed-words.

A Appendix
For reproducibility, we provide details of our implementation (Section A.1), datasets (Section A.2), and experimental results (Section A.3). We will also open-source our Python code to help researchers replicate our experiments.

A.1 Implementation Details
We now describe implementation details for each component in CLTS: seed word extraction in L S , seed word transfer, and teacher-student co-training in L T .
Source Seed Word Extraction: The inputs to the classifier in L_S are tf-idf weighted unigram vectors. For the classifier, we use scikit-learn's logistic regression with the following parameters: penalty="l1", C=λ_B, solver="liblinear", multi_class="ovr". In other words, we address multi-class classification by training K binary "one-vs.-rest" logistic regression classifiers to minimize the L1-regularized logistic loss (LASSO).
(We use scikit-learn version 0.22.1, which does not support a "multinomial" loss with L1-penalized classifiers.) We tune λ_B by computing the "regularization path" over 50 values evenly spaced on a log scale between 0.1 and 10^7. To efficiently compute the regularization path, we use the "warm-start" technique (Koh et al., 2007), where the solution of the previous optimization step is used to initialize the solution for the next one. This is supported in scikit-learn by setting the warm_start parameter of logistic regression to True.
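A sketch of this tuning loop; we use the saga solver here, which supports warm starts with an L1 penalty in current scikit-learn, and a synthetic dataset in place of the real D_S (the budget B and the path range are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # helps the solver converge

B = 10  # minimum number of seed words (positive weights) to extract

# Walk the path from strong to weak regularization (small C = strong L1
# penalty), reusing each solution as the starting point for the next fit
# via warm_start, and stop at the sparsest model with at least B seed words.
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True,
                         max_iter=5000)
for C in np.logspace(-3, 2, 30):
    clf.set_params(C=C)
    clf.fit(X, y)
    n_seeds = int((clf.coef_ > 0).sum())
    if n_seeds >= B:
        break
print(C, n_seeds)
```

Stopping at the first C that yields at least B seed words corresponds to choosing the highest λ_B that still extracts B seed words, as described in Section 4.1.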
Seed Word Transfer: We split the translation budget B evenly across classes; varying the allocation across classes might improve efficiency, as "easier" classes may be modeled with fewer seed words. We obtain translations from the MUSE bilingual dictionaries. For Uyghur and Sinhalese, which have no entries in MUSE, we use Google Translate. For reproducibility, we cached the translations obtained from Google Translate and will share them with the code of the paper. If a source word has multiple translations in MUSE, we use all translations as noisy target seed words with the same weight, while if a seed word has no translation in the target language, then we directly use it as a target seed word (this may be useful for named entities, emojis, etc.). Translations provided by a human annotator would possibly lead to better target seed words but, as we show here, even noisy automatic translations can be effectively used in CLTS.
Teacher-Student Co-Training: For the logistic regression (LogReg) student in L_T, we use scikit-learn's logistic regression with default parameters (including penalty="l2", C=1). The inputs to LogReg are tf-idf weighted n-gram (n = 1, 2) vectors. For our monolingual BERT (MonoBERT) student, we use the following pre-trained models from huggingface:
• English: bert-base-cased
• Spanish: dccuchile/bert-base-spanish-wwm-cased
• French: camembert-base
• German: bert-base-german-cased
• Italian: dbmdz/bert-base-italian-xxl-cased
• Russian: DeepPavlov/rubert-base-cased
• Chinese: bert-base-chinese
• Japanese: bert-base-japanese
We use the default hyperparameters in the "Transformers" library (Wolf et al., 2019) and do not re-train (with the language modeling objective) MonoBERT in the target domain. To avoid label distribution shift because of iterative co-training, we balance teacher-labeled documents in D_T by keeping the same number of documents across classes before training the student. We perform two rounds of teacher-student co-training, which has been shown to gain most of the improvement in Karamanolakis et al. (2019). Table 11 reports the model parameters for each dataset and language. We do not tune any model hyperparameters and use default values instead.

A.2 Dataset Details
Document Classification in MLDoc: The Multilingual Document Classification Corpus (MLDoc; Schwenk and Li (2018)) contains Reuters news documents in English, German, Spanish, French, Italian, Russian, Chinese, and Japanese. Each document is labeled with one of four categories:
• CCAT (Corporate/Industrial)
• ECAT (Economics)
• GCAT (Government/Social)
• MCAT (Markets)
MLDoc was pre-processed and split by Schwenk and Li (2018) into 1,000 training, 1,000 validation, and 4,000 test documents for each language (Table 1). We use labeled training documents only in English for training the source classifier. We treat training documents in German, Spanish, French, Italian, Russian, Chinese, and Japanese as unlabeled in CLTS by ignoring the labels.

Review Sentiment Classification in CLS:
The Cross-Lingual Sentiment corpus (CLS; Prettenhofer and Stein (2010)) contains Amazon product reviews in English, German, French, and Japanese. Each language includes product reviews from three domains: books, dvd, and music. Each labeled document includes a binary (positive, negative) sentiment label. Table 2 reports dataset statistics. Validation sets are not available for CLS. We use labeled training documents only in English for training the source classifier. We ignore the labels of training documents in German, French, and Japanese, and use them as unlabeled documents in CLTS.

Table 1: MLDoc dataset statistics per language.

Language | Train | Dev | Test
English (En) | 1,000 | 1,000 | 4,000
German (De) | 1,000 | 1,000 | 4,000
Spanish (Es) | 1,000 | 1,000 | 4,000
French (Fr) | 1,000 | 1,000 | 4,000
Italian (It) | 1,000 | 1,000 | 4,000
Russian (Ru) | 1,000 | 1,000 | 4,000
Chinese (Zh) | 1,000 | 1,000 | 4,000
Japanese (Ja) | 1,000 | 1,000 | 4,000

Twitter Sentiment Classification in TwitterSent: The Twitter sentiment corpora contain tweets in English (En), Spanish (Es), Croatian (Hr), Hungarian (Hu), Polish (Pl), Portuguese (Pt), Slovak (Sk), Slovenian (Sl), and Swedish (Sv). We use the pre-processed and tokenized data provided by Rasooli et al. (2018). In addition to these tweets, Rasooli et al. (2018) also use pre-processed and tokenized Persian (Fa) product reviews from the SentiPers corpus (Hosseini et al., 2018) and manually labeled Uyghur (Ug) documents from the LDC LORELEI corpus. In the above datasets, each document is labeled with a sentiment label: positive, neutral, or negative.

Medical Emergency Situation Detection in LORELEI: Following Yuan et al. (2020), we reduce the situation classification task to medical versus non-medical emergency need. Unfortunately, our number of labeled documents for each language is different than that reported in Yuan et al. (2020). In English, we use 806 labeled documents for training the source classifier. In Uyghur, we use 5,000 unlabeled documents for training the student and 226 labeled documents for evaluation. In Sinhalese, we use 5,000 unlabeled documents for training the student and 36 labeled documents for evaluation. Given the limited number of labeled documents, we do not consider validation sets for our experiments.

A.3 Experimental Result Details
We now discuss detailed results on each dataset. In addition to the baselines reported in the main paper, we also report supervised classifiers (*-sup) that were trained on each language separately using the labeled training data, to get an estimate of the maximum achievable performance. We run CLTS 5 times using the following random seeds: [7, 20, 42, 127, 1993] and report the average performance and the standard deviation across runs. (The standard deviation for our LogReg student is negligible across all datasets, so we do not report it.) We report the results for the configuration of B that achieves the best validation performance (accuracy for MLDoc, macro-averaged F1 for TwitterSent) and also report the validation performance when a validation set is available.

(In Table 4a, we have reported the LASER configuration that achieves the best performance for each language.) As expected, the performance of supervised models that consider in-language training data is higher than that of cross-lingual models. Table 7 reports results on CLS per domain. (In Table 4b, we reported the average performance across domains for each language.) Note that MultiFiT-sup has substantially higher accuracy than MonoBERT-sup and LogReg-sup, which indicates that MultiFiT is probably a better model for this task. It would be interesting to evaluate in the future whether using MultiFiT as the student outperforms Student-MonoBERT.

Table 8 reports results on TwitterSent, SentiPers, and LORELEI. We report the best-performing approaches in Rasooli et al. (2018) that use En as a source language. We noticed that CLTS achieves its best validation performance using more seed words on the Twitter corpora than on the MLDoc and CLS corpora. We hypothesize that, because Twitter posts are shorter than news documents or reviews, the context of seed words is less rich in indicative words, so the student requires larger teacher-labeled datasets to be effective.
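The multi-seed evaluation protocol above can be sketched as follows; `train_and_eval` is a hypothetical stand-in for one full CLTS training-and-evaluation run:

```python
import statistics

SEEDS = [7, 20, 42, 127, 1993]  # random seeds used for the CLTS runs

def evaluate_over_seeds(train_and_eval, seeds=SEEDS):
    """Run the full pipeline once per random seed and aggregate the metric."""
    scores = [train_and_eval(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# Example with a stand-in evaluation function that returns a fixed accuracy:
mean_acc, std_acc = evaluate_over_seeds(lambda seed: 0.80)
```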
Note, however, that even with a tighter budget of B=60, CLTS-Student has an average accuracy of 40.5% and outperforms previous approaches that rely on dictionaries or comparable corpora.

Examples of Extracted Seed Words: Table 4 reports the 10 most important seed words extracted for each of the four news document classes in MLDoc. Table 5 reports the 10 most important seed words extracted for each binary class and domain in CLS. Figure 8 reports the 20 most important seed words extracted for each of the 3 sentiment classes in TwitterSent, SentiPers, and LORELEI. Figure 9 reports the 20 most important seed words extracted for the medical situation class in LORELEI and their translations to Uyghur and Sinhalese.

Testing CLTS in Non-English Source Languages: To evaluate whether our results generalize to non-English source languages, we run additional experiments using De, Es, and Fr as source languages in CLS. For those experiments, we also consider En as a target language. Table 9 reports the evaluation results. Across all configurations, there is no clear winner between MultiCCA and MultiBERT, but our Student-LogReg consistently outperforms both approaches, indicating that CLTS is also effective with non-English source languages.
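For reference, a minimal sketch of the kind of seed-word teacher that CLTS initializes from translated seed words: it labels an unlabeled target document by counting seed-word occurrences per class and abstains when no seed word appears. The seed sets and the plain (unweighted) counting are illustrative assumptions, not the paper's exact weighting scheme.

```python
from collections import Counter

def teacher_predict(tokens, seed_words_by_class):
    """Weak teacher: count seed-word hits per class.

    Ties resolve to the first class listed; abstain (return None)
    when the document contains no seed word at all.
    """
    hits = Counter()
    for cls, seeds in seed_words_by_class.items():
        hits[cls] = sum(1 for tok in tokens if tok in seeds)
    best_cls, best_count = hits.most_common(1)[0]
    return best_cls if best_count > 0 else None

# Hypothetical translated (German) seed words for two sentiment classes.
seeds = {"positive": {"gut", "großartig"}, "negative": {"schlecht", "langweilig"}}
print(teacher_predict("ein gut und großartig buch".split(), seeds))  # -> positive
print(teacher_predict("ohne treffer".split(), seeds))                # -> None
```

Documents on which the teacher abstains are exactly the ones that change (c) in the ablation below concerns: the student, unlike this teacher, can still label them from context.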
Ablation Study: Table 10 reports results on MLDoc obtained by changing parts of CLTS. The first row reports Student-LogReg without any changes. Change (a): using the clarity-scoring method (similar to tf-idf weighting) of Angelidis and Lapata (2018) leads to 3% lower accuracy than extracting seed words from the weights of a classifier trained with sparsity regularization. Change (b): obtaining translations through Google Translate leads to 0.8% lower accuracy than using the bilingual MUSE dictionary; we observed that, without extra context, Google Translate sometimes produces wrong translations, while the MUSE dictionaries are more accurate. Change (c): updating the Teacher as in Karamanolakis et al. (2019), where the Teacher updates seed-word qualities but ignores documents without seed words during training, leads to 1.3% lower accuracy than our approach, which replaces the teacher with the student and thus also covers documents without seed words. Change (d): removing seed words from the Student's input leads to 2.8% lower accuracy than letting the Student consider both seed words and non-seed words. Notably, even without seed words the Student still performs accurately (77.2% average accuracy across languages), indicating that it successfully exploits indicative features in the context of the seed words.
Runtime: Table 12 reports the end-to-end runtime for each experiment (i.e., the total time needed to run the script), which includes loading data, training, and evaluating CLTS. The runtime does not include dataset pre-processing, which was performed only once. We ran all experiments on a server with the following specifications: 16 CPUs, RAM: 188G, main disk: SSD 1T, storage disk: SSD 3T, GPU: Titan RTX 24G.

The 10 most important seed words extracted for each of the four news classes:

CCAT: company, inc, ltd, corp, group, profit, executive, newsroom, rating, shares
ECAT: bonds, economic, deficit, inflation, growth, tax, economy, percent, foreign, budget
GCAT: president, police, stories, party, sunday, people, opposition, beat, win, team
MCAT: traders, futures, dealers, market, bids, points, trading, day, copper, prices