Ten Pairs to Tag – Multilingual POS Tagging via Coarse Mapping between Embeddings

In the absence of annotations in the target language, multilingual models typically draw on extensive parallel resources. In this paper, we demonstrate that accurate multilingual part-of-speech (POS) tagging can be done with just a few (e.g., ten) word translation pairs. We use the translation pairs to establish a coarse linear isometric (orthonormal) mapping between monolingual embeddings. This enables the supervised source model expressed in terms of embeddings to be used directly on the target language. We further refine the model in an unsupervised manner by initializing and regularizing it to be close to the direct transfer model. Averaged across six languages, our model yields a 37.5% absolute improvement over the monolingual prototype-driven method (Haghighi and Klein, 2006) when using a comparable amount of supervision. Moreover, to highlight key linguistic characteristics of the generated tags, we use them to predict typological properties of languages, obtaining a 50% error reduction relative to the prototype model.


Introduction
After two decades of study, the best performing multilingual methods can in some cases approach their supervised monolingual analogues. To reach this level of performance, however, multilingual methods typically make use of significant parallel resources such as parallel translations or bilingual dictionaries. These resources act as substitutes for explicit annotations available in the target language for supervised methods. It is less clear what can be done without extensive parallel resources. Indeed, the motivation for our paper comes from trying to understand how little parallel data is necessary for effective multilingual transfer. Our code and data are available at https://github.com/yuanzh/transfer_pos.
In this paper, we demonstrate that only ten word translation pairs suffice for effective multilingual transfer of part-of-speech (POS) tagging. To achieve this, we integrate two sources of statistical signal. First, we enable transfer of information from the source to the target language by establishing a coarse mapping between word embeddings in the two languages on the basis of the few available translation pairs. The mapping is useful because embedding spaces share significant structural similarity across languages. Second, we leverage the potential of unsupervised monolingual models to capture language-specific syntactic properties. The two sources of signal are largely complementary. Embeddings provide a coarse alignment between languages, while unsupervised methods fine-tune the correspondences in service of the task at hand. While unsupervised methods are fragile and challenging to estimate in general, they can be helpful if initialized and regularized properly, which is our focus.
In order to transfer annotations, we align monolingual embeddings between languages. However, a full fine-grained alignment is not possible with only ten translation pairs due to differences between the languages and variations across raw corpora from which the embeddings are derived. Instead, we restrict the initial coarse mapping to be linear and isometric (orthonormal) so as to leave lengths and angles between the word vectors invariant. One advantage is that this preserves cosine similarity between vectors, which is viewed as a proxy for syntactic/semantic similarity (Mikolov et al., 2013a; Pennington et al., 2014; Herbelot and Vecchi, 2015). The resulting coarse alignment is then used to initialize and guide an unsupervised model over the target language.
Our unsupervised model is a feature-based hidden Markov model (HMM) expressed in terms of word embeddings. By establishing a common multilingual embedding space, we can map the source HMM estimated from supervised annotations directly to the target. The resulting "direct transfer" model should be further adjusted as languages differ, and the initial alignment obtained based on embeddings is imperfect. For this reason we cast the direct transfer model as a regularizer for the target HMM, and permit the HMM to further adjust the embedding transformations and relations of embeddings to the tags both globally (overall rotation and scaling) and locally (introducing small corrections).
Our two-phase approach is simple to implement, performs well, and can be adapted to other NLP tasks. We evaluate our approach on POS tagging using the multilingual universal dependency treebanks (Nivre et al., 2016). Specifically, we use English as the source language and test on three Indo-European languages (Danish, German and Spanish) and three non-Indo-European languages (Finnish, Hungarian and Indonesian). Experimental results show that our method consistently outperforms various baselines across languages. On average, our full model achieves an 8% absolute improvement over the direct transfer counterpart. We also compare against a prototype-driven tagger (Haghighi and Klein, 2006) using 14 prototypes as supervision. Our model significantly outperforms the model of Haghighi and Klein (2006), by 37.5% absolute (67.5% vs. 30%).
We also introduce a novel task-based evaluation of automatic POS taggers, where tagger predictions are used to determine typological properties of the target language. This evaluation highlights key linguistic features of the generated tags. On this task, our model achieves 80% accuracy, yielding 50% error reduction relative to the prototype model.

Related Work
Multilingual POS Tagging. Prior work on multilingual POS tagging has mainly focused on the tag projection method (Yarowsky et al., 2001; Wisniewski et al., 2014; Duong et al., 2013; Duong et al., 2014; Snyder et al., 2008; Naseem et al., 2009; Chen et al., 2011). All these approaches assume access to a large amount of parallel sentences to facilitate multilingual transfer. In our work, we focus on a more challenging scenario, in which we do not assume access to parallel sentences. Instead of projecting tag information via word alignment, the transfer in our model is driven by mapping multilingual embedding spaces. Kim et al. (2015) also use latent word representations for multilingual transfer. However, similarly to prior work, this representation is learned using parallel data.
The feasibility of POS tagging transfer without parallel data has been shown by Hana et al. (2004). The transfer is performed between typologically similar languages, which enables the model to directly transfer the transition probabilities from the source to the target. Moreover, emission probabilities are hand-engineered to capture language-specific morphological properties. In contrast, our method does not require any language-specific knowledge on the target side.

Multilingual Word Embeddings
There is an expansive body of research on learning multilingual word embeddings (Gouws et al., 2014; Faruqui and Dyer, 2014; Lu et al., 2015; Lauly et al., 2014; Luong et al., 2015). Previous work has shown their effectiveness across a wide range of multilingual transfer tasks including tagging (Kim et al., 2015), syntactic parsing (Xiao and Guo, 2014; Guo et al., 2015; Durrett et al., 2012), and machine translation (Zou et al., 2013; Mikolov et al., 2013b). However, these approaches commonly require parallel sentences or a bilingual lexicon to learn multilingual embeddings. Vulic and Moens (2015) have alleviated the requirements by inducing multilingual word embeddings directly from a document-aligned corpus such as a set of Wikipedia pages on the same theme but in different languages. However, they still used about ten thousand aligned documents as parallel supervision. Our work demonstrates that useful multilingual embeddings can be learned with a minimal amount of parallel supervision.

Multilingual POS Tagger
Our method is designed to operate in the regime where there are no parallel sentences or target annotations. We assume only a few, in our case ten, word translation pairs. This small number of translation pairs, together with the tags that they carry from the source to the target, does not provide sufficient information to train a reasonable supervised tagger, even for very close languages where word translations would be mostly one-to-one and tags fully preserved in translation. Other cues are necessary.
The few translation pairs provide just enough information to obtain a coarse global alignment between the source and target language embeddings. We limit the initial linear transformation between embeddings to isometric (orthonormal) mappings so as to preserve norms and angles (e.g., cosine similarities) between words. Once the embeddings are aligned, any source language model expressed in terms of embeddings can be mapped to a target language model. The approach is akin to direct transfer commonly applied in parsing (Zeman and Resnik, 2008), though often with more information. We use the term "direct transfer" to mean the process where no further adjustment is performed beyond the immediate mapping via (coarsely) aligned embeddings.
Direct transfer is insufficient between languages that are syntactically (even moderately) divergent. Instead, we use the directly transferred model to initialize and regularize an unsupervised tagger.
Specifically, we employ a feature-based HMM (Berg-Kirkpatrick et al., 2010) tagger for both the source and target languages with two important modifications. The emission probabilities in the source language HMM are expressed solely in terms of word embeddings (cf. skip-gram models). Such distributions can be directly transferred to the target domain. Our target language HMM is, however, equipped with additional adjustable parameters that can be learned in an unsupervised manner. These include parameters for modifying the initial global linear transformation between embeddings. Beyond this linear transformation, we also add "correction terms" to each tag-word pair that are in principle sufficient to specify any HMM. Both of these additional sets of parameters are regularized towards keeping the initial direct transfer model. As a result, our strongly governed unsupervised tagger can succeed where an unguided unsupervised tagger would typically fail.
In the remainder of this section, we describe the approach more formally, starting with the coarse alignment between embeddings, followed by the supervised feature-based HMM, and the unsupervised target language HMM.

Isometric Alignment of Word Embeddings
Here we find a linear transformation from the target language embeddings to the source language embeddings using the translation pairs. The resulting transformation permits us to directly apply any source language model on the target language, i.e., it enables direct transfer. To this end, let $V_s \in \mathbb{R}^{n_s \times d}$ and $V_t \in \mathbb{R}^{n_t \times d}$ be the word embeddings estimated for the source and target languages, respectively, with vocabulary sizes $n_s$ and $n_t$. All the embeddings are of dimension $d$. The submatrices of embeddings pertaining to $k$ anchor words (from translation pairs) are denoted as $\Sigma_s$ and $\Sigma_t$, where $\Sigma_s, \Sigma_t \in \mathbb{R}^{k \times d}$.
We find a linear transformation $P \in \mathbb{R}^{d \times d}$ that best aligns the embeddings of the translation pairs in the sense of minimizing
$$\|\Sigma_t P - \Sigma_s\|^2$$
subject to the isometric (orthonormal) constraint $P^T P = I$. We use the steepest descent algorithm (Abrudan et al., 2008) to solve this optimization problem. Once $P$ is available, we can map all the target language embeddings $V_t$ to the source language space with $V_t P$. Note that since typically in our setting $k < d$ (e.g., $k = 10$), additional constraints such as isometry are required.
An isometric mapping preserves the cosine similarity between word vectors, which is commonly used as a measure of semantic relations between words. Thus, for example, if two words have high cosine similarity in German (target), the corresponding words in English (source) should also be similar. To validate our isometric constraint further, we verify whether nearest neighbors are preserved in monolingual embeddings after translation. To this end, we take the 1,000 most frequent words in German and their translations into English and ask whether nearest neighbors are preserved if measured in terms of their monolingual embeddings. For each word vector $w_1$ and its nearest neighbor $w_2$ in German, let $e_1$ and $e_2$ be the corresponding English vectors. We compute the rank of $e_2$ in the ordered list of nearest neighbors of $e_1$. As Figure 1 shows, in more than 50% of word pairs, $e_2$ is among the top-2 neighbors of $e_1$. In over 90% of the word pairs, $e_2$ is among the top-10 closest neighbors of $e_1$.
For the purposes of comparison (see Section 5), we also introduce a linear transformation without isometry. In other words, we find $P$ that minimizes $\|\Sigma_t P - \Sigma_s\|^2$ via the Moore-Penrose pseudoinverse (Moore, 1920; Penrose, 1955). Specifically, let $\Sigma_t^{+}$ be the pseudoinverse of $\Sigma_t$. Then the solution takes the form $P = \Sigma_t^{+} \Sigma_s$, and has the minimum Frobenius norm among all possible solutions.
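As a rough illustration of the two alignment variants, the following NumPy sketch computes both on toy data. It is not our released implementation: the isometric variant here uses the closed-form SVD solution of the orthogonal Procrustes problem rather than the steepest descent algorithm cited above, and the function names and the random anchor matrices are hypothetical.

```python
import numpy as np

def isometric_alignment(sigma_t, sigma_s):
    """Orthonormal P minimizing ||sigma_t @ P - sigma_s||^2.

    Assumption: closed-form orthogonal Procrustes solution via SVD,
    swapped in for the steepest descent solver cited in the text.
    """
    u, _, vt = np.linalg.svd(sigma_t.T @ sigma_s)
    return u @ vt

def pseudoinverse_alignment(sigma_t, sigma_s):
    """Unconstrained solution P = pinv(sigma_t) @ sigma_s."""
    return np.linalg.pinv(sigma_t) @ sigma_s

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: k = 10 anchor pairs, d = 20 dimensions.
rng = np.random.default_rng(0)
k, d = 10, 20
sigma_t = rng.normal(size=(k, d))
sigma_s = rng.normal(size=(k, d))

P_iso = isometric_alignment(sigma_t, sigma_s)
P_pinv = pseudoinverse_alignment(sigma_t, sigma_s)

# An isometric map leaves cosine similarities between target vectors unchanged.
cos_before = cosine(sigma_t[0], sigma_t[1])
cos_after = cosine(sigma_t[0] @ P_iso, sigma_t[1] @ P_iso)
```

With $k < d$ and anchor embeddings of full row rank, the pseudoinverse solution can fit the anchors exactly while distorting the rest of the space, which is one way to see why a constraint such as isometry is useful in this regime.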

Supervised Source Language HMM
Here we briefly describe how we train a supervised tagger on the source language. The resulting model, together with aligned embeddings, specifies the direct transfer model. It will also be used to initialize and guide the unsupervised tagger on the target language.
Our model has the same structure as the standard HMM but we replace the transition and emission probabilities with log-linear models (cf. the feature-based HMM of Berg-Kirkpatrick et al. (2010)). The transition probabilities include all indicator features and therefore impose no additional constraints. The emission probabilities, in contrast, are expressed entirely in terms of word embeddings $v_x$ as features. More formally, the emission probability of word $x$ given tag $y$ is given by
$$p_\theta(x \mid y) = \frac{\exp\{v_x^T \mu_y\}}{\sum_{x'} \exp\{v_{x'}^T \mu_y\}}$$
Note that the parameters $\mu_y$ (one vector per tag) can be viewed as tag embeddings. This supervised tagging model is trained to maximize the joint log-likelihood with $\ell_2$-regularization over parameters.
We use the L-BFGS (Liu and Nocedal, 1989) algorithm to optimize the parameters.
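The log-linear emission distribution can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that the emission score of word $x$ under tag $y$ is the inner product $v_x^T \mu_y$, normalized over the vocabulary; the array names and the random toy parameters are hypothetical.

```python
import numpy as np

def emission_log_probs(V, mu):
    """log p(x | y) for every word x (rows) and tag y (columns).

    V:  (n_words, d) word embeddings v_x
    mu: (n_tags, d)  tag embeddings mu_y
    """
    scores = V @ mu.T                                    # scores[x, y] = v_x . mu_y
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    log_z = np.log(np.exp(scores).sum(axis=0, keepdims=True))
    return scores - log_z                                # normalize over the vocabulary

# Hypothetical toy sizes: 50 words, 14 tags, d = 20.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 20))
mu = rng.normal(size=(14, 20))
log_p = emission_log_probs(V, mu)
```

Each column of `log_p` is a distribution over the vocabulary for one tag, so the only emission parameters are the tag embeddings.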
Once the HMM has been trained, we can specify the direct transfer model. It has the same transition probabilities but the emission probabilities are modified according to $p^{dt}_\theta(x \mid y) \propto \exp\{v_x^T P \mu_y\}$, where $v_x$ is now the monolingual target embedding, transformed into the source space via $v_x^T P$. We apply the Viterbi algorithm to predict the most likely POS tag sequence.
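For completeness, Viterbi decoding over such a model can be sketched as follows. The function is a generic textbook implementation, not taken from our code; the explicit initial distribution and the two-tag toy parameters are assumptions for illustration. For direct transfer, `log_emit` would be obtained by normalizing $\exp\{v_x^T P \mu_y\}$ over the target vocabulary.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, words):
    """Most likely tag sequence for a sentence given as word indices.

    log_init:  (n_tags,)         log p(y_1)
    log_trans: (n_tags, n_tags)  log_trans[i, j] = log p(y_t = j | y_{t-1} = i)
    log_emit:  (n_words, n_tags) log_emit[x, y] = log p(x | y)
    """
    n, n_tags = len(words), log_init.shape[0]
    score = np.empty((n, n_tags))
    back = np.zeros((n, n_tags), dtype=int)
    score[0] = log_init + log_emit[words[0]]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans   # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[words[t]]
    tags = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):                  # follow back-pointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

# Hypothetical toy model: two tags, two words, "sticky" transitions.
log_init = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
best = viterbi(log_init, log_trans, log_emit, [0, 0, 1, 1])
```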

Unsupervised Target Language HMM
Our unsupervised HMM for the target language is strictly more expressive than the direct transfer model so as to better tailor it to the target language. Let $v_x$ again be the monolingual target embeddings estimated separately, prior to the HMMs. We map these vectors to the source language embedding space via $v_x^T P$ as discussed earlier, where $P$ is already set and no longer considered a parameter. The emission probabilities take the form
$$p^t_\theta(x \mid y) \propto \exp\{v_x^T P M \mu_y + \theta_{x,y}\}$$
which includes two modifications to the direct transfer model. First, we introduce an additional global linear transformation $M$ to correct the initial alignment represented by $P$. Second, we include per-symbol parameters $\theta_{x,y}$ which, in principle, are capable of specifying any emission distribution on their own.
The adjustable parameters in this model (denoted collectively $\theta$) are $M$, $\{\mu_y\}$, $\{\theta_{x,y}\}$, and the parameters pertaining to the transition probabilities. If we set $M = I$, $\theta_{x,y} = 0$ for all $x$ and $y$, and borrow $\mu_y$ and the transition parameters from the supervised HMM, then we recover the direct transfer model. Let $\theta_0$ denote this setting of the parameters. In other words, the unsupervised HMM with initial parameters $\theta_0$ is the direct transfer model. Our approach includes initializing $\theta = \theta_0$ and later regularizing $\theta$ to remain close to $\theta_0$.
The motivation behind this approach is two-fold. First, the initial alignment between embeddings was obtained only on the basis of the few available anchor words and may therefore need to be adjusted. Note that the linear transformation of embeddings now involves scaling and is no longer necessarily isometric. Second, the source and target languages differ and the embeddings are not strictly related to each other via any global linear transformation. We can interpret parameters $\theta_{x,y}$ as local (per word) non-linear deformations of the embedding vectors that specify the emission probabilities. We allow only small non-linear corrections by regularizing $\theta_{x,y}$ to remain close to zero, i.e., the values they have in $\theta_0$.
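The relationship between the target emission model and direct transfer can be checked numerically. The sketch below assumes the emission score has the form $v_x^T P M \mu_y + \theta_{x,y}$, which is consistent with the description above; the helper names, the random orthonormal $P$, and the toy sizes are hypothetical.

```python
import numpy as np

def log_softmax_over_words(scores):
    """Normalize scores[x, y] over words x, separately for each tag y."""
    scores = scores - scores.max(axis=0, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))

def target_emission_log_probs(V_t, P, M, mu, theta):
    """log p(x | y), assuming score = v_x^T P M mu_y + theta[x, y]."""
    return log_softmax_over_words(V_t @ P @ M @ mu.T + theta)

def direct_transfer_log_probs(V_t, P, mu):
    """log p(x | y) with score = v_x^T P mu_y (no further adjustment)."""
    return log_softmax_over_words(V_t @ P @ mu.T)

# Hypothetical toy setup.
rng = np.random.default_rng(0)
n_words, n_tags, d = 30, 14, 20
V_t = rng.normal(size=(n_words, d))
mu = rng.normal(size=(n_tags, d))
P, _ = np.linalg.qr(rng.normal(size=(d, d)))   # some orthonormal alignment

# Initial parameters theta_0: M = I and theta[x, y] = 0.
M0 = np.eye(d)
theta0 = np.zeros((n_words, n_tags))
at_init = target_emission_log_probs(V_t, P, M0, mu, theta0)
direct = direct_transfer_log_probs(V_t, P, mu)
```

At the initial parameters the two distributions coincide, so any change made during unsupervised training is a deviation from direct transfer that the regularizer can penalize.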
Our unsupervised HMM is estimated by maximizing the regularized log-likelihood
$$L(\theta) = \sum_{i=1}^{n} \log P_\theta(x_i) - \beta \|\theta - \theta_0\|^2$$
where $x_i$ is the $i$-th target language sentence, $P_\theta(x_i)$ is the probability the HMM with parameters $\theta$ assigns to it, and $n$ is the number of sentences in the target text to be annotated. Since all the parameters in the model are in a log-linear form, we simply use a single regularization parameter $\beta$. Once estimated, we use the Viterbi algorithm to predict the most likely POS tag sequence.
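A minimal sketch of such an objective is given below, assuming a squared $\ell_2$ penalty $\beta \|\theta - \theta_0\|^2$ and an explicit initial tag distribution. To keep the example short, the log-probability tables are passed in directly instead of being computed from $\theta$, and all names and toy values are hypothetical.

```python
import numpy as np

def log_sum_exp(a, axis=0):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sentence_log_likelihood(log_init, log_trans, log_emit, words):
    """log P(x) for one sentence via the forward algorithm in log space."""
    alpha = log_init + log_emit[words[0]]
    for x in words[1:]:
        alpha = log_sum_exp(alpha[:, None] + log_trans, axis=0) + log_emit[x]
    return float(log_sum_exp(alpha, axis=0))

def regularized_objective(sentences, log_init, log_trans, log_emit,
                          theta, theta0, beta):
    """Sum of sentence log-likelihoods minus beta * ||theta - theta0||^2.

    Simplification: the log-probability tables are given; in the full
    model they would themselves be functions of theta.
    """
    ll = sum(sentence_log_likelihood(log_init, log_trans, log_emit, s)
             for s in sentences)
    return ll - beta * float(np.sum((theta - theta0) ** 2))

# Hypothetical toy HMM: 2 tags, 3 words.
log_init = np.log(np.array([0.6, 0.4]))
log_trans = np.log(np.array([[0.7, 0.3], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]]))
sentences = [[0, 1, 2], [2, 2]]
theta0 = np.zeros(5)
beta = 0.01
at_init = regularized_objective(sentences, log_init, log_trans, log_emit,
                                theta0, theta0, beta)
moved = regularized_objective(sentences, log_init, log_trans, log_emit,
                              theta0 + 0.1, theta0, beta)
```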

Estimation Details
We maximize $L(\theta)$ using the expectation-maximization (EM) algorithm. In the E-step, we evaluate expected counts $e_{y',y}$ for tag-tag pairs and $e_{x,y}$ for word-tag pairs, using the forward-backward algorithm. The M-step searches for $\theta$ that maximizes
$$\ell(\theta) = \sum_{y',y} e_{y',y} \log p^t_\theta(y' \mid y) + \sum_{x,y} e_{x,y} \log p^t_\theta(x \mid y)$$
The maximization can be done via L-BFGS, which involves computing the gradients of $\log p^t_\theta(y' \mid y)$ and $\log p^t_\theta(x \mid y)$ with respect to $\theta$ at every iteration. Because the conditional probabilities are expressed in a log-linear form, the gradients take on the typical form of a difference between expected feature values under the E-step counts and under the model, together with a regularization term that is proportional to, e.g., $(\mu_y - \mu_{0y})$, where $\mu_{0y}$ are the initial values for $\mu_y$.
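The E-step can be sketched with a scaled forward-backward pass over one sentence. This is a generic textbook version in probability space, not our released code; the explicit initial distribution and the toy parameters are assumptions. Note that `tag_tag[y, y2]` below holds the expected count of tag `y` followed by tag `y2`.

```python
import numpy as np

def expected_counts(init, trans, emit, words):
    """Expected tag-tag and word-tag counts for one sentence.

    init:  (n_tags,)          p(y_1)
    trans: (n_tags, n_tags)   trans[i, j] = p(y_t = j | y_{t-1} = i)
    emit:  (n_words, n_tags)  emit[x, y] = p(x | y)
    """
    n, n_tags = len(words), init.shape[0]
    alpha = np.zeros((n, n_tags))
    beta = np.zeros((n, n_tags))
    scale = np.zeros(n)
    # Forward pass, rescaled at every position to avoid underflow.
    alpha[0] = init * emit[words[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[words[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass using the same scaling constants.
    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (emit[words[t + 1]] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta                    # gamma[t, y] = p(y_t = y | sentence)
    tag_tag = np.zeros_like(trans)
    for t in range(n - 1):
        nxt = emit[words[t + 1]] * beta[t + 1]
        tag_tag += alpha[t][:, None] * trans * nxt[None, :] / scale[t + 1]
    word_tag = np.zeros_like(emit)
    for t, x in enumerate(words):
        word_tag[x] += gamma[t]
    return tag_tag, word_tag

# Hypothetical toy HMM: 2 tags, 3 words.
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
words = [0, 1, 2, 2]
tag_tag, word_tag = expected_counts(init, trans, emit, words)
```

The word-tag counts sum to the sentence length and the tag-tag counts to the number of adjacent positions, which is a convenient sanity check before handing the counts to the M-step optimizer.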

Experimental Setup
Dataset. We evaluate our method on Version 1.2 of the Universal Dependencies Treebanks (Nivre et al., 2016; McDonald et al., 2013). We use English as the source language and six other languages as targets. Specifically, we choose three Indo-European languages: Danish (da), German (de), Spanish (es), and three non-Indo-European languages: Finnish (fi), Hungarian (hu), Indonesian (id). All treebanks are annotated with the same universal POS tagset. In our work, we map proper nouns to nouns and map symbol marks and interjections to a catch-all tag X because it is hard and unnecessary to disambiguate them in a low-resource learning scenario. After mapping, our tagset includes the following 14 tags: noun, verb, auxiliary verb, adjective, adverb, pronoun, determiner, adposition, numeral, conjunction, sentence conjunction, particle, punctuation mark, and a catch-all tag X. Note that this universal tagset contains two more tags than the traditional universal tagset: auxiliary verb and sentence conjunction. We follow the standard split of the treebanks for every language. For each target language, we use the sentences in the training set as unlabeled data, and evaluate on the test set.
Word Embeddings. To induce monolingual word embeddings, we use the processed Wikipedia text dumps (Al-Rfou et al., 2013) for each language. While Wikipedia texts may contain parallel articles, we show in Table 1 that the amount of text varies significantly across languages. Smith et al. (2010) also demonstrated that parallel information in Wikipedia is very noisy. Therefore, direct translations are difficult to get from these texts. We use the word2vec tool with the skip-gram learning scheme (Mikolov et al., 2013a). In our experiments we use d = 20 for the dimension of word embeddings and w = 1 for the context window size of the skip-gram, which yields the best overall performance for our model. In our analysis, we also explore the impact of embedding dimension and window size.

Word Translation Pairs
For each target language, we collect English translations for the top ten most frequent words in the training corpus. Our preliminary experiments show that this selection method performs the best. The selected words are typically from closed classes, such as punctuation marks, determiners and prepositions. We find translations using Wiktionary.
Model Variants. Our model varies along two dimensions. On one dimension, we use two different methods for inducing multilingual word embeddings: Pseudoinverse and Isometric alignment as described in Section 3.1. On the other dimension, we experiment with two different multilingual transfer models. We use Direct Transfer to denote our direct transfer model, and Transfer+EM for our unsupervised model trained in the target language.
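The selection of anchor words for the translation pairs (the most frequent words in the target-language training corpus) can be sketched as follows. The function name and the tiny German corpus are hypothetical, and the translation lookup itself, done manually via Wiktionary in our setting, is not shown.

```python
from collections import Counter

def most_frequent_words(sentences, k=10):
    """Return the k most frequent tokens in a tokenized corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [word for word, _ in counts.most_common(k)]

# Hypothetical toy corpus (already tokenized).
corpus = [
    ["der", "Hund", "schläft", "."],
    ["der", "Mann", "sieht", "den", "Hund", "."],
]
anchors = most_frequent_words(corpus, k=3)
```

On a real corpus, the resulting list would be dominated by closed-class items such as punctuation marks and determiners, in line with the observation above.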
Baselines. We also compare against the prototype-driven method of Haghighi and Klein (2006). Specifically, we use the publicly available implementation provided by the authors. Note that their model requires at least one prototype for each POS category. Therefore, we select 14 prototypes (the most frequent word from each category) for the baseline, while our method only uses ten translation pairs.
Evaluation. Unlike other unsupervised methods, all models in our experiments can identify the label for each POS tag because of knowledge from either the source language or the prototypes. Therefore, we directly report the token-level POS accuracy for all experiments.
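Token-level accuracy is simply the fraction of tokens whose predicted tag matches the gold tag, pooled over all sentences. A minimal sketch, with a hypothetical function name and toy tag sequences:

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens with a correct tag, pooled over all sentences."""
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
        total += len(gold_tags)
    return correct / total if total else 0.0

# Hypothetical toy example: 4 of 5 tokens are tagged correctly.
gold = [["noun", "verb"], ["det", "noun", "punct"]]
pred = [["noun", "verb"], ["det", "verb", "punct"]]
accuracy = token_accuracy(gold, pred)
```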
Other Details. For all experiments, we use the following regularization weights: γ = 0.001 for supervised models learned on the source language and β = 0.01 for unsupervised models learned on the target language. During training, we also normalize the log-likelihood of labeled or unlabeled data by the total number of tokens. As a result, the magnitude of the objective value is independent of the corpus size, so we do not need to tune the regularization weight for each target language. We run ten iterations of the EM algorithm.

Results
In this section, we first show the main comparison between the tagging performance of our model and the baselines. In addition, we include an experiment on typology prediction. In Section 5.2, we provide a more detailed analysis of model properties.
Table 2: Token-level POS tagging accuracy (%) for different variants of our transfer model. We always use English as the source language. Target languages include Danish (da), German (de), Spanish (es), Finnish (fi), Hungarian (hu) and Indonesian (id).

Main Results
In Table 2, we average the results separately for Indo-European and non-Indo-European languages. The first row shows the performance of the prototype-driven baseline (Haghighi and Klein, 2006). The remaining rows show results of our model when multilingual embeddings are induced with the pseudoinverse or isometric alignment method. "Direct Transfer" and "Transfer+EM" indicate our direct transfer model and our transfer model trained in the target language, respectively.
Our model trained on the target language (Transfer+EM model) consistently improves over the direct transfer counterpart. As the bottom part of Table 2 shows, running EM on unlabeled data yields an average absolute gain of 12% on Indo-European languages, while on non-Indo-European languages the gain is only 4.4%.

Impact of the Isometric Alignment Constraint
As Table 2 shows, when we use Transfer+EM models, the isometric alignment method yields a 4.5% improvement over the pseudoinverse method (72.9% vs. 68.4%) on Indo-European languages. However, the improvement margin drops to 0.3% on non-Indo-European languages (62.1% vs. 61.8%). We hypothesize that this discrepancy is due to the difference in the degree of ambiguity of the anchor words across languages. For example, the anchor words of Spanish have an average of 1.5 possible translations to English, while for Indonesian the average ambiguity is 2.7. Therefore, the isometric assumption holds better and the EM algorithm finds a better local optimum for Indo-European languages than for non-Indo-European languages. We also observe a similar pattern in the direct transfer scenario.

Prediction of Linguistic Typology
To assess the quality of automatically generated tags, we use them to determine typological properties of the target language. We predict values of the following five typological properties for each language: subject-verb, verb-object, adjective-noun, adposition-noun and demonstrative-noun. More specifically, the goal is to predict word ordering preferences such as whether an adjective comes before a noun (as in English) or after a noun (as in Spanish). We collect the true ordering preferences from "The World Atlas of Language Structures (WALS)" (Dryer et al., 2005). To make predictions, we train a multiclass support vector machine (SVM) classifier (Tsochantaridis et al., 2004) on a multilingual corpus using bigrams and trigrams of POS tags as features. The training data for the SVM comes from a combination of the Universal Dependencies Treebanks, CoNLL-X, and CoNLL-07 datasets (Buchholz and Marsi, 2006; Nilsson et al., 2007), excluding all sentences in the target language. We train one classifier for each typological property, and make predictions for each of the six target languages. For evaluation, we directly report the overall accuracy on all 30 test cases (six languages combined with five typological properties).
Table 3 shows the accuracy of predicting typological properties with different tagging models. "Gold" corresponds to the result with gold POS annotations and is an upper bound on the prediction accuracy. We observe that the typology prediction accuracy correlates with the tagging quality. With the output of our best model, we predict the correct values for 80% of the typological properties. This corresponds to a 50% error reduction relative to the prototype model.
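The bigram and trigram POS features fed to the typology classifier can be sketched as below. This only illustrates feature extraction: the normalization to relative frequencies is an assumption, the function name and toy tag sequences are hypothetical, and the SVM training itself is not shown.

```python
from collections import Counter

def pos_ngram_features(tagged_sentences, orders=(2, 3)):
    """Relative frequencies of POS-tag n-grams over a tagged corpus.

    Assumption: counts are normalized to sum to one so that corpora of
    different sizes yield comparable feature values.
    """
    counts = Counter()
    for tags in tagged_sentences:
        for n in orders:
            for i in range(len(tags) - n + 1):
                counts[tuple(tags[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in counts.items()}

# Hypothetical toy corpora with opposite adjective-noun order.
adj_first = [["det", "adj", "noun", "verb"], ["adj", "noun", "punct"]]
noun_first = [["det", "noun", "adj", "verb"], ["noun", "adj", "punct"]]
features_a = pos_ngram_features(adj_first)
features_b = pos_ngram_features(noun_first)
```

In the first toy corpus the bigram `("adj", "noun")` is frequent and `("noun", "adj")` is absent, while the second corpus shows the reverse pattern; this is the kind of signal a classifier can use to predict adjective-noun order.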

Analyses
Impact of the Amount of Supervision. Figure 2 shows the accuracy of the Direct Transfer and Transfer+EM models and of the prototype baseline with different amounts of supervision in German. Specifically, the x-axis is the number of translation pairs or prototypes used as supervision. The numbers with ten pairs or prototypes are the same as those in Table 2. We automatically extract more translation pairs using the Europarl parallel corpus (Koehn, 2005) and select pairs based on the word frequency in the target language. For the prototype model, we select the most frequent words as prototypes based on annotations in the training data, and guarantee that each POS category has at least one prototype. Note that the minimum number of prototypes used by the prototype model is 14.
One particularly interesting observation is that our model with ten pairs achieves performance equivalent to that of the prototype-driven method with 150 prototypes. In other words, multilingual transfer compensates for a 15-fold difference in the amount of supervision. We also observe that the prototype-driven model outperforms our model when a large amount of annotation is available. This can be explained by noise in the translations and by the limitations of the linear embedding mapping, both of which prevent POS tags from being preserved well across languages. When comparing between our models, Figure 2 shows that Transfer+EM consistently improves over Direct Transfer, and that the gains are more pronounced in the low-supervision scenario. This is not surprising because with more translation pairs we are able to induce higher quality multilingual embeddings, which is more beneficial to the direct transfer model.

Impact of Embedding Dimensions and Window Size
Figure 3 shows the average accuracy across six target languages with different embedding dimensions and context window sizes. First, we observe that a small window size w = 1 consistently outperforms window size w = 5, suggesting that smaller window sizes produce word embeddings better suited to POS tagging. This observation is in line with the finding by Lin et al. (2015). Moreover, we obtain the best performance with dimension d = 20 when w = 1. On one hand, embeddings with smaller dimension (e.g., d = 10) carry too little syntactic information for good POS tagging. On the other hand, if the embedding space has larger dimension, the space will be more complex and mapping embedding spaces will be more difficult given only ten translation pairs. Therefore, we observe a performance drop with either smaller or larger dimensions.
Ablation Analysis on Features. In our Transfer+EM model, we add indicator features and transformation matrix M to enhance the emission distribution (see Section 3.3). To analyze their contribution, we remove these features in turn and report the results in Table 4. Averaged over all languages, adding indicator features improves the accuracy by 3.7%, and adding a transformation matrix increases the accuracy by 2.8%.

Conclusions
In this paper, we demonstrate that ten translation pairs suffice for an effective multilingual transfer of POS tagging. Experimental results show that our model significantly outperforms the direct transfer method and the prototype baseline. The effectiveness of our approach suggests its potential application to a broader range of NLP tasks that require word-level multilingual transfer, such as multilingual parsing and machine translation.