Simple task-specific bilingual word embeddings

We introduce a simple wrapper method that uses off-the-shelf word embedding algorithms to learn task-specific bilingual word embeddings. We use a small dictionary of easily obtainable task-specific word equivalence classes to produce mixed context-target pairs, on which we train off-the-shelf embedding models. Our model has the advantage that it (a) is independent of the choice of embedding algorithm, (b) does not require parallel data, and (c) can be adapted to specific tasks by redefining the equivalence classes. We show that our method outperforms off-the-shelf bilingual embeddings on the task of unsupervised cross-language part-of-speech (POS) tagging, as well as on the task of semi-supervised cross-language super sense (SuS) tagging.


Introduction
Using multi-layered neural networks to learn word embeddings has become standard in NLP (Turian et al., 2010; Guo et al., 2014). While there is still some controversy over whether such methods are superior to older methods (Levy and Goldberg, 2014; Baroni et al., 2014), there is little doubt that continuous word representations can potentially solve some of the data sparsity problems inherent in NLP.
Most research on word embeddings has focused on learning representations for the words in a single language, making syntactically or semantically similar words appear close in the embedding space. Embeddings have been applied to many tasks, from named entity recognition (Turian et al., 2010) to dependency parsing (Bansal et al., 2014). It has furthermore been shown that weakly supervised embedding algorithms can also lead to huge improvements for tasks like sentiment analysis (Tang et al., 2014). In this work, we also use weak or distant supervision, relying on small dictionary seeds. This paper, however, considers the problem of learning bilingual word embeddings, i.e., word embeddings such that similar words in two different languages end up close in the embedding space. Such bilingual word embeddings can potentially be used for better cross-language transfer of NLP models, as we show in this paper. Previous work on bilingual word embeddings has defined similar words as translation equivalents and evaluated embeddings in the context of document classification tasks (Klementiev et al., 2012; Kocisky et al., 2014). In this paper, we present a simple wrapper method around existing monolingual word embedding algorithms that can learn task-specific bilingual embeddings, e.g., for POS tagging, named entity recognition, or sentiment analysis. Our algorithm is simpler than existing algorithms and performs better on the tasks where we could compare against them. We also note that our approach, unlike existing algorithms (Klementiev et al., 2012), is as fast as learning monolingual embeddings.
* The authors contributed equally to this work. The second author is funded by the ERC Starting Grant LOWLANDS No. 313695.
Our contributions. In this paper we introduce a new approach for learning bilingual word embeddings and revisit the task of unsupervised cross-language POS tagging. Our bilingual embedding model, which we call Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives (BARISTA), takes two (non-parallel) corpora and a small dictionary as input. The dictionary is essentially a list of words in the two languages that are equivalent with respect to some task, e.g., English car and French maison ('house') are both nouns, and hence "equivalent" in POS tagging; English clerk and chauffeur are both persons, and hence "equivalent" in SuS tagging; house and maison are equivalent in machine translation. BARISTA has the advantage that it (a) is independent of the choice of embedding algorithm, (b) does not require parallel data, and (c) can be adapted to specific tasks by using appropriate dictionaries. We use the bilingual embeddings, in place of lexical features, to train a target-language POS tagger directly on source-language training data. We show that our bilingual embedding method outperforms off-the-shelf bilingual embeddings on this task, and that our system is competitive with state-of-the-art approaches for cross-language POS tagging. Finally, we show that the same embeddings also lead to significantly better performance in semi-supervised cross-language SuS tagging. The code will be made publicly available at https://github.com/gouwsmeister/barista.

Our approach
Standard monolingual neural language models are unsupervised models that train on raw text, learning word features that enable the model to predict the next word (the target) from a sequence of words (the context). In the process, the model learns to cluster words into soft equivalence classes (words that have similar distributions).
Several authors have proposed bilingual clustering and embedding algorithms based on parallel data (Täckström et al., 2012; Klementiev et al., 2012; Zou et al., 2013; Kocisky et al., 2014; Hermann and Blunsom, 2014b). These authors have all evaluated their embeddings on document classification and machine translation, but not yet on structured prediction tasks like POS/SuS tagging or syntactic parsing. A notable exception is Hermann and Blunsom (2014a), who do not rely on parallel data and do not use word alignments; however, they still use comparable data and sentence alignments, and they only evaluate their embeddings in document classification.
The assumption that large amounts of parallel data exist for a language pair of interest is sometimes too strong (Hermann and Blunsom, 2014a).
On the other hand, we often have access to small samples of near-equivalences from knowledge bases of various forms. For example, for POS tagging, we often have access to small-to-sizeable crowdsourced tag dictionaries (e.g. Wiktionary). For SuS tagging, which is the other example considered in this paper, we sometimes have access to WordNets or similar resources. If we have such resources for both source and target language, we can extract word-equivalences from them and use these to learn bilingual embeddings using our proposed method.
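To make the extraction step concrete, here is a minimal sketch of how bilingual POS equivalences might be derived from two tag dictionaries. It assumes each dictionary maps a word to a set of universal POS tags and treats two words as equivalent when they share at least one tag; the function name `pos_equivalences` and the data layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def pos_equivalences(src_tags, tgt_tags):
    """Return (source_word, target_word) pairs that share at least one tag.

    src_tags / tgt_tags: dict mapping a word to a set of POS tags
    (hypothetical layout; real tag dictionaries need parsing first).
    """
    # Index target words by tag so each source word is matched in one lookup.
    by_tag = defaultdict(set)
    for word, tags in tgt_tags.items():
        for tag in tags:
            by_tag[tag].add(word)

    pairs = set()
    for word, tags in src_tags.items():
        for tag in tags:
            for other in by_tag.get(tag, ()):
                pairs.add((word, other))
    return pairs


if __name__ == "__main__":
    en = {"car": {"NOUN"}, "build": {"VERB", "NOUN"}}
    fr = {"maison": {"NOUN"}, "construire": {"VERB"}}
    for pair in sorted(pos_equivalences(en, fr)):
        print(pair)
```

Because every word is paired with every target word that shares a tag, the resulting set grows quickly with dictionary size; in practice one would probably restrict it to a small seed, as the text describes.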
In this paper, we experiment with using both equivalences based on word alignments (e.g., house ∼ maison), and equivalences based on knowledge bases (e.g., car ∼ maison). Crucially, our approach to learning bilingual embeddings only assumes a small seed of equivalences, no parallel data. It then uses these to produce a set of mixed context-target pairs.
Our input is a source corpus $C_s$ and a target corpus $C_t$, as well as a set of bilingual equivalences $R_\sim$. We begin by shuffling the concatenation of $C_s$ and $C_t$. We then pass over this mixed corpus, and for each word $w$, if $\{w' \mid \langle w, w' \rangle \in R_\sim\}$ is non-empty and of cardinality $k$, i.e., $w$ is in the seed list of equivalences, we replace $w$ with $w'$ with probability $1/(2k)$. In other words, we flip a coin to decide whether to replace $w$, and then randomly choose one of its equivalences as our replacement. For example, using translation equivalence classes, one could generate any of the following mixed texts from the English sentence build the house: construire the house, build la maison, build the maison, etc., or any other combination of English and French words with the English word order. With POS equivalence classes, any of the words in build the house can be replaced with words with overlapping syntactic categories, e.g., build the voiture.
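The mixing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: corpora are lists of tokenized sentences, shuffling happens at the sentence level (the text does not specify the unit), and equivalences are treated as symmetric. The names `build_equivalents` and `mix_corpus` are hypothetical and not taken from the released code.

```python
import random
from collections import defaultdict


def build_equivalents(pairs):
    """Map each word to the list of words it is equivalent to.

    Assumption: equivalence is symmetric, so both directions are stored.
    """
    eq = defaultdict(list)
    for a, b in pairs:
        if b not in eq[a]:
            eq[a].append(b)
        if a not in eq[b]:
            eq[b].append(a)
    return dict(eq)


def mix_corpus(source_sents, target_sents, equivalents, rng=None):
    """Shuffle the concatenated corpora and stochastically swap in equivalents."""
    if rng is None:
        rng = random.Random()
    sents = [list(s) for s in source_sents] + [list(s) for s in target_sents]
    rng.shuffle(sents)  # assumption: shuffle whole sentences

    mixed = []
    for sent in sents:
        out = []
        for w in sent:
            candidates = equivalents.get(w)
            # Coin flip (1/2), then a uniform choice among the k equivalents,
            # so each particular replacement has probability 1/(2k).
            if candidates and rng.random() < 0.5:
                out.append(rng.choice(candidates))
            else:
                out.append(w)
        mixed.append(out)
    return mixed


if __name__ == "__main__":
    eq = build_equivalents(
        [("house", "maison"), ("build", "construire"), ("the", "la")]
    )
    english = [["build", "the", "house"]]
    french = [["la", "maison"]]
    for sent in mix_corpus(english, french, eq, random.Random(0)):
        print(" ".join(sent))
```

The mixed sentences can then be written out as ordinary text and handed to any monolingual embedding trainer, which is what makes the method independent of the embedding algorithm.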

Experiments
In our experiments we balance the source and target corpora by subsampling from the bigger corpus. The vocabularies for all models are kept unrestricted, resulting in around 1M words per language pair. We train with a window of 4 words on either side of the target word, using linear discounting of the initial learning rate of 0.1. These parameters were set on the Spanish POS data (see §3.2). Both our POS tagging evaluation datasets and the Wiktionaries² are mapped to Google's universal tagset. We use the derived dictionaries for extracting bilingual POS equivalence classes. For SuS tagging, we only consider English-Danish and extract equivalence classes from Princeton WordNet and DanNet. For translation equivalents, we made use of Google Translate.
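The training setup above can be illustrated with a short sketch. It assumes corpora are balanced by sentence count, that "linear discounting" means the learning rate decays linearly with training progress, and that a small floor keeps the rate positive; the floor value and all function names are assumptions made for illustration.

```python
import random


def balance(corpus_a, corpus_b, rng=None):
    """Subsample the larger corpus so both have the same number of sentences."""
    if rng is None:
        rng = random.Random()
    n = min(len(corpus_a), len(corpus_b))

    def shrink(corpus):
        return rng.sample(corpus, n) if len(corpus) > n else list(corpus)

    return shrink(corpus_a), shrink(corpus_b)


def context_pairs(sentence, window=4):
    """Yield (context, target) pairs with up to `window` words on either side."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield left + right, target


def linear_lr(step, total_steps, lr0=0.1, floor=1e-4):
    """Linearly decay the learning rate from lr0, never below lr0 * floor."""
    return max(lr0 * (1.0 - step / total_steps), lr0 * floor)


if __name__ == "__main__":
    a, b = balance([["x"]] * 10, [["y"]] * 3, random.Random(0))
    print(len(a), len(b))
    print(list(context_pairs(["build", "the", "house"], window=1)))
    print(linear_lr(0, 100), linear_lr(50, 100))
```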

Qualitative Evaluation
In our main experiments (§3.2-3.3), we use data from Wikipedia, but we first present a qualitative evaluation of English-German bilingual embeddings learned from the smaller Europarl corpus.³ Note that while this is parallel data, we do not exploit its parallel nature.
POS classes. The embeddings learned from English-German Europarl using POS classes from Wiktionary were visualized using the t-SNE technique (Van der Maaten and Hinton, 2008), and are shown in Fig. 1. The model learns to cluster words very distinctly by their POS tag. Monolingual word embedding models with short context windows typically cluster by POS, but here we see the same effect for bilingual embeddings, i.e., words from both languages with the same POS tag cluster together. Individual words, on the other hand, do not appear to retain the fine-grained relationships we normally observe in word embeddings (where similar words cluster closer together). The question thus is whether such embeddings are more or less useful for cross-language POS tagging than bilingual embeddings based on translation equivalences.
Translation classes. Next, we induced English-German bilingual embeddings on Europarl using translations obtained from Google Translate. We derived a dictionary of the top 20k most frequent words in English, translated into German. The embeddings are shown in Fig. 2. The visualizations show that the models are able to extract very fine-grained bilingual relationships, and some clusters still correspond to POS.
¹ https://code.google.com/p/word2vec/
² https://code.google.com/p/wikily-supervised-pos-tagger/
³ http://www.statmt.org/europarl/

Cross-language part-of-speech tagging
Next we evaluated the embeddings in the context of unsupervised cross-language POS tagging. The goal is to train a tagger on labeled English data decorated with bilingual embeddings, and then evaluate the model on another target language; in our case, we use SEARN (Daume et al., 2009), following Johannsen et al. (2014). We use data from Danish, German, Spanish, Italian, Dutch, Portuguese, and Swedish. Training and test data, which are the same used in , were converted to use the same 12 universal parts of speech proposed by .
All results in this section were obtained by training bilingual embedding models on the publicly available, pre-tokenized versions of Wikipedia, and then using the trained embeddings as features in a publicly available implementation of the structured perceptron. We use orthographic features, as well as the embedding vector of the target word. In addition, we use type constraints from Wiktionary to prune the search lattice during decoding (Täckström et al., 2013). We scaled the embedding features in the same way as Turian et al. (2010), with scaling parameter 0.01 (set on Spanish POS data). Our baseline method is a type-constrained structured perceptron with only orthographic features, which are expected to transfer across languages. We also experiment with using random embeddings, as well as the embeddings provided by Klementiev et al. (2012) (Klmtv).
Our results are displayed in Table 1. POS-X refers to X-dimensional BARISTA embeddings trained with POS equivalence classes, and Tr-X to X-dimensional BARISTA embeddings trained using translation equivalence classes. We note that random embeddings improve over our baseline, suggesting that the random features act as regularizers. The embeddings provided by Klementiev et al. (2012) seem to lead to worse performance than random embeddings, presumably because they capture mostly semantic (topic) similarity. We also compare our results to those reported by Berg-Kirkpatrick et al. (2010) (B-K) and Das and Petrov (2011) (DP), but note that their approaches require in-sample unlabeled data, and in the latter case, also parallel bilingual data.
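As an illustration of the feature scaling step, the sketch below rescales all embedding values so that their overall standard deviation equals the scaling parameter, which is our reading of the procedure in Turian et al. (2010). The dictionary layout and the name `scale_embeddings` are assumptions for this example only.

```python
import math


def scale_embeddings(embeddings, sigma=0.01):
    """Rescale so the standard deviation over all values equals sigma.

    embeddings: dict mapping a word to a list of floats (hypothetical layout).
    Computes E <- sigma * E / stddev(E) over the whole embedding matrix.
    """
    values = [x for vec in embeddings.values() for x in vec]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    if std == 0.0:
        # Degenerate case: all values identical, nothing to rescale.
        return {w: list(vec) for w, vec in embeddings.items()}
    return {w: [sigma * x / std for x in vec] for w, vec in embeddings.items()}


if __name__ == "__main__":
    emb = {"house": [1.0, -2.0, 0.5], "maison": [0.8, -1.5, 0.7]}
    scaled = scale_embeddings(emb, sigma=0.01)
    print(scaled["house"])
```

Keeping the embedding features on a small, fixed scale prevents them from dominating the sparse orthographic features during perceptron training.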
While training with POS classes improves over the random baseline, training with translation equivalence classes gives even better performance. For both approaches, using more embedding features improves the performance, although 500 dimensions did not improve significantly over 300. Our model is generally better than Berg-Kirkpatrick et al. (2010), but worse than Das and Petrov (2011).

Cross-language super sense tagging
Finally, we also tried using the BARISTA embeddings for English-Danish (with parameters still set on Spanish POS) on another task, namely super sense tagging (Ciaramita and Altun, 2006). We train a system on a mixture of 1000 randomly sampled sentences from English SemCor and 320 labeled Danish sentences (see below), and compare bilingual embeddings trained with equivalence classes from English and Danish wordnets (WN-300) to embeddings trained using translation equivalence classes (Tr-300). We use 300-dimensional embeddings. As baselines, we use a POS-sensitive most frequent sense baseline (MFS), as well as a structured perceptron model trained only with orthographic, POS, and MFS features (Johannsen et al., 2014). Our metric is a weighted average over $F_1$-scores for the (41) semantic classes. Note that using the knowledge base is superior to using translation equivalences, but both embeddings are superior to both our baselines. The Danish training data (newswire only) and test data (six different domains) are made publicly available.
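The evaluation metric can be sketched as a support-weighted average of per-class F1 scores. This is a simplified token-level version under the assumption that each class is weighted by its frequency in the gold data; the paper does not spell out the weighting or whether scoring is done over spans, so treat the function `weighted_f1` as illustrative only.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Average per-class F1, weighting each class by its gold frequency."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[g] += 1  # missed gold class g

    support = Counter(gold)
    total = sum(support.values())
    score = 0.0
    for label, count in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += (count / total) * f1
    return score


if __name__ == "__main__":
    gold = ["noun.person", "noun.person", "verb.motion", "verb.motion"]
    pred = ["noun.person", "verb.motion", "verb.motion", "verb.motion"]
    print(round(weighted_f1(gold, pred), 4))
```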

Conclusions
We presented a simple approach, BARISTA, to learning bilingual embeddings. BARISTA has the advantages that it (a) is independent of the choice of embedding algorithm, (b) does not require parallel data, and (c) can be adapted to specific tasks by using appropriate dictionaries. Our embeddings proved useful for cross-language POS/SuS tagging.