Pre-tokenization of Multi-word Expressions in Cross-lingual Word Embeddings

Cross-lingual word embedding (CWE) algorithms represent words in multiple languages in a unified vector space. Multi-word expressions (MWEs) are common in every language. When training word embeddings, each component word of an MWE gets its own separate embedding, and thus MWEs are not translated by CWEs. We propose a simple method for word translation of MWEs to and from English in ten languages: we first compile lists of MWEs in each language and then tokenize the MWEs as single tokens before training word embeddings. CWEs are trained on a word-translation task using dictionaries that contain only single words. To evaluate MWE translation, we created bilingual word lists from multilingual WordNet that include single-token words and MWEs, and, most importantly, include MWEs that correspond to single words in another language. We show that the pre-tokenization of MWEs as single tokens performs better than averaging the embeddings of the individual tokens of the MWE. We can translate MWEs at a top-10 precision of 30-60%. The tokenization of MWEs makes the occurrences of single words in a training corpus sparser, but we show that it has no negative impact on single-word translations.


Introduction
Cross-lingual word embeddings (CWEs) are real-valued vector representations of words in multiple languages placed in a shared vector space, with the intention that words with closer meanings have closer locations in the vector space. First, monolingual word embeddings are trained based on the hypothesis of distributional semantics (Harris, 1954) that context approximates meaning. They are learned from data so that words used in similar contexts have similar vectors. Following that, the monolingual word embeddings are aligned to produce CWEs. CWEs are an essential building block in modern cross-lingual methods and can also be used to induce bilingual lexicons from a small seed dictionary (Mikolov et al., 2013).
An important and overlooked fact is that before CWEs are trained, the corpus is pre-processed by a word tokenizer. This points to a clear limitation of state-of-the-art CWEs: they can only align words that happen to be treated as single tokens by the word tokenizer.
Multi-word expressions (MWEs) are combinations of orthographic words whose meaning, form, use, or distribution is non-compositional or unpredictable in some way (Sag et al., 2002; Baldwin and Kim, 2010). They come in diverse forms such as compound nouns (dance floor), named entities (United States), phrasal verbs (give up), and connectives (as well as). Word tokenizers do not recognize MWEs as single units but rather as a sequence of their components, a deficiency carried into CWE construction.
In this position paper, we argue that the token units of word embeddings should be discussed more carefully, and, in particular, that MWEs should be recognized as single units before training and evaluating word embeddings. In cross-lingual applications, MWEs are particularly important. A single token in one language is often translated into an MWE in another language. So, failure to tokenize MWEs is a critical flaw of CWEs in the task of word translation and presumably in other cross-lingual tasks as well.
Some studies (Iyyer et al., 2015; Shen et al., 2018) have suggested representing phrase and sentence embeddings by taking the average or sum of their component word vectors. However, such a simple approach is not sufficient, as the meaning of an MWE is often unpredictable from its components, as in red tape and hot dog. Instead, MWEs should be explicitly modeled during CWE training. To illustrate the advantage of having MWEs in the CWE vocabulary, we compare the alignments of English-Chinese CWEs with and without MWE tokens (Figure 1). Table 1 shows cosine similarities between English and Chinese embeddings of united and states. The numbers on the left side of each arrow (Single) show the cosine similarities between English and Chinese embeddings trained on corpora with standard single-word tokenization. As the English MWE United States is not in the vocabulary, we made an embedding for it by taking the average of the vectors of united and states. In contrast, we obtained the cosine similarities on the right-hand side of each arrow (+MWE) by combining United States into one token before training word embeddings.

Figure 1: The effect of MWE tokenization on cross-lingual alignments (Table 1). English word embeddings trained with single-word tokenization do not have united states in the vocabulary, and we represent its embedding by the average embedding. Word embeddings with MWE tokenization assign a unique embedding to united states, which is better aligned with its Chinese translation 美国. Note that the configuration of single-word embeddings also changes by having MWE embeddings.
With MWE-based tokenization, the single token united states aligns with 美国 (United States; meiguo) with a high cosine similarity of 0.82. The pre-tokenization of United States into a single token solves additional problems as well. When we treat United States as two separate tokens, we distort the embeddings of united and states. On the left sides of the arrows in Table 1, both united and states have a much higher cosine similarity to 美国 than to their correct translations. Also, united and states have a higher cosine similarity to each other than they should. Recognizing United States as one token before training word embeddings makes it possible to translate a single token to or from an MWE and ameliorates these alignments.

In this study, we employ a simple method to identify MWEs in corpora by using MWE dictionaries instead of automatic detection. Despite the rich body of work (Constant et al., 2017), including methods developed in specialized shared tasks (Schneider et al., 2014; Savary et al., 2017; Ramisch et al., 2018), automatic MWE detection is still a hard problem (Savary et al., 2019). Ramisch et al. (2012) tested several unsupervised discovery methods and reported that they performed poorly in terms of either precision or recall.
A lexicon-based approach to MWE detection comes with another advantage. Supervised methods for MWE detection require annotated texts (Constant et al., 2017), which may not be available for all languages. On the other hand, the high availability of lexical resources containing MWEs in many languages, such as Wiktionary and WordNet, makes a lexicon-based approach for MWE detection possible in many languages.
Our focus in this paper is not to study the automatic extraction of MWEs, but rather to establish that the tokenization of MWEs can contribute to improvements in CWEs. Since MWE lexicons exist for the languages we are interested in, we have used those for the time being. Of course, using automatically discovered MWEs would be an interesting direction for future research.
To explore the effect of pre-tokenization of MWEs, we evaluate CWEs in the task of word translation between English and ten languages, Arabic, Bulgarian, Chinese, German, Hebrew, Hindi, Japanese, Russian, Spanish, and Turkish, which span a wide typological variety. We find that our simple lexicon-based tokenization can align embeddings of MWEs at a precision@10 score of 30-60% without negative impacts on single-word translation. Furthermore, we find that some single-token words are correctly translated into MWEs, which are not attested in the common evaluation practice.

[1] The reason for the lowering of the cosine similarity between the English and Chinese embeddings of states (Table 1) is likely that the English word states is polysemous, while the Chinese word almost exclusively means regional states. After the pre-tokenization of MWEs, English states no longer appears as a component of united states, so its distribution becomes dissimilar to that of regional states.
In summary, we argue that CWE studies should consider MWEs in development and evaluation. MWEs are pervasive in many languages and should not be ignored when the alignment of words is discussed. We present a lexicon-based method to this end (§3-4) and show its effectiveness in the task of word translation (§5). We have created a new word translation dataset that contains MWEs (§3.2). The dataset is in ten language pairs and contains MWEs in addition to single orthographic tokens.

Related Work

Cross-lingual Word Embeddings
In this study, we experiment with one of the major approaches to learning CWEs, in which monolingual embeddings trained in each language are mapped using cross-lingual supervision. Early work by Mikolov et al. (2013) showed that a linear transformation of word embeddings across languages can be trained from a bilingual dictionary. Smith et al. (2017) reported that the linear mapping becomes more accurate and computationally efficient when an orthogonality constraint is placed on the transformation matrix. Recent studies (Artetxe et al., 2017; Zhang et al., 2017; Conneau et al., 2018) have further demonstrated that a transformation matrix can be learned from a very small number of seed translations and even without any supervision.
Another stream of studies on CWEs adopts a joint approach: word embeddings for multiple languages are trained at the same time using parallel corpora (Luong et al., 2015; Gouws et al., 2015). It is an interesting future direction to explore how MWEs affect the joint learning of CWEs.

The limitations of CWEs
Besides the problem of word units, several limitations of CWEs have been pointed out in the literature. The majority of such work focuses on the statistical characteristics of word embeddings rather than their linguistic nature. Some studies (Søgaard et al., 2018; Ormazabal et al., 2019) claim that the accuracy of cross-lingual alignments depends on the similarity of the word embedding spaces of different languages, and this similarity in turn depends on the similarity between the training corpora. Kementchedjhieva et al. (2019), illustrating an issue related to the evaluation of CWEs, argue that proper nouns constitute a quarter of the MUSE dataset, rendering it not ideal for word translation. Using a word translation task for the intrinsic evaluation of CWEs presupposes a correlation between word translation performance and the performance of CWEs in downstream tasks, which has been questioned by several studies. Ammar et al. (2016), Glavaš et al. (2019), and Fujinuma et al. (2019) show low correlation between word translation accuracy and the performance of downstream tasks such as document classification, natural language inference, and dependency parsing. A specific problem may be that underfitting to the training data in order to better handle unseen words in the test set hinders downstream tasks that rely on words from the training dictionary (Zhang et al., 2020). In this study, we primarily examine the transferability of MWEs in a word translation task, although it is possible that the better treatment of MWEs is also effective in downstream tasks.

Multi-word Expressions
MWEs have been studied in the context of syntactic analysis (Rosén et al., 2016; Kahane et al., 2017) and semantic analysis (Tratz and Hovy, 2010; Cordeiro et al., 2019). The discovery and identification of MWEs in corpora are important problems in this area (Sag et al., 2002), and much effort has been devoted to the development of methods (Constant et al., 2017) and annotated resources (Losnegaard et al., 2016). The Universal Dependencies (UD) project (Nivre et al., 2016) covers a wide range of languages but uses just a few dependency relations to annotate MWEs, namely fixed, flat, and compound. The DiMSUM shared task (Schneider et al., 2016) aims to detect English MWEs in texts. The PARSEME project (Savary et al., 2017; Ramisch et al., 2018) targets verbal MWEs and has constructed benchmark datasets in several, mostly European, languages for training automatic MWE taggers. However, such training resources are available only in a limited number of languages, and even with such resources, the automatic analysis of MWEs is known to be very difficult. Savary et al. (2019) argue for the importance of syntactic MWE lexicons for further development in this area.
Another line of work analyzes the interpretation of MWEs such as noun compounds (Tratz and Hovy, 2010). Some studies exploit word embeddings to build a classifier (e.g., Shwartz and Waterson, 2018). Several studies tokenize MWEs before training word embeddings (Baldwin et al., 2003;Salehi et al., 2015;Cordeiro et al., 2019). Although the major target of these studies is monolingual, our focus is on the cross-lingual mapping of MWEs by CWEs.

Data Creation
This section describes the methods we used for creating the data that we are releasing with this paper: (1) monolingual lists of MWEs in eleven languages for pre-tokenizing MWEs in corpora and (2) bilingual dictionaries (ten languages, each paired with English) for evaluating the resulting MWE embeddings in the word translation task. The languages are Arabic (ar), Bulgarian (bg), Chinese (zh), English (en), German (de), Hebrew (he), Hindi (hi), Japanese (ja), Russian (ru), Spanish (es), and Turkish (tr).

Monolingual MWE Lists for Pre-tokenization
For each of the eleven languages, we compiled a list of MWEs from the publicly available resources listed below. We examined each lexical unit in each resource and selected those with multiple tokens. We treat all lexical units that are divided into two or more tokens as MWEs in our study, assuming they are fixed semantic units in some way.

eomw: Entries of the Extended Open Multilingual Wordnet (EOMW; Bond and Foster, 2013) consist of a WordNet synset identifier, a language identifier, and a lexical unit in that language. EOMW includes all WordNet synsets and additional synsets drawn from Wiktionary and the Unicode Common Locale Data Repository. Most entries are nominals, but this resource also contains other types of MWEs like verbal phrases and connectives.
parseme: Parseme is a multilingual corpus in which verbal MWEs are annotated for the PARSEME shared task 1.1 (Ramisch et al., 2018). Types of verbal MWEs include light verb constructions (e.g., give a speech), verb-particle constructions (e.g., wake up), verbal idioms, etc. They can be commonly observed in many languages even though the category distributions vary from language to language.

Table 2 shows the sizes of our lexicons. Note that not all MWEs in our lists are included in our word embeddings, as some of them do not exist in our training corpora.
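To make the selection step concrete, the sketch below collects multi-token lexical units from EOMW-style entries. The tab-separated file name and exact column layout are illustrative assumptions, not the resource's actual distribution format.

```python
# Hypothetical sketch: collect multi-token lexical units from an
# EOMW-style tab-separated file with columns (synset_id, language, lexical_unit).
from collections import defaultdict

def load_mwe_lists(path):
    mwes = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3 or line.startswith("#"):
                continue
            synset_id, lang, lexunit = parts[:3]
            # Treat any lexical unit written as two or more tokens as an MWE.
            if len(lexunit.split()) >= 2:
                mwes[lang].add(lexunit.lower())
    return mwes

mwe_lists = load_mwe_lists("eomw_entries.tsv")  # assumed file name
print(len(mwe_lists.get("en", set())), "English MWEs collected")
```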

Bilingual Dictionaries for the Word Translation Task
Next, we built bilingual dictionaries that contain MWEs for each pair between English and the ten other languages. To the best of our knowledge, there is no public benchmark dataset including translations between MWEs. We again used EOMW, linking lexical units in different languages that share the same WordNet synset identifiers. We call the resulting bilingual dictionaries the EOMW-MWE BENCHMARK hereafter. In the EOMW-MWE benchmark, source words are all MWEs, while target words can be either single words or MWEs. We limited source words to MWEs to ensure that an MWE is always involved in each translation. The number of source words varies across language pairs. For example, zh-en has the largest number of source (zh) words, 4,813, while hi-en has 274. We report the number of source words in Table 4 (§5).
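A rough sketch of how such a dictionary can be assembled by joining entries on their synset identifiers; the entry tuples mirror the EOMW format described above, while the function name and interface are assumptions for illustration only.

```python
# Hypothetical sketch: build a bilingual dictionary from EOMW-style
# entries by pairing lexical units that share a synset identifier.
# Source-side entries are restricted to MWEs, as in the EOMW-MWE benchmark.
from collections import defaultdict

def build_dictionary(entries, src_lang, tgt_lang):
    """entries: iterable of (synset_id, lang, lexical_unit) tuples."""
    by_synset = defaultdict(lambda: defaultdict(set))
    for synset_id, lang, lexunit in entries:
        by_synset[synset_id][lang].add(lexunit.lower())
    pairs = set()
    for langs in by_synset.values():
        for src in langs.get(src_lang, ()):
            if len(src.split()) < 2:          # keep only MWE source words
                continue
            for tgt in langs.get(tgt_lang, ()):
                pairs.add((src, tgt))          # targets may be single words or MWEs
    return sorted(pairs)
```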

Annotation of MWE types
We annotated the 1.5k English MWEs in our bilingual dictionaries for the purpose of error analysis. We manually POS-tagged the English MWEs with the six tags adj (adjective phrases), adv (verbal and clausal adverbs), noun (noun phrases), prep (prepositional phrases), verb (verb phrases), and misc (anything else). We also classified the English MWEs into four categories: synphrase (s), proper-name (pn), compound (c), and flat+fixed+idiom (ffi). Below we list the definition and a prototypical example for each of the four categories.
synphrase (s) A semantically compositional multi-word entry from EOMW, e.g., cease to be.
proper-name (pn) An MWE that non-deictically refers to a unique or identifiable referent. Most of these are PER, LOC, GPE, or ORG in a simple NER annotation scheme, e.g., Pacific Ocean.
compound (c) We included noun-noun compounds as well as adjective-noun pairs, which are often hard to distinguish from noun-noun compounds, e.g., opera house, nuclear weapon. Most are syntactically endocentric (headed) and semantically endocentric (a hyponym of the head).

flat+fixed+idiom (ffi) An MWE that is one of the following: (1) a fixed grammaticalized expression that behaves like a function word or adverbial, e.g., that is to say; (2) a verbal idiom (e.g., let loose), verb-particle construction (e.g., hang up), or multi-verb construction (e.g., let go) as defined by PARSEME, as well as fixed collocation constructions like take a step and make a decision; (3) any other idiomatic MWE, e.g., bread and butter.
We defined our own categories rather than use an existing annotation scheme. Synphrase was necessary because our dataset contained certain MWEs, such as other side and cease to be, that are frequent enough to appear in an MWE lexicon but are semantically compositional. We gave proper names their own category (proper-name) because proper names are uniquely nouns, unlike other unheaded MWEs such as dates, complex numerals, and foreign phrases, which span a wide variety of POS.

Training CWEs: Components
This section describes our pipeline for training CWEs, including the following three steps (Figure 2): (1) identifying MWEs in a corpus, (2) training monolingual word embeddings, and (3) aligning embeddings across languages.

[5] Note that some MWEs have multiple possible parts-of-speech, for example, cross over (noun and verb).

Monolingual MWE Identification
We first prepare a monolingual corpus for training word embeddings for each of the eleven languages included in this study. We take a simple lexicon-based approach to combine MWEs into one token. Suppose we have the tokenized sentence below.
(1) freedom fries was a political euphemism for french fries in the united states .
Using an MWE lexicon which includes french fries and united states, we combine tokens with underscores and obtain the following sentence.
(2) freedom fries was a political euphemism for french_fries in the united_states .
With this approach we cannot identify MWEs that do not exist in the lexicon like freedom fries, but there is an advantage: we do not need an annotated corpus of MWEs. Such corpora are difficult to obtain in more than a few languages.
Based on the lexicons that we compiled for each language (§3.1), we tokenize MWEs in a corpus with mwetoolkit3 (Ramisch, 2015). To increase recall, we use lemmas for string matching. We do not consider discontinuous MWEs.
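The following is a minimal illustration of lexicon-based MWE tokenization with greedy longest match over surface forms. Our experiments use mwetoolkit3 with lemma-based matching, so this sketch is a simplification rather than the actual pipeline.

```python
# Illustrative sketch of lexicon-based MWE tokenization (greedy
# longest match over surface forms); lemmas and discontinuous MWEs
# are ignored here for simplicity.
def tokenize_mwes(tokens, mwe_lexicon, max_len=5):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in mwe_lexicon:
                out.append("_".join(tokens[i:i + n]))  # e.g. united_states
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

lexicon = {"french fries", "united states"}
sent = "freedom fries was a political euphemism for french fries in the united states .".split()
print(" ".join(tokenize_mwes(sent, lexicon)))
# freedom fries was a political euphemism for french_fries in the united_states .
```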

Monolingual Word Embeddings
We train monolingual embeddings on tokenized texts with off-the-shelf word embedding algorithms. We adopt fastText with CBOW (Bojanowski et al., 2017). MWEs processed in the previous step are treated as one token and given an individual vector. For example, french fries has a different vector from those of french and fries.
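As an illustration, monolingual training can be run with the fastText Python bindings using the hyperparameters listed in Appendix C; the corpus path below is a placeholder, and this is a sketch of the setting rather than our exact training script.

```python
# Sketch of monolingual CBOW training with the fastText Python bindings.
# MWEs joined with underscores are treated as single tokens and receive
# their own vectors; hyperparameters follow Appendix C.
import fasttext

model = fasttext.train_unsupervised(
    "wiki.en.mwe.tok.txt",  # pre-tokenized corpus with MWE tokens (assumed path)
    model="cbow",
    dim=300,
    minn=5,
    maxn=5,
    epoch=10,
)
print(model.get_word_vector("united_states").shape)  # (300,)
```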

Cross-lingual Mapping of Embeddings
Now we take two sets of word embeddings from two different languages and align the source embeddings to the target embeddings using an existing supervised method based on a bilingual dictionary. Suppose we have n pairs of source and target words. We denote the embeddings of those words X ∈ R^{n×d} and Y ∈ R^{n×d}, respectively, where d is the dimension of the embeddings. We learn a d × d matrix W so that XW is close to Y in terms of the Frobenius norm (Mikolov et al., 2013). Following prior work, we impose an orthogonality constraint on W, namely W^T W = I, as this constraint is known to improve the accuracy of word translation. We then refine W using the iterative bootstrapping method proposed by Conneau et al. (2018). Specifically, we produce pseudo translation pairs for training by retrieving nearest neighbors in terms of cross-domain similarity local scaling (CSLS). Finally, we translate all embeddings in the source language into the vector space of the target language by W.
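The supervised mapping step without the iterative refinement can be sketched as the closed-form solution of the orthogonal Procrustes problem; the snippet below uses random matrices only to show the shapes involved, and the CSLS-based refinement is omitted.

```python
# Sketch of the supervised mapping step: solve the orthogonal Procrustes
# problem min_W ||XW - Y||_F s.t. W^T W = I in closed form.
# X and Y hold the embeddings of the n seed translation pairs as rows.
import numpy as np

def orthogonal_procrustes(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # d x d orthogonal mapping

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # source-side seed embeddings (placeholder data)
Y = rng.normal(size=(1000, 300))   # target-side seed embeddings (placeholder data)
W = orthogonal_procrustes(X, Y)
print(np.allclose(W.T @ W, np.eye(300), atol=1e-6))  # True
X_mapped = X @ W                   # project source embeddings into the target space
```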

Experiments
To examine the effect of pre-tokenization of MWEs, we conduct the task of word translation between each of the ten languages and English, in both directions. A word embedding in a source language is projected into the embedding space of a target language using a trained linear mapping W (§4.3). The translation candidates of the source word are retrieved by k-nearest neighbor search in terms of CSLS. The performance is measured by top-k precision (Precision@k).[7] Our evaluation involves two tasks. In the first task, we focused on the translation of MWEs using our new evaluation dictionaries that contain tokenized MWEs (§3.2). In the second task, we evaluated the translation of single words on the existing benchmark, MUSE (Conneau et al., 2018), to investigate the influence of pre-tokenizing MWEs on single-word embeddings.

[7] We used an evaluation script provided with the MUSE dictionary.
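A sketch of CSLS retrieval and Precision@k scoring, assuming unit-normalized embedding matrices; for brevity, the target-side neighborhood term is computed against the query set only, which may differ from toolkit implementations.

```python
# Sketch of CSLS-based retrieval for word translation with row-normalized
# embeddings: src_emb holds mapped source vectors, tgt_emb the target vocabulary.
import numpy as np

def csls_topk(src_emb, tgt_emb, k_csls=10, topk=10):
    sims = src_emb @ tgt_emb.T                                    # cosine similarities
    # Mean similarity of each vector to its k nearest cross-lingual neighbors.
    r_src = np.sort(sims, axis=1)[:, -k_csls:].mean(axis=1)       # per source word
    r_tgt = np.sort(sims, axis=0)[-k_csls:, :].mean(axis=0)       # per target word
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return np.argsort(-csls, axis=1)[:, :topk]                    # top-k candidate indices

def precision_at_k(predictions, gold_sets, k=10):
    """predictions: candidate index lists per source word; gold_sets: sets of gold indices."""
    hits = sum(bool(set(p[:k]) & gold) for p, gold in zip(predictions, gold_sets))
    return hits / len(gold_sets)
```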

Corpora
We focus on the translation between en and ten languages: ar, bg, es, de, he, hi, ja, ru, tr, and zh. These languages represent both Indo-European and non-Indo-European languages with a wide variety of morphological features and have sufficient Wikipedia text for training embeddings. We report results using two Japanese segmentation schemes, IPADIC (Asahara and Matsumoto, 2000) and UniDic (Den et al., 2008). Both of these break Japanese utterances down into relatively small units, sometimes corresponding to morphemes. For this reason, the Japanese texts we trained on have fewer types than the other languages despite the fact that Japanese is highly agglutinative.
For training monolingual embeddings, we sampled 100M tokens for each language from the publicly available Wikipedia corpora (Ginter et al., 2017), which were automatically annotated with UDPipe. Table 3 shows the corpus statistics. We then used mwetoolkit3 to annotate MWEs. Note that the PARSEME dataset does not cover Arabic, Japanese, Russian, and Chinese.

We exclude MWEs from the target-side embeddings in Task 2, as the MUSE benchmark only contains single words. For the baseline in Task 1, MWEs are represented by the average of the embeddings of the individual words. We used larger vocabulary sizes (e.g., 300-600k) for the candidate set than the typical sizes in related studies (e.g., 200k). We describe the details of the implementation and hyperparameters in Appendices C and D.

Task 1: MWE Translation
As a baseline method, we tokenize the corpus without MWEs and represent the embedding of each MWE as the average of the single-word embeddings of its components. The baseline and our MWE embeddings were trained with the same single-word dictionaries. We report the results of the word translation task on the EOMW-MWE benchmark in Table 4.
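A minimal sketch of this averaging baseline, together with the cosine similarity used to compare vectors; word_vectors is assumed to be a plain dict of numpy arrays rather than our actual embedding objects.

```python
# Sketch of the baseline MWE representation: average the embeddings of
# the MWE's component words (out-of-vocabulary components are skipped).
import numpy as np

def average_mwe_vector(mwe, word_vectors):
    """mwe: underscore-joined MWE, e.g. 'united_states';
    word_vectors: dict mapping single words to numpy vectors."""
    parts = [word_vectors[w] for w in mwe.split("_") if w in word_vectors]
    return np.mean(parts, axis=0) if parts else None

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```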
Despite the absence of MWEs in training dictionaries, our CWEs aligned English MWEs with their correct translation with Precision@10 as high as 30-60%. Our method clearly outperforms the baseline method in most language pairs. This fact shows the importance of learning MWE embeddings directly from a corpus to establish cross-lingual alignments.
We broke down the English-L2 MWE translation results based on our annotated 1.5k English MWEs (§3) in Table 5. In terms of MWE types, compound (c) was the easiest category to translate (success rate of 60.22%), and flat+fixed+idiom (ffi), which includes various idiomatic expressions, was the hardest (25.52%). In terms of parts-of-speech, verbal MWEs were much more difficult to translate (21.01%) than nominal MWEs (48.06%). This is consistent with the observations of the PARSEME shared tasks on verbal MWE identification. Interestingly, the translation of adverbial MWEs was very accurate (40.3%). This may indicate that adverbial/adpositional phrases tend to be used in similar contexts (i.e., with words in specific semantic/grammatical classes) across languages.
In Table 8, we show some correct translations retrieved by nearest neighbor search. While stop words such as "in" and "a" are usually not aligned with significant words, the inclusion of these words in MWEs (e.g., in vain and a bit) establishes meaningful relationships across languages.

Task 2: Single Word Translation
Table 6 shows the results of single-word translation on the MUSE benchmark. We excluded MWEs from the embeddings in the target language, as the benchmark only contains single words. We were concerned that, keeping the amount of training data unchanged, the inclusion of MWEs might decrease single-word performance, since it makes the occurrences of single words sparser and might degrade the quality of the monolingual word embeddings. However, the differences in single-word translation performance were not statistically significant in the other language pairs.

Our method might align a single word in one language with an MWE in another language, which is not attested in the common evaluation practice. To examine this, we included MWE embeddings in the evaluation and observed the nearest neighbors. Interestingly, our method retrieved MWEs that are correct translations but absent from the MUSE dictionaries. In particular, we show characteristic examples from English-Japanese (IPADIC) translations in Table 7. The first example illustrates a common construction using -nin (person), which is segmented into two words. The benchmark tends to contain transcriptions of foreign words like shefu, as they are often single tokens. The second example shows verbalization, which is again segmented into noun + suru (do). These examples illustrate the limitation of evaluations restricted to single words, and may explain the difficulty of English-Japanese word translation reported in a previous study (Hoshen and Wolf, 2018).

Conclusion
We studied the impact of pre-tokenizing MWEs on cross-lingual alignments of word embeddings. We found that a simple lexicon-based tokenization can align embeddings of MWEs at high precision without breaking the alignments of single words. We believe our results will motivate researchers to pay more attention to the existence of MWEs and how they are aligned across languages.

A Automatic MWE Discovery
In this study, we compiled MWE lists from existing lexical resources. Although MWEs can also be harvested from corpora without relying on lexical resources, we found in our preliminary experiments that unsupervised methods cannot distinguish between MWEs and non-MWE phrases accurately. We tested word association measures based on word co-occurrences (Ramisch et al., 2012).
Method: Given tokenized texts, we extract and filter MWE candidates as follows:
1. We use syntactic patterns to extract candidates of MWEs. We define patterns based on part-of-speech (POS) tags.
2. We count the occurrences of MWE candidates and of their component words.
3. We calculate association scores for the components of each MWE candidate using Dice coefficients, PMI, and maximum likelihood estimates (Ramisch et al., 2012); a sketch of this scoring step follows the list.
4. We filter MWEs by setting a threshold on the association score.
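As referenced in Step 3, the following is a simplified sketch of candidate extraction and association scoring for bigrams; the ADJ+NOUN and NOUN+NOUN patterns are illustrative assumptions rather than the exact patterns we used, and longer candidates and the remaining measures follow Ramisch et al. (2012).

```python
# Illustrative sketch of Steps 1-3 over a POS-tagged corpus: extract
# candidate bigrams with simple (assumed) POS patterns and score them
# with Dice and an approximate PMI.
import math
from collections import Counter

PATTERNS = {("ADJ", "NOUN"), ("NOUN", "NOUN")}  # assumed example patterns

def score_candidates(tagged_tokens):
    """tagged_tokens: list of (word, upos) pairs for one corpus."""
    words = [w for w, _ in tagged_tokens]
    unigrams = Counter(words)
    n = len(words)
    candidates = Counter(
        (w1, w2)
        for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:])
        if (p1, p2) in PATTERNS
    )
    scores = {}
    for (w1, w2), c12 in candidates.items():
        dice = 2 * c12 / (unigrams[w1] + unigrams[w2])
        # PMI with token count n used as an approximation of the bigram count.
        pmi = math.log((c12 / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scores[(w1, w2)] = {"dice": dice, "pmi": pmi}
    return scores
```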
Ideally, the real MWEs have higher scores, and non-MWE phrases have lower scores. Examining this, however, is not easy: it is very expensive to manually check all the candidates from Step 1. So, in our experiments, we aimed to obtain a rough estimate using the MWE lists we compiled. The phrases in our lists are true positives and should be assigned high association scores. Figure 3 shows the results. The horizontal axis denotes the Dice coefficients calculated in Step 3, and the vertical axis shows the number of MWEs in each bin of Dice coefficients. The orange bars show the number of MWEs that exist in our eomw+parseme lexicon, which are true positives. This result gives us two important implications.
1. The Dice coefficients are not indicative of MWE-ness. There are many true MWEs among the candidates with very low association scores. For example, the Dice coefficient of french fry was only 0.000173.
2. The distribution of the scores is highly skewed, and it is difficult to set a threshold. If we set a low threshold, the results contain many false MWEs, and if we set a high threshold, we obtain only a few MWEs.
We observed very similar results in association measures other than the Dice coefficients.

B Corpus Preprocessing
We trained word embeddings on sentences collected and tokenized following UD version 2 (Ginter et al., 2017). We lowercased texts, as the tokens in the MUSE dictionaries are all lowercase. We used OpenCC to convert Chinese texts into simplified characters. For Japanese, we tokenized the plain texts provided with the tokenized Wikipedia dump using MeCab with IPADIC. We then sampled sentences totaling 100M tokens, or used the full texts when less data was available. We used GNU Parallel (Tange, 2018) to speed up the preprocessing.

C Monolingual Word Embeddings
We trained CBOW fastText models of 300 dimensions with the parameters suggested by Grave et al. (2018). Specifically, we set the hyperparameters as follows:
• Dimension of word embeddings (dim): 300
• Minimum length of character n-grams (minn): 5
• Maximum length of character n-grams (maxn): 5
• Number of epochs (epoch): 10
• All other parameters were set to the default values of the fastText software v0.9.1.
Table 9 shows the vocabulary sizes of the monolingual word embeddings. Note that the vocabulary sizes of Single are smaller than the word type counts listed in Table 3, as we follow the default hyperparameters and set the minimum number of word occurrences for assigning a word embedding to 5.

D Cross-lingual Word Embeddings
We used the supervised algorithms implemented in the MUSE library and the VecMap library to align monolingual embeddings.

Table 9: Vocabulary sizes of word embedding models. We report the number of MWEs (the left-hand side of each slash) for the MWE tokenization.

MUSE:
We normalized word embeddings into unit vectors before training. We set the number of refinements to 1, as most of the bootstrapped word pairs were found in the first iteration.

VecMap: We followed the hyperparameter settings used by Artetxe et al. (2018).

Table 10 and Table 11 show the results of Task 1 and Task 2 with supervised VecMap, respectively. The precision scores are slightly better than those of the supervised alignment with iterative refinement by Conneau et al. (2018), but the overall tendency is very similar to the results in Section 5. Table 12 shows the results of Task 2 broken down based on the categorization made by Kementchedjhieva et al. (2019). In some languages, the pre-tokenization of MWEs improved the translation accuracy of adjectives, nouns, and verbs (en-de (eomw), en-hi (eomw), es-en, hi-en (eomw)), but it did not in other languages. Overall, there is no clear, interpretable tendency in these results. The inclusion of MWEs in the vocabulary increased the performance of MWE translation without a negative impact on single-word translations.

E Experimental Results
To analyze the statistical significance of the results, we used BOOTS and conducted pairwise bootstrapping tests with 1,000 trials.
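The sketch below illustrates the general idea of such a paired bootstrap comparison over per-word correctness indicators; it is not the BOOTS tool itself, and the function interface is an assumption.

```python
# Sketch of a paired bootstrap test over per-word correctness indicators
# (1 = correct in top-10, 0 = otherwise) with 1,000 resampling trials.
import numpy as np

def paired_bootstrap(sys_a, sys_b, trials=1000, seed=0):
    """sys_a, sys_b: arrays of 0/1 outcomes on the same test words.
    Returns the fraction of trials in which system A does not beat system B."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(sys_a), np.asarray(sys_b)
    n, wins = len(a), 0
    for _ in range(trials):
        idx = rng.integers(0, n, n)        # resample test words with replacement
        wins += a[idx].mean() > b[idx].mean()
    return 1.0 - wins / trials             # rough p-value-style estimate
```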