Morphologically Aware Word-Level Translation

We propose a novel morphologically aware probability model for bilingual lexicon induction, which jointly models lexeme translation and inflectional morphology in a structured way. Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning, while inflectional morphology provides additional syntactic information. This approach leads to substantial performance improvements—19% average improvement in accuracy across 6 language pairs over the state of the art in the supervised setting and 16% in the weakly supervised setting. As another contribution, we highlight issues associated with modern BLI that stem from ignoring inflectional morphology, and propose three suggestions for improving the task.


Introduction
The ability to generalize to rare and unseen morphological variants of a known word lies at the heart of translation. For instance, a capable human Spanish-English translator would find translating the exceptionally rare form tosed (2nd person plural imperative of 'to cough') as straightforward as translating the infinitive toser, despite the fact that tosed is so infrequent that many native Spanish speakers may never have encountered the form themselves. Given how basic this generalization ability is for humans, one should expect a good bilingual lexicon inducer to exhibit a similar capacity to generalize. In other words, the model should translate infrequent, regular forms of a lexeme as accurately as it translates the lexeme's most common forms.
Nevertheless, current approaches to bilingual lexicon induction (BLI) fall short of this desideratum. In a recent study, Czarnowska et al. (2019) reveal that the performance of state-of-the-art bilingual lexicon inducers degrades severely when translating less frequent inflected forms, even for the most common lexemes. The problem is severe: in the case of inducing a French-Spanish bilingual lexicon, one of the models they study correctly translates infinitives 50.6% of the time, but correctly translates 2nd person plural imperative forms only 1.5% of the time. Motivated by this disparity, this work introduces a novel morphologically aware probability model for BLI that jointly models lexeme translation and inflectional morphology in a structured way. Our model exploits the basic linguistic intuition that the lexeme is the core lexical unit of meaning, while inflectional morphology provides additional syntactic information on top of it (Haspelmath and Sims, 2013). It follows that we should ignore this syntactic information when translating at the word level and handle morphological inflection with a different component of the model.
The empirical portion of our paper describes experiments on French, Italian and Spanish. We find that our joint model substantially improves over several strong baselines on the BLI task. When evaluating on held-out lexemes, 1 we observe an average performance improvement of 19% and 16% over the previous state of the art (Artetxe et al., 2018b) in the supervised and weakly supervised settings, respectively. In addition, we propose a simple heuristic to further boost performance: Inspired by the dual-route hypothesis (Pinker and Prince, 1994; Pinker, 1998) and the network model of morphological processing (Bybee, 1985; Bybee, 1995b), we translate high-frequency forms, which are most likely to exhibit irregularity, directly (without going through the lexeme) and reserve our morphologically aware model for low-frequency forms. This heuristic gives us a further 2% improvement.

Morphological Inflection
Morphological inflection is the systematic alteration of the word form that adds specific morpho-syntactic information to the lexeme, e.g. tense, case and number. English is weakly inflected with only a few forms per lexeme, and, in that respect, it differs from many other languages. A higher degree of inflection is indeed the norm among the world's languages (Dryer and Haspelmath, 2013). In this work, we distinguish the terms lexemes, inflected forms and lemmata. A lexeme is an abstract concept that represents the core meaning shared by a set of inflected forms. An inflected form is an individual morphological variant that belongs to a given lexeme. As an example, the lexeme RUN has the inflected forms run, runs, ran, and running. The lemma, also called the citation form, is an inflected form that lexicographers have chosen to be representative of the lexeme. For example, the lexeme RUN's lemma is run. In many languages, the infinitive is the verbal lemma and the nominative singular is the nominal lemma. We consider a lexicon of a language to be a set of inflected forms. 2

Bilingual Lexicon Induction
In the NLP literature, the BLI task is to translate a given list of source-side word forms into the most appropriate corresponding target-side word forms. It dates back to the 1990s. The first data-driven experiments on parallel corpora made use of word-alignment techniques (Brown et al., 1990; Kupiec, 1993; Smadja and McKeown, 1994). Such approaches were later extended to operate on non-parallel or even unrelated texts by leveraging the correlation between word co-occurrence patterns in different languages (Rapp, 1995; Fung and Lo, 1998; Fung, 1998; Koehn and Knight, 2002). Apart from the distributional signal, the early approaches make use of other monolingual clues, e.g. word spelling, cognates or word frequency. 3 More recent approaches leverage the distributional signal in word embeddings without any explicit linguistic clues. Many current models learn a linear transformation between two monolingual word embedding spaces (Mikolov et al., 2013), often guided by an initial set of seed translations. This seed dictionary frequently spans several thousand word pairs (Mikolov et al., 2013; Xing et al., 2015; Artetxe et al., 2016), but one can also provide weaker supervision by listing only identical strings or shared numerals (Artetxe et al., 2017). For unsupervised BLI, the initial translations may also be induced automatically by exploiting the structure of the monolingual embedding spaces (Zhang et al., 2017; Conneau et al., 2018; Artetxe et al., 2018b). We focus on supervised and weakly supervised BLI, which outperform unsupervised approaches. BLI models are typically evaluated using the precision@k metric, which measures how often the correct translation of a source form is among the k-best candidates returned by the model. In this work we exclusively consider the precision@1 metric, which is the least forgiving.
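The precision@k evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code; the word pairs and candidate lists are hypothetical.

```python
def precision_at_k(predictions, gold, k=1):
    """Fraction of source words whose gold translation set intersects
    the model's top-k candidate list."""
    hits = 0
    for src, candidates in predictions.items():
        if any(c in gold[src] for c in candidates[:k]):
            hits += 1
    return hits / len(predictions)

# Toy French->Spanish example with made-up candidate lists:
preds = {"chien": ["perro", "can"], "chat": ["gata", "gato"]}
gold = {"chien": {"perro"}, "chat": {"gato"}}
print(precision_at_k(preds, gold, k=1))  # 0.5: 'chat' misses at k=1
```

At k=2 the second example is recovered, which is why precision@1 is the least forgiving variant of the metric.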

Morphological Inflection: A Challenge for BLI
Most datasets for BLI operate at the level of inflected forms and impose no restriction on the morpho-syntactic category of translated words. From a lexicographer's standpoint, this choice is unusual: dictionaries generally list only lemmata and rarely include inflected forms. Thus, BLI is not the task of inducing a dictionary in the strict sense of the word. We have found this to be a common misconception in the literature and among NLP researchers.
Despite the assumption that inflectional morphology is present in the lexicon, most BLI datasets list only a handful of inflected forms per lexeme. This is due to frequency restrictions imposed when the datasets are created and the consequences can be quite severe; Czarnowska et al. (2019) reveal that the MUSE test dictionaries (Conneau et al., 2018) for Romance language pairs cover, on average, only 3% of paradigms for verbs they contain. In contrast, we assert that BLI models should be trained and evaluated on datasets that contain a more representative range of morphological inflections. We use the term morphologically enriched dictionary for such bilingual lexicons (see §5.1).
To our knowledge, we are the first to explicitly model inflectional morphology in BLI. Closest to our endeavor, prior work addresses morphology in BLI by incorporating grammatical information learned by a pre-trained denoising language model, while Riley and Gildea (2018) enhance the projection-based approach of Artetxe et al. (2017) with orthographic features to improve performance on BLI for related languages.

A Joint Model for Morphologically Aware Word-level Translation
The primary contribution of this work is a morphologically aware probabilistic model for word-level translation. Our model exploits a simple intuition: Because the core unit of meaning is the lexeme, one should translate through the lexeme and then inflect the word according to the target language's morphology.
Notation. In the task of BLI, we consider a source language s and a target language t. The goal is to translate inflected forms in the source language ι_s ∈ L_s to inflected forms in the target language ι_t ∈ L_t, where L_s and L_t are the source-side and target-side lexicons, respectively. We denote source-side lemmata as λ_s ∈ L_s and target-side lemmata as λ_t ∈ L_t. We use τ_s ∈ T_s for the morpho-syntactic tag of the source form ι_s and τ_t ∈ T_t for the tag of the target form ι_t, where T_s and T_t are the sets of possible tags in the source and the target language. Finally, let e : L → R^N be a function that takes a word as input and returns its pre-trained monolingual word embedding.
Model. We construct a joint probability model for morphologically aware word-level translation. Using the notation defined in the previous paragraph, we formally define the model as

p(π_t, λ_t, λ_s, τ_s | ι_s) = p(π_t | λ_t, τ_s) · p(λ_t | λ_s) · p(λ_s, τ_s | ι_s)    (1)

where π_t is a set of valid target-language translations, i.e. pairs of inflected forms and their tags (ι_t, τ_t). The joint distribution is factorized into three parts: a synthesizer p(π_t | λ_t, τ_s), a translator p(λ_t | λ_s) and an analyzer p(λ_s, τ_s | ι_s). In the next three sections, we define each of these distributions. Note that the model, as defined in eq. (1), can provide a set of valid translations π_t. For languages with very similar morphological systems, this set will often have one element (with the same morpho-syntactic description), while for more distant languages it will contain a number of inflected forms.

The Synthesizer: p(π_t | λ_t, τ_s)
The synthesizer produces a set of valid target-side inflected forms and their tags π_t given a target-side lemma λ_t and a source-side tag τ_s. We formally define this distribution by factoring the joint distribution over forms and tags into two parts:

p(ι_t, τ_t | λ_t, τ_s) = p(ι_t | λ_t, τ_t) · p(τ_t | τ_s)

The first part, the inflector p(ι_t | λ_t, τ_t), produces an inflected form ι_t given a lemma λ_t and a morphological tag τ_t. This problem has been well studied in the NLP literature (Cotterell et al., 2016; Cotterell et al., 2017). The second part, the tag translator p(τ_t | τ_s), determines the possible target-side morphological tags that are compatible with the features present in the source tag.
In principle, our model is compatible with any probabilistic inflector. In this paper, we employ the model of Wu et al. (2019b), which obtained the single-model state of the art at the time of experimentation (McCarthy et al., 2019). The model has a latent character-level monotonic alignment between the source and target inflections that is jointly learned with the transducer and is, in effect, a neuralized version of a hidden Markov model for translation (Vogel et al., 1996).
In this work we focus on closely related languages and make the simplifying assumption that there exists a single most-plausible translation for each inflected form. 4 We formalize the tag translator using an indicator function:

p(τ_t | τ_s) = 1{τ_t = τ_s}

For experiments with more distant language pairs, one can instead define p(τ_t | τ_s) to be a multi-label classifier.
The Translator: p(λ_t | λ_s)
As our translator, we construct a log-bilinear model that yields a distribution over all elements of the target lexicon. We assume the existence of both source- and target-side embeddings. For notational simplicity, we use e to denote the embedding function for both the source and the target language, although in practice these look-up functions are distinct. The model has a single matrix of parameters Ω ∈ R^{N_t × N_s}, where N_s is the source embedding dimensionality and N_t the target embedding dimensionality. Our translator is defined as the following conditional model

p(λ_t | λ_s) = exp(e(λ_t)^T Ω e(λ_s)) / Z(λ_s)

where the normalizer is defined as

Z(λ_s) = Σ_{λ'_t ∈ L_t} exp(e(λ'_t)^T Ω e(λ_s))

Note that this log-bilinear model differs from most embedding-based bilingual lexicon inducers, which predict embeddings rather than words. For example, one previous approach places a multivariate Gaussian over the target language's embedding space. 5

Orthogonal Regularization. During training we employ a special regularization term on the parameter matrix Ω. Specifically, we use

R(Ω) = α ||Ω^T Ω − I||_F^2

with a tunable "strength" hyperparameter α ∈ R_≥0. This term encourages the translation matrix to be orthogonal, which has led to consistent gains in past work (Xing et al., 2015; Artetxe et al., 2016).

The Analyzer: p(λ_s, τ_s | ι_s)
For our probabilistic analyzer we use the same hard-attention model as for the inflector. The model predicts both the lemma and the morphological tag; the output is a morphological tag, followed by a special end-of-tag character and a sequence of lemma characters.
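The log-bilinear translator and its orthogonal regularizer can be sketched numerically as follows. This is an illustrative numpy sketch, not the paper's PyTorch implementation; the embedding matrices are random stand-ins.

```python
import numpy as np

def translator_log_probs(E_t, e_s, Omega):
    """log p(lambda_t | lambda_s) for every target lemma: the score of
    target lemma i is E_t[i] . (Omega @ e_s), normalized with a softmax
    over the whole target lexicon (the rows of E_t)."""
    scores = E_t @ (Omega @ e_s)
    scores = scores - scores.max()          # numerical stability
    return scores - np.log(np.exp(scores).sum())

def orthogonal_penalty(Omega, alpha=1.0):
    """alpha * ||Omega^T Omega - I||_F^2: zero iff Omega has
    orthonormal columns."""
    d = Omega.shape[1]
    diff = Omega.T @ Omega - np.eye(d)
    return alpha * np.sum(diff ** 2)

rng = np.random.default_rng(0)
E_t = rng.normal(size=(5, 3))   # 5 target lemmata, N_t = 3
e_s = rng.normal(size=3)        # one source lemma embedding, N_s = 3
Omega = np.eye(3)               # orthogonal initialization: zero penalty
log_p = translator_log_probs(E_t, e_s, Omega)
```

Because the softmax normalizes over target lemmata rather than predicting a point in embedding space, the translator is a proper distribution over words, which is what allows it to compose with the analyzer and synthesizer in eq. (1).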
Frequency-Based Hybrid. As outlined in §1, our hybrid model translates high-frequency forms, which are most likely to exhibit irregularity, directly, using only the translator component. For example, the Spanish form pide, an irregular inflected form of pedir ('to ask for'), would be translated directly, since pedir is less frequent. In the broader context of morphological processing, the different handling of a form depending on its frequency or regularity can be linked to the dual-route hypothesis (Pinker and Prince, 1994; Pinker, 1998), which posits that regular and irregular inflection are handled by different cognitive mechanisms, and to the works of Baayen (1992), Baayen (1993) and Hay (2001), which loosely inspired this heuristic. 7
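The routing decision itself is simple to sketch. The frequencies and the two route functions below are hypothetical stand-ins; the real model would call the translator directly on one route and the analyzer-translator-inflector pipeline on the other.

```python
def hybrid_translate(iota_s, freq, lemma_of, direct, via_lexeme):
    """Frequency heuristic sketch: a form more frequent than its own
    lemma (e.g. Spanish pide vs. pedir) is likely irregular or
    lexicalized, so it is translated directly by the translator;
    all other forms take the morphologically aware route."""
    if freq.get(iota_s, 0) > freq.get(lemma_of[iota_s], 0):
        return direct(iota_s)
    return via_lexeme(iota_s)

# Hypothetical corpus frequencies:
freq = {"pide": 900, "pedir": 500, "pidiereis": 2}
lemma_of = {"pide": "pedir", "pidiereis": "pedir"}
route = hybrid_translate("pide", freq, lemma_of,
                         direct=lambda w: "direct",
                         via_lexeme=lambda w: "lexeme")
print(route)  # direct
```

The rare form pidiereis would take the lexeme route under the same frequencies, which is exactly the case where generalizing through the lemma pays off.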

Experimental Setup
Our evaluation involves 3 Romance languages which exhibit a higher degree of inflection and are commonly experimented on within BLI: French, Spanish and Italian. Because these languages come from the same branch of the Indo-European family, the results serve as an empirical upper bound on BLI performance.

Czarnowska et al. (2019). As discussed in §2.3, given that inflectional morphology is present in the induced lexicon, BLI models should be trained and evaluated on datasets which list a range of compatible inflected form pairs for every source-target lexeme pair. At this time, the dictionaries of Czarnowska et al. (2019) are the only publicly available resource that meets this criterion and, for this reason, they are the most important evaluation benchmark used in this work. The dictionaries were generated from Open Multilingual WordNet (Bond and Paik, 2012), Extended Open Multilingual WordNet (Bond and Foster, 2013) and UniMorph 8 (McCarthy et al., 2020), a resource comprised of inflectional word paradigms for 107 languages. The dictionaries only list parts of speech that undergo inflection in either the source or the target language; these are nouns, adjectives and verbs in the Romance languages.

Conneau et al. (2018). MUSE (Conneau et al., 2018) was generated using an "internal translation tool" and is one of the few other resources covering pairs of Romance languages. However, it is skewed towards the most frequent forms: the vast majority of forms in MUSE are ranked in the top 10k of the vocabularies of their respective languages, causing it to omit many morphological variants of words. The dataset also suffers from other issues, such as a high level of noise from proper nouns (Kementchedjhieva et al., 2019). Thus, we do not view this resource as a reasonable benchmark for BLI.

Artetxe et al. (2016).
They learn an orthogonal linear transformation matrix between the source language space and the target language space, after length-normalizing and mean-centering the monolingual embedding matrices. Their method is fully supervised and works best with large amounts of training data (several thousand translation pairs).

The second baseline is a weakly supervised, self-learning model which can induce a dictionary given only a handful of initial seed translations. This is achieved by iteratively alternating between two steps: an inflected-form alignment step and a mapping step. The first consists of finding a matching in a bipartite weighted graph in which source forms constitute one set and target forms the other. During the mapping step, the resulting alignment is used to learn a better projection from the source language space into the target space; this is done by solving the orthogonal Procrustes problem. 9

Artetxe et al. (2018b). They propose a fully unsupervised approach to BLI. The starting point for their model is an automatic initialization of a seed dictionary which exploits the structural similarity of the monolingual embeddings. Their model also utilizes self-learning, but learns mappings in both directions and employs a range of additional training techniques. Despite being unsupervised, this model constitutes the state of the art on many BLI datasets (e.g. that of Artetxe et al. (2018a)), outperforming even supervised approaches. Thus, we include it in our evaluation and compare it to supervised and semi-supervised approaches.
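The orthogonal Procrustes problem used in the mapping step has a closed-form solution via SVD, which can be sketched as follows (the random embedding matrices here are stand-ins for the aligned source and target embeddings):

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal Procrustes: the orthogonal W minimizing
    ||X W - Y||_F is U V^T, where U S V^T is the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # aligned "source" embeddings
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # hidden orthogonal map
W = procrustes(X, X @ R)                      # recovers R up to numerical error
```

Because the solution is orthogonal by construction, it satisfies the same constraint that the regularizer of §3.2 only encourages softly.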

Skyline
We also consider a version of our model which uses an oracle analyzer: the source lemma λ_s and tag τ_s are known a priori. The skyline provides an upper bound on performance, i.e. what performance would be achievable if the model had access to more information about the translated source form.

Experimental Details
We implemented all models in PyTorch (Paszke et al., 2019), adapting the code of Wu et al. (2019b) for the transducers (the analyzer and the inflector). Throughout our experiments we used the Wikipedia FASTTEXT embeddings (Grave et al., 2018), which we length-normalized and mean-centered before training the models. As is standard, we trained all translators on the top 200k most frequent word forms in the vocabularies of both languages. To evaluate on very rare forms present in the dictionaries of Czarnowska et al. (2019) which are out-of-vocabulary (OOV) for FASTTEXT, we created an OOV FASTTEXT embedding for every OOV form that appears in the union of WordNet and UniMorph and appended those representations to the original embedding matrices. 10 We evaluated all models using precision@1, which is equivalent to accuracy. At evaluation, for all models we used cosine as the measure of similarity between two word embeddings. 11

Estimating the Model. We estimate the parameters of our models to maximize the log-likelihood of the training data. In the supervised case, we are able to estimate the parameters of the different components of the model independently. For every language pair we trained a separate translator on the initial seed dictionary, as well as a separate analyzer and inflector on UniMorph entries: in the source language for the analyzers and the target language for the inflectors. Importantly, to ensure that the transducers are never trained on forms they will see at evaluation, we excluded the entries present in the test or development split of the considered evaluation dataset. 12 For inflection, we assume that there always exists exactly one inflected form for a given lemma and tag. However, in the case of analysis, due to syncretism, there often exists a number of plausible interpretations. As a result, the training data might contain a number of correct lemma-tag combinations for every input form.
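The embedding preprocessing mentioned above (length-normalization followed by mean-centering) and the cosine similarity used at evaluation can be sketched as below; the random matrix is a stand-in for a FASTTEXT embedding matrix.

```python
import numpy as np

def preprocess(E):
    """Length-normalize each embedding (row), then mean-center the
    matrix -- the standard preprocessing for projection-based BLI."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E - E.mean(axis=0, keepdims=True)

def cosine(u, v):
    """Cosine similarity between two embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
E = preprocess(rng.normal(size=(6, 4)))  # 6 toy words, dimension 4
```

Note that mean-centering after normalization means the rows are no longer exactly unit-length; some pipelines renormalize a second time, but the sketch follows the two steps as stated in the text.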
At training, we select only one of those possible analyses. We found that this approach works better than training the model on all possible analyses as targets. As a consequence, the analyzers might be biased towards specific morpho-syntactic interpretations of syncretic forms, which, down the line, may hurt the performance of our model. Indeed, we view this as a trade-off between having a more accurate but biased analyzer that limits the possible translations produced by the model, and a more noisy analyzer which can result in more varied translations.
Hyperparameters. The models were trained with Adam (Kingma and Ba, 2014), using an initial learning rate of 0.001 for the transducers and 0.05 for most translators (see §A for a detailed breakdown). We halved the learning rate after every epoch for which the development loss increased and utilized early stopping (with a minimum learning rate of 1 × 10^-8). For the transducers we also applied gradient clipping with a maximum gradient norm of 5. For the self-learning baseline, we set the number of candidate target words considered for each source word during matching to 15 and constrained the matching to consider only the 40k most frequent forms.

Decoding. We decode the model with greedy search, i.e. beam search with a beam size of 1. In all experiments we return the single most suitable form.
Data Requirements. As for other projection-based BLI approaches, the translator component needs to be trained on monolingual embeddings and an initial seed dictionary, which can be generated automatically.
The only additional resource required to train the full model is UniMorph, or a similar morphological database, which is used to train the transducers. Although this extra requirement could be a limitation in some cases, such morphological lexicons are available in an increasingly large number of languages because they are a by-product of descriptive linguists' efforts to document the world's languages.

Results
We consider the fully supervised and the weakly supervised settings. Weak supervision in the case of BLI refers to populating the seed dictionary with identically spelled strings. In Table 1, we present our experimental results on the whole evaluation dictionary, including the words which are out-of-vocabulary for FASTTEXT (ALL), as well as on the in-vocabulary forms only (VOC). When evaluated on the morphologically enriched resource, our proposed approach leads to substantial performance gains over the baseline models on every language pair. In the supervised setting, for the in-vocabulary words we note an average 3% improvement for our base model and an 8% improvement for our hybrid model over the best-performing baseline (Artetxe et al., 2016). In the full-vocabulary experiments, the improvements reach 18% for our base model and 20% for the hybrid.
In the weakly supervised setting, the performance gains remain similarly high: compared to the best baseline (Artetxe et al., 2018b), our base model improves by 2% on the in-vocabulary forms and by 16% on the full vocabulary, while the hybrid model improves by 6% and 18%, respectively. Our experiments with an oracle analyzer demonstrate that even larger gains are possible: the oracle consistently outperforms every other model across all evaluation conditions. When evaluated on MUSE (Conneau et al., 2018), the hybrid model is competitive with the baselines, while our base model performs worse. This is not surprising given that MUSE contains only the most frequent inflected forms which, as discussed in §4 and as we demonstrate later in this section, are the weakness of our base model. The hybrid model can also suffer from misanalyzing frequent irregular forms of even more frequent lemmata, such as puede or pudo, both forms of Spanish poder ('to be able to').
Frequency Breakdown. In Figure 1 we present the performance breakdown across 10 frequency bins for our base model, its hybrid version, the oracle and the model of Artetxe et al. (2018b) for an example language pair: Italian-Spanish. The plot displays results for the supervised setting, for the dictionary of Czarnowska et al. (2019). 13 As the plot shows, our method proves to be particularly beneficial for translating forms of medium to low frequency: in contrast to the baseline model, whose performance drops continuously as the forms become less frequent, the performance of our models drops initially but then plateaus at around 40% accuracy (for the in-vocabulary forms). Figure 1 also exposes a weakness of our base model: it suffers a substantial performance drop for the two highest frequency bins. Notably, the hybrid model does not suffer from this limitation. Indeed, the fact that the shape of the hybrid's curve closely resembles that of the oracle suggests that the heuristic we use successfully identifies irregular forms, which are hard to analyze. In Table 2 we present the translation accuracy on French-Spanish, in the supervised setting, for the model of Artetxe et al. (2018b) and the hybrid model across a range of morpho-syntactic categories; the models are evaluated only on source forms belonging to a particular paradigm slot. We observe that our approach leads to improvements on all but two verbal paradigm slots, while for adjectives and nouns the performance is competitive with the baselines. For many categories the improvements are very substantial; e.g., for the in-vocabulary 2nd person plural imperative forms of a verb (V;POS;IMP;2;PL) and 2nd person plural present subjunctive forms (V;SBJV;PRS;2;PL) the accuracy improves by 55%.

A Critique of Modern BLI
The model we develop in §3 stems from a desire to better integrate inflectional morphology into current state-of-the-art models for BLI. On the one hand, the empirical findings we discuss in §6 indicate this attempt was a success; on the other, our more nuanced conclusion is that the task of BLI, as currently researched in NLP, is ill-defined with respect to inflectional morphology. Indeed, we suggest that BLI needs redirection going forward. The recent trend in BLI research is to remain data-driven and to avoid specialist linguistic annotation. Current projection-based approaches to BLI depend heavily on the assumption that the lexicons of different languages are approximately isomorphic (Mikolov et al., 2013; Miceli Barone, 2016). However, given the immense variation in the morphological systems of the world's languages, this assumption is prima facie false. Consider the simple contrast between Spanish and English, where the former exhibits much more morphological inflection than the latter; there can be no one-to-one alignment between the words in those two lexicons. The failure of the isomorphism assumption has been discussed and addressed in many recent works on cross-lingual word embeddings (Nakashole and Flauger, 2018; Ormazabal et al., 2019; Patra et al., 2019). However, none of those studies directly targets inflectional morphology. In this work we highlight that inflectional morphology complicates BLI, and NLP researchers should strive to develop a cleaner way to integrate it into their models. We contend that the models we present make progress in this direction, but there is still a long way to go. We now make three concrete suggestions for BLI going forward. The first two involve engaging with morphology more seriously and are extensions to the ideas in this paper. The third focuses on backing away from morphology.
More Fine-Grained Lexicons. Our first proposal is to create more elaborate morphological dictionaries, in the style of those by Czarnowska et al. (2019). The primary drawback of this suggestion is that such resources are tedious to create. Czarnowska et al. (2019) focus on genetically related languages, for which inflectional morphological systems are easily compatible. However, this is often not the case: every morphological system carves up the semantic space in its own way, e.g. there is no good German equivalent of Polish verbal aspect. Thus, such elaborate resources should be carefully crafted and specify lemmata, inflected forms and tags for both the source and the target language. It follows, modulo polysemy and other lexical ambiguity, that the task would then be well defined.
Contextual Word-level Translation. Another suggestion for future work is to contextualize word-level translation. Indeed, translation of syntactic features without context is somewhat unusual-e.g. in some contexts a feminine adjective in Spanish might not be translated as feminine in Italian, because a feminine noun in Spanish might be masculine in Italian. Arguably, much of the morphological ambiguity (and lexical ambiguity) present in modern BLI can be resolved by a word's source-side context. However, identifying the amount and type of context sufficient to disambiguate possible translations is non-trivial. An additional point is that, in this scenario, BLI starts to approach full machine translation.
Lexeme-Level Translation. The final suggestion is for the task of BLI to ignore morphology. This would mean filtering the training and test lexicons to ensure that only lemmata remain. Such filtering can be performed at the type level with a list of valid lemmata. This third suggestion would entail a return of BLI to the spirit of the task: the induction of a bilingual dictionary. However, creating a lemma-only resource requires additional language-specific knowledge.

Table 3: The optimization and hyperparameter settings for the analysis and inflection modules across all language pairs (above) and for the translator module (below). α is the orthogonal regularization weight. During training we divide the values of α by the batch size to match the weight of the training loss (the losses are averaged across observations for each mini-batch).

Table 4 displays the accuracy of the analysis and inflection components on the development set of the dictionaries of Czarnowska et al. (2019). For most language pairs the performance of the inflectors is in the high 90s, while the performance of the analyzers averages 90.7%. Indeed, word-type analysis is inherently more difficult than inflection, as it is less deterministic: in most cases there exists more than one correct lemma-tag output. Note that this task is different from word-token analysis, where the context of the analyzed inflection is known. In addition, in contrast to inflection, where the model is given the morpho-syntactic tag, the analyzer has no information about the type of word it is handling. In particular, in an additional line of experimentation, we found that adding a POS tag to the analyzed source form notably improves performance. Similarly, training and evaluating on forms associated with a single POS also leads to better accuracy.