Neural Transduction for Multilingual Lexical Translation

We present a method for completing multilingual translation dictionaries. Our probabilistic approach can synthesize new word forms, allowing it to operate in settings where the correct translations have not been observed in text (cf. cross-lingual embeddings). In addition, we propose an approximate Maximum Mutual Information (MMI) decoding objective to further improve performance in both many-to-one and one-to-one word-level translation tasks, where we use either multiple input languages for a single target language or a more typical single language pair. The model is trained in a many-to-many setting, where it can leverage information from related languages to predict words in each of its many target languages. We focus on six languages: French, Spanish, Italian, Portuguese, Romanian, and Turkish. When indirect multilingual information is available, ensembling with a mixture of experts as well as incorporating related languages leads to a 27% relative improvement in whole-word accuracy of predictions over a single-source baseline. To seed the completion when multilingual data is unavailable, it is better to decode with an MMI objective.


Introduction
Translation matrices, i.e., concept-aligned word lists across the world's languages (Buck, 1949; Swadesh, 1950; Swadesh, 1952; Swadesh, 1955; Swadesh, 1971; Nastase and Strube, 2013), enable several avenues of exploration in computational linguistics and human language technologies. They strengthen word alignment models, which can in turn be useful for machine translation (Garg et al., 2019), robust projection of morphosyntactic information across alignments (Yarowsky and Ngai, 2001), and interlinear glossing. Further, fuller word lists enable neogrammarians to better explore phylogeny and phonology across languages (Hewson, 1973; Lowe and Mazaudon, 1994). This work is motivated by the tremendous capacity of humans to generalize during translation, producing forms for words that have never been seen before. This is especially valuable for lower-frequency words, which may not have been observed in training data but could be inferable through regular processes such as cognate relationships with related languages (Mulloni, 2007; Beinborn et al., 2013), borrowing from neighboring or otherwise influential languages, and even esoteric features like temporal similarity (Schafer and Yarowsky, 2002; Wijaya et al., 2017) or image similarity (Bergsma and Van Durme, 2011). In this work, we focus on cognate relationships, because cognates form a large part of both core vocabulary  and technical language (Mulloni, 2007). Unlike conventional bilingual lexicon induction (Rapp, 1995), we do not wish to limit predictions to words that have previously been seen in a corpus. Automated methods to induce plausible lexical translations would significantly reduce the human effort needed both for elicitation (Chelliah, 2001) and for building machine translation systems for less heavily supported languages. This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Our approach to the problem of translation matrix completion is a neural model (parameterized as a character-level sequence-to-sequence network) that handles multiple language pairs, along with an objective function that maximizes both forward and backward probabilities. By leveraging both probabilities, we aim to maximize the flow of information between the source and target languages, leading to more accurate model predictions. We find that by both leveraging information about a concept's form from related languages and carefully combining language-pair-wise predictions of an unknown target word, we can improve accuracy by 27% relative to our baseline multilingual neural model.

Related Work
The task of translation matrix completion, the filling-out of a universal conceptual inventory, has been approached by three broad classes of methods. The first is to manually construct concept inventories, as in Swadesh (1950) and followup work. The next is to automatically identify cognate relationships, e.g. in word lists (Kondrak, 2001; Wijaya et al., 2017; Jäger et al., 2017) or raw text (Koehn and Knight, 2002). The third, which is our focus, is to generate putative cognates by performing transduction in the form of sound or orthographic shifts. In this vein, Mann and Yarowsky (2001) generate cognates by a pipeline of dictionary lookup and probabilistic orthographic shifts. Mulloni (2007) uses an SVM to perform cognate generation. Ciobanu (2016) uses a CRF with reranking to the same end. Beinborn et al. (2013) and  perform translation matrix completion with extracted cognate lists in 6 and 60 language families respectively, using character-level statistical machine translation systems trained on separate source-target language pairs.  performed the same cognate transliteration task with a multi-source, multi-target character-level variant of Johnson et al. (2017).
We adopt the single-system multilingual setup of , which allows sharing information across language pairs. We also take inspiration from recent successes in other generation tasks. Nishimura et al. (2018) addressed a multi-source missing-data problem with multiple encoders and a single decoder to leverage multiple source-language inputs; we build on this to employ multiple sources simultaneously during inference. Further, we introduce a maximum mutual information (MMI) objective to the problem, motivated by the translational equivalence of cognates (Hauer and Kondrak, 2020). MMI has been explored in speech recognition (Bahl et al., 1986; Brown, 1987) and dialog (Li et al., 2016).
Besides MMI, there are a few existing methods for incorporating backward probabilities into the task of translation.  and  follow a noisy channel approach, using Bayes' rule to integrate forward, backward and target language model probabilities. We follow Yee et al. (2019)'s approach and implement both the MMI objective and an ensemble MMI objective.

MMI Reranking
The MMI objective provides a principled way to rerank predicted cognates. We motivate its use with the notion of translational equivalence (Hauer and Kondrak, 2020): the idea that two words (particularly when we constrain our focus to cognates) should translate to each other, regardless of direction. In particular, cognacy is a symmetric relationship. The surface form in each language is a view into the (interlingual) concept. This is explicitly modeled in the MMI objective, which simultaneously optimizes the translations in both the forward and backward directions. In the context of translation matrix completion, when filling in a single concept across multiple languages, the translations, and particularly the cognate relationships, should all be equivalent across all languages (as opposed to sentence-level translation, where a sentence in one language can have multiple interpretations in another language).
Predicting cognates de novo is a sequence transduction task akin to transliteration. To perform this task, it is common to use probabilistic models p_θ(T | S) of target sequences T given source sequence S, controlled by a set of parameters θ. (We will omit the parameter subscript for brevity.) In our case, T and S are entries in the translation matrix: character sequences from alphabets Σ_T and Σ_S. The typical decoding objective is to maximize the conditional log-likelihood, which we will use as a baseline:

    T̂ = argmax_T log p(T | S)    (1)

However, to encourage the symmetric relationship of translational equivalence, we can instead use the MMI objective, which maximizes the mutual information log [ p(S, T) / (p(S) p(T)) ]. Li et al. (2016) show that this can be reformulated as maximizing log p(T | S) + log p(S | T), and then generalize it with a hyperparameter λ to control the weight of each term:

    T̂ = argmax_T [ (1 − λ) log p(T | S) + λ log p(S | T) ]    (2)

Because we wish to predict a word from its several cognates, we can further generalize Li et al. (2016). That is, given source words S_i for languages i ∈ 1..n−1, we can find the target translation that maximizes the mutual information between each source-target language pair:

    T̂ = argmax_T Σ_{i=1}^{n−1} [ (1 − λ) log p(T | S_i) + λ log p(S_i | T) ]    (3)

However, each term log p(S_i | T) in Equation (3) is intractable during decoding: it requires knowledge of the complete prediction T, which is unavailable until decoding has finished. Thus, we approximate this second term by rescoring k-best lists generated by the forward model p(T | S_i). This approximation has previously been used successfully by Li et al. (2016).
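The k-best rescoring approximation can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the data layout (per-candidate forward and backward log-probabilities for each source language) and all numbers are made up.

```python
def mmi_rescore(candidates, lam=0.5):
    """Rerank k-best candidates with the approximate multi-source MMI score:
    sum over source languages of (1 - lam) * log p(T|S_i) + lam * log p(S_i|T).

    candidates: list of (target, fwd_logps, bwd_logps), where fwd_logps[i]
    is log p(T|S_i) from the forward model and bwd_logps[i] is log p(S_i|T)
    from the backward model, scored on the finished candidate string.
    """
    def score(entry):
        _, fwd, bwd = entry
        return sum((1.0 - lam) * f + lam * b for f, b in zip(fwd, bwd))
    return sorted(candidates, key=score, reverse=True)

# Toy k-best list with two source languages and made-up log-probabilities:
kbest = [
    ("telefonic",  [-0.5, -0.7], [-3.0, -2.5]),  # preferred by the forward model
    ("telefonist", [-0.9, -0.8], [-0.4, -0.6]),  # preferred by the backward model
]
```

With λ = 0 this reduces to the forward-only baseline objective; raising λ lets the backward model reorder the list.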

Experimental Setup
We formulate lexical translation as sequence-to-sequence character translation, using one model capable of translating between any pair of languages. The input comprises the characters of the word along with tokens identifying the source and target language. Including the target language token in the input conditions the multilingual model to generate text in the target language (Johnson et al., 2017). The output is a character sequence in the target language. We continue to use this single-model architecture even when using multiple known entries in a row of the translation matrix (that is, many translations of the word to be predicted). We do this by combining distributions from each source language in either a mixture or product of experts (Hinton, 2002). We also compare the two decoding objectives: conditional likelihood and maximum mutual information.

Dataset Cognate relationships among European languages are well-studied and broadly verifiable. To this end, we use a dataset from Dinu and Ciobanu (2014) which contains cognates in six languages: French, Spanish, Italian, Romanian, Portuguese, and Turkish. All use the Latin alphabet plus language-specific diacritics. Except for Turkish, these are Romance languages. (In fact, Turkish is not even in the Indo-European family. Turkish is included because many French and Turkish words were imported into Romanian, as well as French words into Turkish, leading to cognates between these languages. The comparable performance thereon shows that the method is not limited to linguistic cognates.) Like Mulloni (2007), we group words by cognate cluster using an unsupervised clustering algorithm , which results in 18K clusters covering 16K English concepts. Note that a single concept may have more than one cognate cluster. We hold out 500 concepts for a validation set and 500 concepts for a test set. The validation and test data are then used globally for the scenarios defined below.
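The input construction with language tokens can be sketched as follows. The `<xx>` token format and the helper name are illustrative assumptions, not the paper's exact choices; the idea of prefixing language tokens follows Johnson et al. (2017).

```python
def make_example(word, src_lang, tgt_lang):
    """Build the character-level input for the multilingual model.

    The word is split into characters and prefixed with tokens identifying
    the source language and the desired target language; the target token
    conditions the shared model to generate in that language.
    """
    return [f"<{src_lang}>", f"<{tgt_lang}>"] + list(word)

print(make_example("noapte", "ro", "pt"))
# e.g. ['<ro>', '<pt>', 'n', 'o', 'a', 'p', 't', 'e']
```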
Experimental scenarios Broadly, the inputs available to our models at training and test time define three scenarios, illustrated in Figure 1. In NOVEL, only a single form is present in the row. The model must predict a cognate for a novel concept, one whose forms it has never seen in training. (This would be a first step toward filling a completely new row.) In the other two scenarios, only a single entry is missing.
In SINGLE, the model has seen the other entries in the row during training, including the single source word, and it must generate the missing form; the other entries provide a form of indirect supervision. During inference, we test only the single (directly supervised) language pair. In MULTI, generation is conditioned on all known forms of the concept. Comparing NOVEL to SINGLE addresses whether exposure to the concept's other forms is beneficial (the two differ in data availability), and comparing SINGLE to MULTI shows whether the standard single-input, single-output sequence transduction framework is sufficient for cognate prediction (the two differ in conditioning during inference). In all scenarios, the model has seen during training all extant entries in the rows of the training set and no target-language words from the test set. While not all slots in the Cartesian product of data scenarios (NOVEL, SINGLE, MULTI), decoding objectives (conditional log-likelihood or MMI), and ensembling methods (mixture or product of experts) are plausible, this product subsumes the experiments we run.
Evaluation methods We report three metrics of cognate generation quality; for all, higher is better. The first is exact string match accuracy, following : does the model's 1-best prediction exactly match the unknown word? The other two refine the notion of "inaccurate." Character-level BLEU, computed with SacreBLEU (Papineni et al., 2002; Post, 2018), awards partial credit for inexact matches. Mean reciprocal rank (MRR), following Ciobanu (2016), answers: how far down the k-best list is the correct form?
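Two of these metrics are simple to compute directly (BLEU is delegated to SacreBLEU); a minimal sketch, with function names of our own choosing:

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of 1-best predictions that exactly match the gold form."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def mean_reciprocal_rank(kbest_lists, golds):
    """Average over concepts of 1/rank of the gold form in the k-best list,
    counting 0 when the gold form does not appear at all."""
    total = 0.0
    for kbest, gold in zip(kbest_lists, golds):
        if gold in kbest:
            total += 1.0 / (kbest.index(gold) + 1)  # ranks are 1-based
    return total / len(golds)
```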
Experimental details We use a sequence-to-sequence LSTM model with attention (Bahdanau et al., 2015) from the FAIRSEQ toolkit . We train the model with a hidden size of 1024 in both the encoder and decoder and an embedding size of 512, using the NAG optimizer (Botev et al., 2017). In addition, we use a dropout of 0.25, clip gradients to 0.1, and apply early stopping on a validation set after 5 epochs of no improvement. All MMI tradeoff values λ are 0.5 unless otherwise specified. We decode with a beam size of 10 and create k-best lists of length k = 100.

Results
How much does seeing the concept help? (NOVEL vs. SINGLE) In Table 2, we report exact-match accuracy for both NOVEL and SINGLE without reranking. The model sees up to 25% absolute increases in accuracy between language pairs when tested on data for which it had prior knowledge of the concept in non-target languages. Of particular note, performance increases when translating to and from Turkish once related-language information is incorporated. This implies that the model can effectively leverage data from outside the testing language pair, even though the concept has not been seen in the target language. While a similar finding has been shown in multilingual neural sentence-level translation, this is the first time it has been shown for lexical translation.

By aggregating over languages (Table 3), we see that SINGLE increases performance on all metrics. SINGLE's higher MRR shows that the model is not only producing more accurate translations; the quality of the translations is also higher, with gold predictions found higher in the k-best lists. This is also reflected in the increase in character-level BLEU.

Does using multiple inputs help? (SINGLE vs. MULTI) The standard encoder-decoder architecture used for sequence transduction does not lend itself well to simple integration of multiple input sequences. Is it enough to pick one source language and use this model as-is (SINGLE), or does the invested effort in ensembling predictions conditioned on multiple sources (MULTI) pay off?
On average, the MULTI model, with its final predictions determined by the highest sum of log-probabilities, Σ_{i=1}^{n−1} log p(T | S_i), outperforms both NOVEL and SINGLE (Table 4a). For every target language besides French, the MULTI model performs at least as well as, and often much better than, the single-source models.
How should we weight the ensemble? (MULTI vs. MULTI-LSE) We observe one shortcoming in the MULTI model: in some cases, it chooses to ignore an answer rated very highly by many source languages. This is due to the ensembling strategy, which determines a prediction's ensemble score by summing the log-probabilities of the prediction conditioned on each source language. This style of model is a product of experts, in which one language can effectively 'veto' a prediction when the prediction scores low on, or is absent from, that language's k-best list; high weights from the other languages cannot salvage it. This motivates us to explore an alternative ensembling strategy, the mixture of experts, in which we define the probability of a translation as p(T | S_{1..n−1}) ∝ Σ_{i=1}^{n−1} p(T | S_i).

Table 5: Romanian example of the shortcomings of the base MULTI objective and how MULTI-MMI inherently corrects for them. The MULTI model incorrectly predicts "telefonic" because it is the highest-scored word across all languages; in MULTI-MMI this is corrected by virtue of "telefonic" not being predicted by Turkish.

For an explicit example, see Table 5, where we present a snippet of the k-best lists for the target Romanian word Telefonist. In the MULTI model, despite 3 languages agreeing that Telefonist is the correct translation, the model instead chooses Telefonic, which shows up much lower (rank 9 in the 3 languages) in their k-best lists. Telefonist does not appear in the Turkish k-best list, so under the product of experts it is given 0% probability, and the combined probability of Telefonist falls below that of Telefonic. This flaw shows that seeking a globally optimal solution via a product of experts may not be the correct way to leverage multiple sources.
One possible solution is to instead sum the probabilities to form a mixture of experts, which is equivalent to taking the LogSumExp (LSE) of the log-probabilities. Unlike in a product of experts, a single model's low probability cannot 'veto' a candidate. On average, using LSE improves accuracy, and at worst it does nothing; it results in our strongest model, MULTI-LSE (results in Table 4b).
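The contrast between the two ensembling strategies can be sketched with toy, made-up log-probabilities; a candidate missing from one source's k-best list gets log-probability −∞ under that source:

```python
import math

def product_of_experts(logps):
    """Product of experts: sum the per-source log-probabilities.
    A single very low (or -inf) score can veto the candidate."""
    return sum(logps)

def mixture_of_experts(logps):
    """Mixture of experts via LogSumExp: the log of the summed
    (unnormalized) probabilities. No single source can veto."""
    m = max(logps)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in logps))

# A candidate three sources agree on, but absent from the fourth's k-best list:
agreed = [-1.0, -1.2, -0.9, float("-inf")]
# A candidate with mediocre scores from every source:
mediocre = [-2.3, -2.4, -2.2, -2.5]
```

Under the product, the missing score vetoes the widely agreed candidate; under the mixture, it still wins. The `max`-shifting inside `mixture_of_experts` is the usual numerically stable LSE trick.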
How should we decode? (Log-likelihood vs. MMI objective) Above, we described two decoding strategies: maximum conditional log-likelihood and maximum mutual information. The latter infuses a bias toward translational equivalence. Because the model's training objective is the conditional log-likelihood, MMI amounts to a rescoring method. We can decode NOVEL, SINGLE, and MULTI with either strategy.

Table 6: NOVEL-MMI (left) and SINGLE-MMI (right) exact-match accuracy percentage without reranking. Source language on the y-axis; target language on the x-axis.

On average, rescoring helps NOVEL but gives mixed results for SINGLE (Table 3). In particular for SINGLE, in terms of language-to-language accuracy, only a few language pairs seem to benefit from rescoring (Table 6b). Table 7 contains explicit examples of how the model output is reranked for the Turkish-Romanian language pair under the SINGLE-MMI model. The MMI model takes the correct translation and swaps it with another translation (often rank 2 in the list). This means that the backward probabilities p(S | T) of these words are larger than their forward counterparts, leading to wrong translations. Indeed, the Turkish-Romanian backward model's probabilities correspond to the SINGLE results for the Romanian-Turkish direction, and that language pair has very low accuracy in addition to a very low BLEU score.

Looking at results from NOVEL-MMI, performance on most language pairs is unchanged, but a few select language pairs show very large jumps in accuracy (Table 6a). For example, for French-Romanian, we see large increases in accuracy, amounting to an overall average gain of 17% against the base model (Table 4a). This implies that many correct translations are already highly ranked in the k-best list, and the backward model merely boosts these predictions to the top.
Further analysis confirms this: when we consider the MRR of only the words that the French-Romanian model gets wrong under NOVEL, the MRR is quite high: 0.875. This indicates that the Romanian backward model is able to find these candidates and bring them to the top.

Figure 2: Effects of the λ value on target-language accuracy for MULTI-MMI. λ = 0 represents using only the forward probability, and λ = 1 represents using only the backward probability.
Initially, the MULTI-MMI results would suggest that MMI is not useful in an ensemble setting. On closer inspection, though, a few cases suggest otherwise. We previously discussed a flaw in the MULTI model's product of experts approach; MULTI-MMI inherently corrects for it (lower half of Table 5). Here, the MULTI-MMI model correctly chooses Telefonist as the translation: despite the forward model not generating it as a candidate, the backward model scored it highly enough to overcome the forward model. In addition, due to the nature of the backward model, the predictions toward the top of the list are more similar to the target candidate. We confirm this by computing the Levenshtein distance between the top 10 candidates and the target word (Table 9). Another example, given in Table 10, shows a similar phenomenon. While the MULTI model predicts Fonetista, the MULTI-MMI model predicts PHONETISTA, and we observe that all words starting with "f" are no longer scored as highly. Again, decoding is able to pick out the correct word, even when that word is not generated by one of the models.
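The Levenshtein computation behind this check is the standard edit distance; a compact dynamic-programming sketch (our own helper, not the paper's code):

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn word a into word b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]
```

For instance, the candidate "telefonic" is 2 edits away from the target "telefonist" (substitute c with s, insert t).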
In the case of MULTI-MMI with LSE, we find that the gains over MULTI-MMI are greater than the gains between MULTI and MULTI-LSE. We believe that this is due to the backward model acting as a form of regularization that helps flatten the distribution so that one "expert" does not overpower the rest. Despite these gains, it still does not make up for the overall loss in accuracy due to the MMI objective.
Finally, we show the effects of λ on accuracy on a per-language basis (Figure 2). In most cases, a higher λ coincides with decreased performance, but this does not hold everywhere. For both Italian and Portuguese, changing λ does not greatly affect accuracy, which implies that for these languages the backward model acts only as additional noise: the backward distributions are too flat, so adding the backward term is the same as adding a constant to every candidate's score. For Turkish and French, model accuracy increases as we increase λ. This might lead one to believe that the backward model is doing all the work and the forward model is not helping at all; however, this cannot be true, since accuracy plummets when using only the backward term. We conclude that the forward model instead acts as a base reference, which the backward model can then fine-tune. In the case of Turkish, there is a clear optimal λ of 0.6.

Future Work
The work we presented here has particular applications to low-resource languages. As it would be misguided to claim that our system is language-agnostic without verification (Bender, 2009), we plan to expand this work to other language families, such as the Austronesian phonological cognate dataset of Bouchard-Côté et al. (2013). Another direction involves experimenting with non-uniform mixing weights that can adaptively give preference to certain languages, as in . We would also like to extend this work to generate cognates of inflected forms, rather than lemmas, without explicit lemmatization and inflection subcomponents. Unlike in existing cross-lingual morphological inflection tasks (McCarthy et al., 2019; Vylomova et al., 2020), the source and target here are in different languages, rather than relying on transfer. Finally, to assess the downstream value of this linguistic tool, future work could populate a statistical translation model's phrase table with predictions from the model.

Conclusion
We present a single neural model to handle multilingual many-to-many translation of single words. By indirectly leveraging multilingual information in sequence-to-sequence models, we can improve accuracy on the matrix completion task (NOVEL vs. SINGLE): by allowing knowledge of concepts that will eventually be tested on between non-target language pairs, the model indirectly learns how to translate into an unseen word in the target language. Directly leveraging multiple source languages improves accuracy further, on average by 10% relative to our SINGLE model. A flaw in the ensemble scoring method is remedied in part by using LSE, and is also inherently corrected in the MULTI-MMI model. In addition, we show that MMI is a feasible decoding objective and in some scenarios gives substantially better-than-baseline performance; one such scenario is when multilingual data is unavailable. When such data is available, our MULTI-LSE model tends to give the best performance overall.