Orthographic Features for Bilingual Lexicon Induction

Recent embedding-based methods in bilingual lexicon induction show good results, but do not take advantage of orthographic features, such as edit distance, which can be helpful for pairs of related languages. This work extends embedding-based methods to incorporate these features, resulting in significant accuracy gains for related languages.


Introduction
Over the past few years, new methods for bilingual lexicon induction have been proposed that are applicable to low-resource language pairs, for which very little sentence-aligned parallel data is available. Parallel data can be very expensive to create, so methods that require less of it or that can utilize more readily available data are desirable.
One prevalent strategy involves creating multilingual word embeddings, where each language's vocabulary is embedded in the same latent space (Vulić and Moens, 2013; Mikolov et al., 2013a; Artetxe et al., 2016); however, many of these methods still require a strong cross-lingual signal in the form of a large seed dictionary.
More recent work has focused on reducing that constraint. Vulić and Moens (2016) and Vulić and Korhonen (2016) use document-aligned data, rather than a seed dictionary, to learn bilingual embeddings. Artetxe et al. (2017) use a very small, automatically generated seed lexicon of identical numerals as the initialization in an iterative self-learning framework to learn a linear mapping between monolingual embedding spaces; Zhang et al. (2017) use an adversarial training method to learn a similar mapping. Lample et al. (2018a) use a series of techniques to align monolingual embedding spaces in a completely unsupervised way; their method is used by Lample et al. (2018b) as the initialization for a completely unsupervised machine translation system.
These recent advances in unsupervised bilingual lexicon induction show promise for use in low-resource contexts. However, none of them makes use of linguistic features of the languages themselves (with the arguable exception of syntactic/semantic information encoded in the word embeddings). This is in contrast to work that predates many of these embedding-based methods and that leveraged linguistic features such as edit distance and orthographic similarity: Dyer et al. (2011) and Berg-Kirkpatrick et al. (2010) investigate using linguistic features for word alignment, and Haghighi et al. (2008) use linguistic features for unsupervised bilingual lexicon induction. These features can help identify words with common ancestry (such as the English-Italian pair agile-agile) and borrowed words (macaroni-maccheroni).
The addition of linguistic features led to increased performance in these earlier models, especially for related languages, yet these features have not been applied to more modern methods. In this work, we extend the modern embedding-based approach of Artetxe et al. (2017) with orthographic information in order to leverage similarities between related languages for increased accuracy in bilingual lexicon induction.

Background
This work is directly based on the work of Artetxe et al. (2017). Following their work, let $X \in \mathbb{R}^{|V_s| \times d}$ and $Z \in \mathbb{R}^{|V_t| \times d}$ be the word embedding matrices of two distinct languages, referred to respectively as the source and target, such that each row corresponds to the $d$-dimensional embedding of a single word. We refer to the $i$th row of one of these matrices as $X_{i*}$ or $Z_{i*}$. The vocabularies for each language are $V_s$ and $V_t$, respectively. Also let $D \in \{0, 1\}^{|V_s| \times |V_t|}$ be a binary matrix representing a dictionary such that $D_{ij} = 1$ if the $i$th word in the source language is aligned with the $j$th word in the target language. We wish to find a mapping matrix $W \in \mathbb{R}^{d \times d}$ that maps source embeddings onto their aligned target embeddings.
Artetxe et al. (2017) define the optimal mapping matrix $W^*$ with the following equation, which minimizes the sum of the squared Euclidean distances between mapped source embeddings and their aligned target embeddings:
$$W^* = \arg\min_{W} \sum_{i} \sum_{j} D_{ij} \, \lVert X_{i*} W - Z_{j*} \rVert^2$$
By normalizing and mean-centering $X$ and $Z$, and enforcing that $W$ be an orthogonal matrix ($W^T W = I$), the above formulation becomes equivalent to maximizing the dot product between the mapped source embeddings and target embeddings, such that
$$W^* = \arg\max_{W} \mathrm{Tr}(X W Z^T D^T)$$
where $\mathrm{Tr}(\cdot)$ is the trace operator, the sum of all diagonal entries. The optimal solution to this equation is $W^* = U V^T$, where $X^T D Z = U \Sigma V^T$ is the singular value decomposition of $X^T D Z$.
This formulation requires a seed dictionary. To reduce the need for a large seed dictionary, Artetxe et al. (2017) propose an iterative, self-learning framework that determines $W$ as above, uses it to calculate a new dictionary $D$, and then iterates until convergence. In the dictionary induction step, $D_{ij}$ is set to 1 if $j = \arg\max_{k} (X_{i*} W) \cdot Z_{k*}$ and to 0 otherwise.
We propose two methods for extending this system using orthographic information, described in the following two sections.
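The alternation between solving for the mapping and inducing a dictionary can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of Artetxe et al. (2017): the function names are hypothetical, the embeddings are assumed to be already normalized and mean-centered, and the stopping rule (the dictionary no longer changes, or an iteration cap is hit) is an assumption made for brevity.

```python
import numpy as np

def solve_mapping(X, Z, D):
    """Orthogonal W maximizing Tr(X W Z^T D^T), from the SVD of X^T D Z."""
    U, _, Vt = np.linalg.svd(X.T @ D @ Z)
    return U @ Vt

def induce_dictionary(X, Z, W):
    """Align each source word with the target word of highest dot product."""
    sims = X @ W @ Z.T
    D = np.zeros_like(sims)
    D[np.arange(sims.shape[0]), sims.argmax(axis=1)] = 1.0
    return D

def self_learn(X, Z, D, max_iters=10):
    """Alternate mapping and induction; max_iters (>= 1) is an assumed cap."""
    for _ in range(max_iters):
        W = solve_mapping(X, Z, D)
        D_new = induce_dictionary(X, Z, W)
        if np.array_equal(D_new, D):  # assumed convergence test
            break
        D = D_new
    return W, D
```

The dense product `X @ W @ Z.T` is practical only for small vocabularies; at the vocabulary sizes used in this work the similarities would have to be computed in batches.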

Orthographic Extension of Word Embeddings
This method augments the embeddings for all words in both languages before using them in the self-learning framework of Artetxe et al. (2017).
To do this, we append to each word's embedding a vector of length equal to the size of the union of the two languages' alphabets. Each position in this vector corresponds to a single letter, and its value is set to the count of that letter within the spelling of the word. This letter count vector is then scaled by a constant before being appended to the base word embedding. After appending, the resulting augmented vector is normalized to have magnitude 1.
Mathematically, let $A$ be an ordered set of characters (an alphabet), containing all characters appearing in either language's alphabet:
$$A = A_{source} \cup A_{target}$$
Let $O_{source}$ and $O_{target}$ be the orthographic extension matrices for each language, containing counts of the characters appearing in each word $w_i$, scaled by a constant factor $c_e$:
$$O_{source,ij} = c_e \cdot \mathrm{count}(A_j, w_i), \; w_i \in V_s \qquad O_{target,ij} = c_e \cdot \mathrm{count}(A_j, w_i), \; w_i \in V_t$$
Then, we concatenate the embedding matrices and extension matrices:
$$X' = [X; O_{source}], \qquad Z' = [Z; O_{target}]$$
Finally, in the normalized embedding matrices $X''$ and $Z''$, each row has magnitude 1:
$$X''_{i*} = \frac{X'_{i*}}{\lVert X'_{i*} \rVert}, \qquad Z''_{i*} = \frac{Z'_{i*}}{\lVert Z'_{i*} \rVert}$$
These new matrices are used in place of $X$ and $Z$ in the self-learning process.
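The construction above can be illustrated with a short NumPy sketch. The function name `extend_embeddings` and its signature are hypothetical, the alphabet is assumed to be simply the set of characters occurring in the two word lists, and the default value of `c_e` is only illustrative.

```python
import numpy as np

def extend_embeddings(X, src_words, Z, tgt_words, c_e=0.125):
    """Append scaled letter-count vectors to both matrices and renormalize."""
    # Assumption: the alphabet is every character seen in either vocabulary.
    alphabet = sorted(set("".join(src_words)) | set("".join(tgt_words)))
    index = {ch: j for j, ch in enumerate(alphabet)}

    def extension(words):
        # One row per word, one column per letter, holding c_e * count.
        O = np.zeros((len(words), len(alphabet)))
        for i, w in enumerate(words):
            for ch in w:
                O[i, index[ch]] += c_e
        return O

    def augment(E, words):
        # Concatenate, then scale each row to magnitude 1.
        E_ext = np.hstack([E, extension(words)])
        return E_ext / np.linalg.norm(E_ext, axis=1, keepdims=True)

    return augment(X, src_words), augment(Z, tgt_words), alphabet
```

Because both languages share the same column order, words with the same letter counts receive the same extension, which is what lets the learned mapping reward orthographic overlap.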

Orthographic Similarity Adjustment
This method modifies the similarity score for each word pair during the dictionary induction phase of the self-learning framework of Artetxe et al. (2017), which uses the dot product of two words' embeddings to quantify similarity. We modify this similarity score by adding a measure of orthographic similarity, which is a function of the normalized string edit distance of the two words.
The normalized edit distance is defined as the Levenshtein distance (L(·, ·)) (Levenshtein, 1966) divided by the length of the longer word. The Levenshtein distance represents the minimum number of insertions, deletions, and substitutions required to transform one word into the other. The normalized edit distance function is denoted as NL(·, ·).
$$NL(w_1, w_2) = \frac{L(w_1, w_2)}{\max(|w_1|, |w_2|)}$$
We define the orthographic similarity of two words $w_1$ and $w_2$ as $\log(2.0 - NL(w_1, w_2))$. These similarity scores are used to form an orthographic similarity matrix $S$, where each entry corresponds to a source-target word pair. Each entry is first scaled by a constant factor $c_s$. This matrix is added to the standard similarity matrix, $X W Z^T$.
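As a concrete illustration of these definitions, the following pure-Python sketch computes the Levenshtein distance with the standard dynamic program, then the normalized distance and the scaled orthographic similarity. The function names are hypothetical, the logarithm is assumed to be the natural logarithm, and returning 0 for two empty strings is an assumption made to avoid division by zero.

```python
import math

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(w1, w2):
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(w1), len(w2))
    return levenshtein(w1, w2) / longest if longest else 0.0

def orthographic_similarity(w1, w2, c_s=1.0):
    """Scaled similarity c_s * log(2.0 - NL(w1, w2))."""
    return c_s * math.log(2.0 - normalized_edit_distance(w1, w2))
```

Under these definitions, identical words score $c_s \log 2$, and words with no characters in common (normalized distance 1) score 0.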
The vocabulary for each language is 200,000 words, so computing a similarity score for each pair would involve 40 billion edit distance calculations. Also, the vast majority of word pairs are orthographically very dissimilar, resulting in a normalized edit distance close to 1 and an orthographic similarity close to 0, having little to no effect on the overall estimated similarity. Therefore, we only calculate the edit distance for a subset of possible word pairs. Thus, the actual orthographic similarity matrix that we use is as follows:
$$S_{ij} = \begin{cases} c_s \log(2.0 - NL(w_i, w_j)) & \text{if } \langle w_i, w_j \rangle \in \mathrm{symDelete}(V_t, V_s, k) \\ 0 & \text{otherwise} \end{cases}$$
This subset of word pairs was chosen using an adaptation of the Symmetric Delete spelling correction algorithm described by Garbe (2012), which we denote as symDelete(·,·,·). This algorithm takes as arguments the target vocabulary, source vocabulary, and a constant $k$, and identifies all source-target word pairs that are identical after $k$ or fewer deletions from each word; that is, all pairs where each is reachable from the other with no more than $k$ insertions and $k$ deletions. For example, the Italian-English pair moderno-modern will be identified with $k = 1$, and the pair tollerante-tolerant will be identified with $k = 2$.
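Because most entries of the orthographic similarity matrix are zero, it can be kept sparse and added only where a candidate pair exists. The sketch below shows one way the adjusted induction step could look; the function name and the dictionary-of-pairs representation are assumptions for illustration, and the scores are taken as already scaled by $c_s$.

```python
import numpy as np

def adjusted_induction(X, Z, W, ortho_scores):
    """Pick the best target index per source word after adding S.

    ortho_scores maps (source_index, target_index) to an already scaled
    orthographic similarity; pairs that are absent contribute 0.
    """
    sims = X @ W @ Z.T
    for (i, j), score in ortho_scores.items():
        sims[i, j] += score
    return sims.argmax(axis=1)
```

With an empty `ortho_scores` this reduces to the unmodified dot-product induction, so the adjustment only changes predictions for words that have an orthographically similar candidate.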
The algorithm computes all strings formed by $k$ or fewer deletions from each target word and stores them in a hash table, then does the same for each source word and generates source-target pairs that share an entry in the hash table. The complexity of this algorithm can be expressed as $O(|V| l^k)$, where $V = V_t \cup V_s$ is the combined vocabulary and $l$ is the length of the longest word in $V$. This is linear with respect to the vocabulary size, as opposed to the quadratic complexity required for computing the entire matrix. However, the algorithm is sensitive to both word length and the choice of $k$. In our experiments, we found that ignoring all words of length greater than 30 allowed the algorithm to complete very quickly while skipping less than 0.1% of the data. We also used small values of $k$ ($0 < k < 4$), and used $k = 1$ for our final results, finding no significant benefit from using a larger value.
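The candidate generation step can be sketched as follows. This is a simplified illustration of the deletion-and-hash idea rather than the adaptation of Garbe's algorithm used in the experiments: the function names are hypothetical, and the `max_len` parameter is an assumed way of expressing the length cutoff of 30 mentioned above.

```python
from collections import defaultdict
from itertools import combinations

def deletion_variants(word, k):
    """All strings obtainable from word by deleting k or fewer characters."""
    variants = {word}
    for n in range(1, min(k, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - n):
            variants.add("".join(word[i] for i in keep))
    return variants

def sym_delete(target_vocab, source_vocab, k=1, max_len=30):
    """Source-target pairs that share a string after <= k deletions each."""
    table = defaultdict(set)
    for t in target_vocab:
        if len(t) <= max_len:  # skip very long words
            for v in deletion_variants(t, k):
                table[v].add(t)
    pairs = set()
    for s in source_vocab:
        if len(s) <= max_len:
            for v in deletion_variants(s, k):
                pairs.update((s, t) for t in table.get(v, ()))
    return pairs
```

For instance, with `k=1` the pair `("moderno", "modern")` is produced because deleting the final "o" from the source word yields a string stored for the target word, whereas `("tollerante", "tolerant")` requires `k=2`.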

Experiments
We use the datasets used by Artetxe et al. (2017), consisting of three language pairs: English-Italian, English-German, and English-Finnish. The English-Italian dataset was introduced in Dinu and Baroni (2014); the other datasets were created by Artetxe et al. (2017). Each dataset includes monolingual word embeddings (trained with word2vec (Mikolov et al., 2013b)) for both languages and a bilingual dictionary, separated into a training and test set. We do not use the training set as the input dictionary to the system, instead using an automatically generated dictionary consisting only of numeral identity translations (such as 2-2, 3-3, et cetera) as in Artetxe et al. (2017). However, because the methods presented in this work feature tunable hyperparameters, we use a portion of the training set as development data.

Results and Discussion
For our experiments with orthographic extension of word embeddings, each embedding was extended by the size of the union of the alphabets of both languages. The size of this union was 199 for English-Italian, 200 for English-German, and 287 for English-Finnish. These numbers are perhaps unintuitively high. However, the corpora include many other characters, including diacritical markings and various symbols (%, [, !, etc.) that are an indication that tokenization of the data could be improved. We did not filter these characters in this work.
For our experiments with orthographic similarity adjustment, the heuristic identified approximately 2 million word pairs for each language pair out of a possible 40 billion, resulting in significant computation savings.
Figure 1 shows the results on the development data. Based on these results, we selected $c_e = 1/8$ and $c_s = 1$ as our hyperparameters. The local optima were not identical for all three languages, but we felt that these values struck the best compromise among them.
Table 1 compares our methods against the system of Artetxe et al. (2017), using scaling factors selected based on development data results. Because approximately 20% of source-target pairs in the dictionary were identical, we also extended all systems to guess the identity translation if the source word appeared in the target vocabulary. This improved accuracy in most cases, with some exceptions for English-Italian. We also experimented with both methods together, and found that this was the best of the settings that did not include the identity translation component; with the identity component included, however, the embedding extension method alone was best for English-Finnish. The fact that Finnish is the only language here that is not in the Indo-European family (and has fewer words borrowed from English or its ancestors) may explain why the performance trends for English-Finnish were different from those of the other two language pairs.
In addition to identifying orthographically similar words, the extension method is capable of learning a mapping between source and target letters, which could partially explain its improved performance over our edit distance method. Table 2 shows some correct translations from our system that were missed by the baseline.

Conclusion and Future Work
In this work, we presented two techniques (which can be combined) for improving embedding-based bilingual lexicon induction for related languages using orthographic information and no parallel data, allowing their use with low-resource language pairs. These methods increased accuracy in our experiments, with both the combined and embedding extension methods providing significant gains over the baseline system.
In the future, we want to extend this work to related languages with different alphabets (experimenting with transliteration or phonetic transcription) and to extend other unsupervised bilingual lexicon induction systems, such as that of Lample et al. (2018a).