Unsupervised Morphology Induction Using Word Embeddings

We present a language-agnostic, unsupervised method for inducing morphological transformations between words. The method relies on certain regularities manifest in high-dimensional vector spaces. We show that this method is capable of discovering a wide range of morphological rules, which in turn are used to build morphological analyzers. We evaluate this method across six different languages and nine datasets, and show significant improvements across all languages.


Introduction
Word representations obtained via neural networks (Bengio et al., 2003; Socher et al., 2011a) or specialized models (Mikolov et al., 2013a) have been used to address various natural language processing tasks (Mnih et al., 2009; Huang et al., 2014; Bansal et al., 2014). These vector representations capture various syntactic and semantic properties of natural language (Mikolov et al., 2013b). In many instances, natural language uses a small set of concepts to render a much larger set of meaning variations via morphology. We show in this paper that morphological transformations can be captured by exploiting regularities present in word representations such as the ones trained using the Skip-Gram model (Mikolov et al., 2013a).
In contrast to previous approaches that combine morphology with vector-based word representations (Luong et al., 2013; Botha and Blunsom, 2014), we do not rely on an external morphological analyzer, such as Morfessor (Creutz and Lagus, 2007). Instead, our method automatically induces morphological rules and transformations, represented as vectors in the same embedding space.
* Work done at Google, now at Human Longevity Inc.
At the heart of our method is the SkipGram model described in (Mikolov et al., 2013a). We also exploit the observations made by Mikolov et al. (2013b), and further studied by Levy and Goldberg (2014) and Pennington et al. (2014), regarding the regularities exhibited by such embedding spaces. These regularities have been shown to allow inferences of certain types (e.g., king is to man what queen is to woman). Such regularities also hold for certain morphological relations (e.g., car is to cars what dog is to dogs). In this paper, we show that one can exploit these regularities to model, in a principled way, prefix- and suffix-based morphology. The main contributions of this paper are as follows:
1. a method by which morphological rules are learned in an unsupervised, language-agnostic fashion;
2. a mechanism for applying these rules to known words (e.g., boldly is analyzed as bold+ly, while only is not);
3. a mechanism for applying these rules to rare and unseen words.
We show that this method improves state-of-the-art performance on a word-similarity rating task using standard datasets. We also quantify the impact of our morphology treatment when using large amounts of training data (tens/hundreds of billions of words). The technique we describe is capable of inducing transformations that cover both typical, regular morphological rules, such as adding suffix ed to verbs in English, and exceptions to such rules, such as the fact that pluralization of words that end in y requires substituting it with ies. Because each such transformation is represented in the high-dimensional embedding space, it captures the semantics of the change. Consequently, it allows us to build vector representations for any unseen word for which a morphological analysis is found, therefore covering an unbounded (albeit incomplete) vocabulary.
Our empirical evaluations show that this language-agnostic technique is capable of learning morphological transformations across various language families. We present results for English, German, French, Spanish, Romanian, Arabic, and Uzbek. The results indicate that the induced morphological analysis deals successfully with sophisticated morphological variations.

Previous Work
Many recent proposals in the literature use word representations as the basic units for tackling sentence-level tasks such as language modeling (Mnih and Hinton, 2007; Mikolov and Zweig, 2012), paraphrase detection (Socher et al., 2011a), sentiment analysis (Socher et al., 2011b), discriminative parsing (Collobert, 2011), as well as similar tasks involving larger units such as documents (Glorot et al., 2011; Huang et al., 2012; Le and Mikolov, 2014). The main advantage offered by these techniques is that they can be both trained in an unsupervised manner and tuned using supervised labels. However, most of these approaches treat words as units, and fail to account for phenomena involving the relationship between various morphological forms that affect word semantics, especially for rare or unseen words.
Previous attempts at dealing with sub-word units and their compositionality have looked at explicitly engineered features such as stems, cases, POS, etc., and used models such as factored NLMs (Alexandrescu and Kirchhoff, 2006) to obtain representations for unseen words, or compositional distributional semantic models (Lazaridou et al., 2013) to derive representations for morphologically inflected words, based on the composing morphemes. A more recent trend has seen proposals that deal with morphology using vector-space representations (Luong et al., 2013; Botha and Blunsom, 2014). Given word morphemes (affixes, roots), a neural-network architecture (recursive neural networks in the work of Luong et al. (2013), log-bilinear models in the case of Botha and Blunsom (2014)) is used to obtain embedding representations for existing morphemes, and also to combine them into (possibly novel) embedding representations for words that may not have been seen at training time.
Common to these proposals is the fact that the morphological analysis of words is treated as an external, preprocessing-style step. This step is done using off-the-shelf analyzers such as Morfessor (Creutz and Lagus, 2007). As a result, the morphological analysis happens within a different model from the one in which the resulting morphemes are subsequently used. In contrast, the work presented here uses the same vector-space embedding both to achieve the morphological analysis of words and to compute their representation. As a consequence, the morphological analysis can be justified in terms of the relationship between the resulting representation and other words that exhibit similar morphological properties.

Morphology Induction using Embedding Spaces
The method we present induces morphological transformations supported by evidence in terms of regularities within a word-embedding space. We describe in this section the algorithm used to induce such transformations.

Morphological Transformations
We consider two main transformation types, namely prefix and suffix substitutions. Other transformation types can also be considered, but we restrict the focus of this work to morphological phenomena that can be modeled via prefixes and suffixes. We first provide a high-level description of our algorithm, followed by details regarding the individual steps. The following steps are applied to monolingual training data over a finite vocabulary V:
1. Extract candidate prefix/suffix rules from V
2. Train embedding space $E^n \subset \mathbb{R}^n$ for all words in V
3. Evaluate quality of candidate rules in $E^n$
4. Generate lexicalized morphological transformations
We provide more detailed descriptions next.
Extract candidate rules from V
Starting from $(w_1, w_2) \in V^2$, the algorithm extracts all possible prefix and suffix substitutions from $w_1$ to $w_2$, up to a specified size (see footnote 1). We denote such substitutions using triplets of the form type:from:to. For instance, triplet suffix:ed:ing denotes the substitution of suffix ed with suffix ing; this substitution is supported by many word pairs in an English vocabulary, e.g. (bored, boring), (stopped, stopping), etc. We call these triplets candidate rules, because they form the basis of an extended set from which the algorithm extracts morphological rules.
At this stage, the candidate rule set contains both rules that reflect true morphological phenomena, e.g. suffix:s: (replace suffix s with the null suffix, extracted from (stops, stop), (weds, wed), etc.) or prefix:un: (replace prefix un with the null prefix, from (undone, done), etc.), and rules that simply reflect surface-level coincidences, e.g. prefix:S: (delete S at the beginning of a word, from (Scream, cream), (Scope, cope), etc.).
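The extraction step can be illustrated with a minimal sketch. It assumes a brute-force scan over all ordered word pairs, which is only practical for a toy vocabulary; the function names `candidate_rules` and `support_sets` are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import permutations

MAX_AFFIX = 6  # maximum affix length; the paper reports using 6


def candidate_rules(w1, w2, max_affix=MAX_AFFIX):
    """Return all (type, from, to) triplets that rewrite w1 into w2."""
    rules = set()
    shortest = min(len(w1), len(w2))
    # Suffix rules: the two words share a non-empty beginning.
    for stem_len in range(1, shortest + 1):
        if w1[:stem_len] != w2[:stem_len]:
            break
        src, dst = w1[stem_len:], w2[stem_len:]
        if len(src) <= max_affix and len(dst) <= max_affix:
            rules.add(("suffix", src, dst))
    # Prefix rules: the two words share a non-empty ending.
    for stem_len in range(1, shortest + 1):
        if w1[-stem_len:] != w2[-stem_len:]:
            break
        src, dst = w1[:len(w1) - stem_len], w2[:len(w2) - stem_len]
        if len(src) <= max_affix and len(dst) <= max_affix:
            rules.add(("prefix", src, dst))
    return rules


def support_sets(vocab, max_affix=MAX_AFFIX):
    """Map each candidate rule to the list of word pairs that support it."""
    support = defaultdict(list)
    for w1, w2 in permutations(sorted(vocab), 2):  # all ordered pairs, toy scale only
        for rule in candidate_rules(w1, w2, max_affix):
            support[rule].append((w1, w2))
    return support


toy_vocab = {"bored", "boring", "stopped", "stopping", "stops", "stop", "Scream", "cream"}
toy_support = support_sets(toy_vocab)
```

On this toy vocabulary the sketch produces both the plausible rule suffix:ed:ing and the coincidental rule prefix:S:, which is why a later filtering step in the embedding space is needed.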

Train embedding space
Using a large monolingual corpus, we train a word-embedding space $E^n$ of dimensionality n for all words in V using the SkipGram model (Mikolov et al., 2013a). For the experiments reported in this paper, we used our own implementation of this model (which varies only slightly from the publicly available word2vec implementation; see footnote 2).

Evaluate quality of candidate rules
The extracted candidate rule set is evaluated by using, for each proposed rule r, its support set:

$$S_r = \{(w_1, w_2) \in V^2 \mid w_1 \xrightarrow{r} w_2\}$$

The notation $w_1 \xrightarrow{r} w_2$ means that rule r applies to word $w_1$ (e.g., for rule suffix:ed:ing, word $w_1$ ends with suffix ed), and the result of applying the rule to word $w_1$ is word $w_2$. To speed up computation, we downsample the sets $S_r$ to a large-enough number of word pairs (1000 has been used in the experiments in this paper). We define a generic evaluation function $Ev_F$ over paired couples in $S_r \times S_r$, using a function $F_E : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, as follows:

$$Ev_F((w_1, w_2), (w, w')) = F_E(w_2, w_1 + \uparrow d_w) \qquad (1)$$

Here $\uparrow d_w = w' - w$ is the direction vector given by the pair $(w, w')$. Word-pair combinations in $S_r \times S_r$ are evaluated using Eq. 1 to assess the meaning-preservation property of rule r. We instantiate $F_E$ as $rank_E$, the cosine-similarity rank function in $E^n$. We can quantitatively measure the assertion "car is to cars what dog is to dogs" as $rank_E(cars, car + \uparrow d_{dog})$. We use a single threshold $t^0_{rank}$ to capture meaning preservation (all the experiments in this paper use $t^0_{rank} = 100$): for each proposed rule r, we compute a hit rate based on the number of times Eq. 1 scores above $t^0_{rank}$, over the number of times it has been evaluated. In Table 1 we present some of these candidate rules and their hit rate.
Footnote 1: A maximum size of 6 is used in our experiments.
Footnote 2: At code.google.com/p/word2vec.
We note that rules that are non-meaning-preserving receive low hit rates, while rules that are morphological in nature, such as suffix:ed:ing (verb change from past/participle to present-continuous) and suffix:y:ies (pluralization of y-ending nouns), receive high hit rates.
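The hit-rate computation can be sketched as follows, under some stated assumptions: embeddings are a plain dictionary of toy vectors, a "hit" is taken to mean that the target word ranks within the threshold (rank 0 being the nearest neighbour), and a pair is never evaluated against its own direction vector. The names `rank_of` and `hit_rate` are hypothetical.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of(target, query, emb):
    """0-based rank of `target` when all words are sorted by similarity to `query`."""
    target_sim = cosine(emb[target], query)
    return sum(1 for w, vec in emb.items()
               if w != target and cosine(vec, query) > target_sim)


def hit_rate(support, emb, t_rank=100):
    """Fraction of ordered pair combinations for which w1 + (w' - w) lands near w2."""
    hits = trials = 0
    for (w1, w2), (w, w_prime) in permutations(support, 2):
        direction = [b - a for a, b in zip(emb[w], emb[w_prime])]  # w' - w
        query = [a + d for a, d in zip(emb[w1], direction)]        # w1 + direction
        trials += 1
        if rank_of(w2, query, emb) < t_rank:  # assumption: lower rank is better
            hits += 1
    return hits / trials if trials else 0.0


# Toy embedding in which the third coordinate encodes "plural".
toy_emb = {
    "car": [1.0, 0.0, 0.0], "cars": [1.0, 0.0, 1.0],
    "dog": [0.0, 1.0, 0.0], "dogs": [0.0, 1.0, 1.0],
}
plural_rate = hit_rate([("car", "cars"), ("dog", "dogs")], toy_emb, t_rank=1)
```

In this toy space the plural direction is shared exactly by both pairs, so every evaluation places the expected word at rank 0; real embedding spaces are noisier, which is why a threshold such as 100 is used.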

Generate lexicalized morphological transformations
The results in Table 1 indicate the need for creating lexicalized transformations. For instance, rule suffix:ly: (drop suffix ly, a perfectly reasonable morphological transformation in English) is evaluated to have a hit rate of 32.1%. While such transformations are desirable, we want to avoid applying them when they fire without yielding meaning-preserving results (the remaining 67.9%), e.g., for the word pair (only, on). We therefore create lexicalized transformations by restricting the rule application to the subset of V that passes the meaning-preservation criterion.
The algorithm also computes best direction vectors $\uparrow d_w$ for each rule support set $S_r$. It greedily selects a direction vector $\uparrow d_{w_0}$ that explains (based on Eq. 1) the most pairs in $S_r$. After subset $S^{w_0}_r$ is computed for direction vector $\uparrow d_{w_0}$, it applies recursively on set $S_r - S^{w_0}_r$. This yields a new best direction vector $\uparrow d_{w_1}$, and so on. The recursion stops when it finds a direction vector $\uparrow d_{w_k}$ that explains less than a predefined number of words (we used 10 in all the experiments from this paper).
We consider multiple direction vectors $\uparrow d_{w_i}$ because of the possibly ambiguous nature of a morphological transformation. Consider rule suffix::s, which can be applied to the noun walk to yield plural-noun walks; this case is modeled with a transformation like walk + $\uparrow d_{invention}$, since $\uparrow d_{invention}$ = inventions − invention is a direction that our procedure deems to explain well noun pluralization; it can also be applied to the verb walk to yield the 3rd-person singular form of the verb, in which case it is modeled as walk + $\uparrow d_{enlist}$, since $\uparrow d_{enlist}$ = enlists − enlist is a direction that our procedure deems to explain well 3rd-person singular verb forms. In that sense, our algorithm goes beyond proposing simple surface-level morphemes, with direction vectors encoding well-defined semantics for our morphological analysis.
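The greedy selection of direction vectors can be sketched as below. The sketch abstracts the embedding test into a caller-supplied predicate `explains(pair, direction_pair)`, which in the setting described here would be a rank test based on Eq. 1; the predicate, the function name `greedy_directions`, and the toy part-of-speech labels are all assumptions made for illustration.

```python
def greedy_directions(support, explains, min_explained=10):
    """Repeatedly pick the pair whose direction explains the most remaining pairs.

    Returns a list of (direction_pair, explained_pairs) in selection order.
    """
    remaining = list(support)
    selected = []
    while remaining:
        best_dir, best_subset = None, []
        for cand in remaining:
            subset = [p for p in remaining if explains(p, cand)]
            if len(subset) > len(best_subset):
                best_dir, best_subset = cand, subset
        # Stop once the best direction explains too few pairs (or none at all).
        if not best_subset or len(best_subset) < min_explained:
            break
        selected.append((best_dir, best_subset))
        remaining = [p for p in remaining if p not in best_subset]
    return selected


# Toy stand-in for the embedding test: two pairs "agree" if they share a label.
toy_pos = {
    ("invention", "inventions"): "noun",
    ("car", "cars"): "noun",
    ("dog", "dogs"): "noun",
    ("enlist", "enlists"): "verb",
    ("contradict", "contradicts"): "verb",
}
toy_directions = greedy_directions(
    list(toy_pos), lambda pair, direction: toy_pos[pair] == toy_pos[direction],
    min_explained=2,
)
```

With these toy labels the first selected direction covers the three noun pairs and the second covers the two verb pairs, mirroring the noun-plural versus 3rd-person-singular split discussed for the rule that appends s.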
Lexicalized rules enhanced with direction vectors are called morphological transformations. For each morphological transformation, we evaluate again how well it passes a proximity test in $E^n$ for the words it applies to. As evaluation criteria, we use two instances of Eq. 1, with $F_E$ instantiated to $rank_E$ and $cosine_E$, respectively. We apply more stringent criteria in this second pass, using thresholds on the resulting rank ($t_{rank}$) and cosine ($t_{cosine}$) values to indicate meaning preservation (we used $t_{rank} = 30$ and $t_{cosine} = 0.5$ in all the experiments in this paper). We present in Table 2 a sample of the results of this procedure. For instance, word create can be transformed to creates using two different transformations: suffix:te:tes:↑evaluate and suffix::s:↑contradict, passing the meaning-preservation criteria with rank=0, cosine=0.65, and rank=1, cosine=0.62, respectively.
Lexicalized morphological transformations over a vocabulary V have a graph-based interpretation: words represent nodes, transformations represent edges in a labeled, weighted, cyclic, directed multigraph (weights are (r, c) pairs, rank and cosine values; multiple direction vectors create multiple edges between two nodes; cycles may exist, see e.g. created→create→created in Table 2). We use the notation $G^V_{Morph}$ to denote such a graph. $G^V_{Morph}$ usually contains many strongly connected components, with components representing families of morphological variations. As an illustration, we present in Figure 1 a few strongly connected components obtained for an English embedding space (for illustration purposes, we show only a maximum of 2 directed edges between any two nodes in this multigraph, even though more may exist).
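To make the notion of morphological families concrete, the sketch below groups words into strongly connected components using mutual reachability. It is a deliberately simple quadratic approach suited to toy graphs (a linear-time algorithm such as Tarjan's would be the usual choice at scale), and it collapses parallel edges since they do not affect connectivity. The adjacency-dictionary encoding and function names are assumptions.

```python
def reachable(graph, start):
    """Set of nodes reachable from `start` (including itself) by following edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected_components(graph):
    """Group nodes so that any two nodes in a group can reach each other."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    components, assigned = [], set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        component = {m for m in reach[n] if n in reach[m]}
        components.append(component)
        assigned |= component
    return components


# Toy graph: edges are morphological transformations between words.
toy_graph = {
    "create": ["created", "creates"],
    "created": ["create"],
    "creates": ["create"],
    "bold": ["boldly"],
    "boldly": [],
}
toy_families = strongly_connected_components(toy_graph)
```

Here create, created, and creates form one family because transformations exist in both directions, whereas bold and boldly stay separate in this toy graph because only one direction is present.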

Inducing 1-to-1 Morphological Mappings
The induced graph $G^V_{Morph}$ encodes a lot of information about words and how they relate to each other. For some applications, however, we want to normalize away morphological diversity by mapping to a canonical surface form. This amounts to selecting, from among all the candidate morphological transformations generated, specific 1-to-1 mappings. In graph terms, this means building a labeled, weighted, acyclic, directed graph $D^V_{Morph}$ starting from $G^V_{Morph}$, using the nodes from $G^V_{Morph}$ and retaining only edges that meet certain criteria.
For the experiments presented in Section 4, we build a directed graph $D^V_{Morph}$ as follows:
1. an edge from $w_1$ to $w_2$ in $G^V_{Morph}$ is considered only if count($w_1$) ≤ count($w_2$) in V;
2. if multiple such edges exist, choose the one with minimal rank r;
3. if multiple such edges still exist, choose the one with the maximal cosine c.
The interpretation we give is word normalization: a normalization of w to w' is guaranteed to be meaning preserving (using the direction-vector semantics), and to a more frequent form. A snippet of the resulting graph $D^V_{Morph}$ is presented in Figure 2. One notable aspect of this normalization procedure is that these are not "traditional" morphological mappings, with morphology-inflected words mapped to their linguistic roots. Rather, our method produces morphological mappings that favor frequency over linguistic normalization. An example of this can be seen in Figure 2, where the root form create is morphologically explained by mapping it to the form created. This choice is purely based on our desire to favor the accuracy of the word representations for the normal forms; different choices regarding how this pruning procedure is performed lead to different normalization procedures, including some that are more linguistically motivated (e.g., length-based).
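The pruning into 1-to-1 mappings can be sketched as below. The sketch assumes that an edge is eligible only when its target is at least as frequent as its source (in line with the normalization "to a more frequent form"), then prefers minimal rank and breaks ties by maximal cosine. The tuple layout of edges, the function name `normalize_graph`, and all counts and weights are made up for illustration; equal counts could in principle leave a two-word cycle that a real implementation would have to break.

```python
def normalize_graph(edges, counts):
    """Keep at most one outgoing edge per word.

    edges: iterable of (source, target, rank, cosine)
    counts: word -> corpus frequency
    Returns a dict mapping each normalized word to its chosen target.
    """
    best = {}
    for src, dst, rank, cos in edges:
        if counts[src] > counts[dst]:
            continue  # assumption: only map towards words that are at least as frequent
        key = (rank, -cos)  # lower rank first, then higher cosine
        if src not in best or key < best[src][0]:
            best[src] = (key, dst)
    return {src: dst for src, (_key, dst) in best.items()}


# Toy counts and edge weights, invented for this example.
toy_counts = {"create": 50, "created": 120, "creates": 30, "creating": 40}
toy_edges = [
    ("create", "created", 0, 0.58),
    ("created", "create", 0, 0.58),
    ("creates", "create", 1, 0.62),
    ("creates", "created", 1, 0.65),
    ("creating", "created", 2, 0.70),
    ("creating", "create", 0, 0.50),
]
toy_mapping = normalize_graph(toy_edges, toy_counts)
```

In this toy setup create maps to created because created is the more frequent form, which reproduces the frequency-over-root behaviour described above.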

Morphological Transformations for Rare and Unknown Words
For some count threshold C, we define $V_C = \{w \in V \mid C \le count(w)\}$. The method we presented up to this point induces a morphology graph $D^{V_C}_{Morph}$ that can be used to perform morphological analysis for any word in $V_C$. We analyze the rest of the words we may encounter (i.e., rare words and OOVs) by mapping them directly to nodes in $D^{V_C}_{Morph}$. We extract such mappings from $D^{V_C}_{Morph}$ using all the sequences of edges that start at nodes in the graph and end in a normal form (i.e., nodes that have out-degree 0). The result is a set of rule sequences denoted RS. A count cutoff on the rule-sequence counts is used, since low-count sequences tend to be less reliable (in the experiments reported in this paper we use a cutoff of 50). We also denote with R the set of all edges in $D^{V_C}_{Morph}$. Using sets RS and R, we map $w \notin V_C$ to a node $w' \in D^{V_C}_{Morph}$, as follows:
1. for rule sequences $s \in RS$ from highest to lowest count, if $w \xrightarrow{s} w'$ and $w' \in D^{V_C}_{Morph}$, then s is the morphological analysis for w;
2. if no s is found, do breadth-first search in $D^{V_C}_{Morph}$ using $r \in R$, up to a predefined depth d; for $k \le d$, the word $w'$ with $w \xrightarrow{r_1 \ldots r_k} w' \in D^{V_C}_{Morph}$ and the highest count in $V_C$ is the morphological analysis for w.
For example, this procedure uses the RS sequence s = prefix:un:, suffix:ness: to perform the OOV morphological analysis unassertiveness $\xrightarrow{s}$ assertive. We perform an in-depth analysis of the performance of this procedure in Section 4.2.
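The first of the two lookup steps (trying rule sequences from highest to lowest count) can be sketched as follows. The breadth-first fallback is omitted, the direction vectors attached to rules are ignored, and the rule-sequence counts are invented; `apply_rule` and `analyze_oov` are hypothetical names.

```python
def apply_rule(word, rule):
    """Apply a (type, from, to) rule to `word`; return None if it does not apply."""
    kind, src, dst = rule
    if kind == "prefix" and word.startswith(src):
        return dst + word[len(src):]
    if kind == "suffix" and word.endswith(src):
        return word[:len(word) - len(src)] + dst
    return None


def analyze_oov(word, rule_sequences, graph_nodes):
    """Return (sequence, normal_form) for the most frequent applicable sequence.

    rule_sequences: dict mapping a tuple of rules to its count
    graph_nodes: set of words that are nodes of the morphology graph
    """
    by_count = sorted(rule_sequences.items(), key=lambda item: -item[1])
    for sequence, _count in by_count:
        current = word
        for rule in sequence:
            current = apply_rule(current, rule)
            if current is None:
                break
        if current is not None and current in graph_nodes:
            return sequence, current
    return None


# Toy rule sequences with made-up counts.
toy_sequences = {
    (("suffix", "s", ""),): 900,
    (("prefix", "un", ""), ("suffix", "ness", "")): 120,
}
toy_nodes = {"assertive", "bold"}
toy_analysis = analyze_oov("unassertiveness", toy_sequences, toy_nodes)
```

For the word unassertiveness, the more frequent single-rule sequence fires but does not reach a known node, so the sketch falls through to the two-rule sequence and lands on assertive.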

Empirical Results
In this section, we evaluate the performance of the procedure described in Section 3. Our evaluations aim at answering several empirical questions: how well does our method capture morphology, and how does it compare with previous approaches that use word representations for morphology? How well does this method handle OOVs? How does the impact of morphology analysis change with training data size? We provide both qualitative and quantitative answers for each of these questions next.

Quality of Morphological Analysis
We first evaluate the impact of our morphological analysis on a standard word-similarity rating task. The task measures word-level understanding by computing the correlation between human-produced similarity ratings for word pairs, e.g. (intraspecific, interspecies), and those produced by an algorithm. For the experiments reported here, we train SkipGram models using a dimensionality of n = 500. We denote a system using only SkipGram model embeddings as SG. To evaluate the impact of our method, we perform morphological analysis for words below a count threshold C. For a word $w \in D^{V_C}_{Morph}$, we simply use the SkipGram vector representation; for a word $w \notin D^{V_C}_{Morph}$, we use as word representation its mapping in $D^{V_C}_{Morph}$; we denote such a system SG+Morph. For both SG and SG+Morph systems, we compute the similarity of word pairs using the cosine distance between the vector representations.
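The evaluation loop can be sketched as below, assuming that the embedding dictionary holds vectors only for words above the count threshold and that rarer words are looked up through a precomputed word-to-normal-form mapping. Spearman correlation is computed on average ranks so that tied scores are handled; the function names and toy data are assumptions.

```python
import math


def represent(word, emb, normal_form):
    """Vector for `word`: its own if available, else the vector of its normal form."""
    if word in emb:
        return emb[word]
    return emb.get(normal_form.get(word))  # None if no analysis was found


def average_ranks(values):
    """1-based ranks of `values`, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


# Toy data: human ratings versus system similarities for four word pairs.
toy_human = [1.0, 2.0, 3.0, 4.0]
toy_system = [0.10, 0.20, 0.30, 0.40]
toy_rho = spearman(toy_human, toy_system)
```

Because only the ordering of scores matters for Spearman correlation, the scale of the system similarities is irrelevant, which makes it a natural fit for comparing cosine values with human ratings.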

Data
We train both the SG and SG+Morph models from scratch, for all languages considered. For English, we use the Wikipedia data (Shaoul and Westbury, 2010). For German, French, and Spanish, we use the monolingual data released as part of the WMT-2013 shared task (Bojar et al., 2013). For Arabic we use the Arabic GigaWord corpus (Parker et al., 2011). For Romanian and Uzbek, we use collections of News harvested from the web and cleaned (boilerplate removed, formatting removed, encoding made consistent, etc.). All SkipGram models are trained using a count cutoff of 5 (all words with count less than the cutoff are ignored). Table 3 presents statistics on the data and vocabulary size, as well as the size of the induced morphology graphs. These numbers illustrate the richness of the morphological phenomena present in languages such as German, Romanian, Arabic, and Uzbek, compared to English.
As test sets, we use standard, publicly-available word-similarity datasets. Most relevant for our approach is the Stanford English Rare-Word (RW) dataset (Luong et al., 2013), consisting of 2034 word pairs with a higher degree of English morphology compared to other word-similarity datasets. We also use for English the WS353 (Finkelstein et al., 2002) and RG65 datasets (Rubenstein and Goodenough, 1965). For German, we use the Gur350 and ZG222 datasets (Zesch and Gurevych, 2006). For French we use the RG65 French version (Joubarne and Inkpen, 2011); for Spanish, Romanian, and Arabic we use their respective versions of WS353 (Hassan and Mihalcea, 2009).

Results
We present in Table 4 the results obtained across 6 languages and 9 datasets, using a count threshold for SG+Morph of C = 100. We also include the results obtained by two previously proposed methods, LSM2013 (Luong et al., 2013) and BB2014 (Botha and Blunsom, 2014), which share some of the characteristics of our method.
Even in the absence of any morphological treatment, our word representations are better than previously used ones. For instance, LSM2013 uses exactly the same EN Wikipedia (Shaoul and Westbury, 2010)   ment under the morphology condition). The morphological treatment used by LSM2013 also has a small effect on the words present in the English WS and RG sets; our method does not propose any separate morphological treatment for the words in these datasets, since all of them have been observed more than our C = 100 threshold in the training data (therefore have reliable representations). The SG word-representations for all the other languages (German, French, Spanish, Romanian, and Arabic) also perform well on this task, with much higher Spearman scores obtained by SG compared with the previously-reported scores.
The results in Table 4 also show that our morphology treatment provides consistent gains across all languages considered. For morphologically rich languages, all datasets reflect the impact of morphology treatment. We observe significant gains between the performance of the SG and SG+Morph systems, on top of the high correlation numbers of the SG system. For German, the relatively small increase we observe is due to the fact that German noun compounds are not covered by our morphological treatment. For French, Spanish, Romanian, and Arabic, the gains of the SG+Morph system support the conclusion that our method, while completely language-agnostic, handles well the variety of morphological phenomena present in these languages.

Quality of Morphological Analysis for Unknown/Rare Words
In this section, we quantify the accuracy of the morphological treatment for OOVs presented in Section 3.3. We assume that the statistics for unseen words (with respect to their morphological makeup) are similar to the statistics for low-frequency words. Therefore, for some relatively low counts L and H, the set $V_{[L,H)} = V_L - V_H$ is a good proxy for the population of OOV words that we see at runtime. We evaluate OOV morphology as follows: To make the analysis more revealing, we split the entries in $V_{[L,H)}$ in two: type T1 entries are those that have in-degree > 0 in $D^{V_L}_{Morph}$ (i.e., words that have a morphological mapping in the reference graph); type T2 entries are those that have 0 in-degree in $D^{V_L}_{Morph}$ (i.e., words with no morphological mapping in the reference, e.g., proper nouns in English). Note that the T1/T2 distinction reflects a recall/precision trade-off: T1-words should be morphologically analyzed, while T2-words should not; a method that over-analyses has poor performance on T2, while one that under-analyses performs poorly on T1.
We use the same datasets as the ones presented in Section 4.1, see Table 3. The results for all the languages are shown in Table 6, with all rows using the same setup. Count L = 1000 was chosen such that $D^{V_L}_{Morph}$ is reliable enough to be used as reference. The accuracy results are consistently high (in the 80-90% range) for both T1- and T2-words, even for morphologically rich languages such as Uzbek.
These results indicate that our method does well at both identifying a morphological analysis when appropriate, as well as not proposing one when not justified, and therefore provides accurate morphology analysis for rare and OOV words.

Morphology and Training Data Size
We also evaluate the impact of our morphology analysis under a regime with substantially more training data. To this end, we use large collections of English and German News, harvested from the web and cleaned (boilerplate removed, formatting removed, encoding made consistent). Statistics regarding the resulting vocabularies and the induced morphology are presented in Table 7 (vocabulary cutoffs of 400 for EN and 50 for DE). We present results on the word-similarity task using the same Stanford Rare-Word (RW) dataset for EN and RG dataset for DE, compared against the setup using only 1-2 billion training tokens. For SG+Morph, we use count thresholds of 3000 for EN and 100 for DE. The results are given in Table 5. The increase in the training data for EN brings a roughly 10-point increase in Spearman ρ (from 35.8 to 44.7, and from 41.8 to 52.0). The morphological analysis provides substantial gains at either level of training-data size: 6 points in ρ for Wiki1b (from 35.8 to 41.8), and 7.3 points for News120b EN (from 44.7 to 52.0). For German, the increase in training-data size does not bring visible improvements (perhaps due to the high vocabulary cutoff), but the morphological treatment has a large impact under the large training-data condition (7 points for News20b DE, from 62.1 to 69.1).

Conclusions and Future Work
We have presented an unsupervised method for morphology induction. The method derives a morphological analyzer from scratch, and only requires a monolingual corpus for training, with no additional knowledge of the language. Our evaluation shows that this method performs well across a large variety of language families, and we present here results that improve on the current state of the art for the morphologically rich Stanford Rare-Word dataset.
We acknowledge that certain languages exhibit phenomena (such as word-compounds in German) that require a more focused approach for solving them. But techniques like the ones presented here have the potential to exploit vector-based word representations successfully to address such phenomena as well.