The Interplay of Semantics and Morphology in Word Embeddings

We explore the ability of word embeddings to capture both semantic and morphological similarity, as affected by the different types of linguistic properties (surface form, lemma, morphological tag) used to compose the representation of each word. We train several models, where each uses a different subset of these properties to compose its representations. By evaluating the models on semantic and morphological measures, we reveal some useful insights on the relationship between semantics and morphology.


Introduction
Word embedding models learn a space of continuous word representations, in which similar words are expected to be close to each other. Traditionally, the term similar refers to semantic similarity (e.g. walking should be close to hiking, and happiness to joy), hence model performance is usually evaluated using semantic similarity datasets. Recently, several works introduced morphology-driven models, motivated by the poor performance of traditional models on morphologically complex words. Such words are often rare, and there is not enough evidence to model them correctly. The morphology-driven models allow pooling evidence from different words which have the same base form. These models work by learning per-morpheme representations rather than just per-word ones, and compose the representing vector of each word from those of its morphemes (as derived from a supervised or unsupervised morphological analysis) and, optionally, its surface form (e.g. $\text{walking} = f(v_{\text{walk}}, v_{\text{ing}}, v_{\text{walking}})$).
The works differ in the way they acquire morphological knowledge (from using linguistically derived morphological analyzers on one end, to approximating morphology using substrings while relying on the concatenative nature of morphology, on the other) and in the model form (cDSMs (Lazaridou et al., 2013), RNN (Luong et al., 2013), LBL (Botha and Blunsom, 2014), CBOW (Qiu et al., 2014), SkipGram (Soricut and Och, 2015; Bojanowski et al., 2016), GGM (Cotterell et al., 2016)). But essentially, they all show that breaking a word into morphological components (base form, affixes and potentially also the complete surface form), learning a vector for each component, and representing a word as a composition of these vectors improves the models' semantic performance, especially on rare words.
In this work we argue that these models capture two distinct aspects of word similarity, semantic (e.g. sim(walking, hiking) > sim(walking, eating)) and morphological (e.g. sim(walking, hiking) > sim(walking, hiked)), and that these two aspects are at odds with each other (should sim(walking, hiking) be lower or higher than sim(walking, walked)?). The base form component of the compositional models is mostly responsible for semantic aspects of the similarity, while the affixes are mostly responsible for morphological similarity.
This analysis brings about several natural questions: is the combination of semantic and morphological components used in previous work ideal for every purpose? For example, if we exclude the morphological component from the representations, wouldn't it improve the semantic performance? What is the contribution of using the surface form? And do the models behave differently on common and rare words? We explore these questions in order to help the users of morphology-driven models choose the right configuration for their needs: semantic or morphological performance, on common or rare words.
We compare different configurations of morphology-driven models, while controlling for the components composing the representation. We then separately evaluate the semantic and morphological performance of each model, on rare and on common words. We focus on inflectional (rather than derivational) morphology. This is because derivations (e.g. affected → unaffected) often drastically change the meaning of the word, and therefore the benefit of having similar representations for words with the same derivational base is questionable, as discussed by Lazaridou et al. (2013) and Luong et al. (2013). Inflections (e.g. walked → walking), in contrast, preserve the word's lexical meaning and only change the values of its grammatical categories.
Our experiments are performed on Modern Hebrew, a language with a rich inflectional morphological system. We build on a recently introduced evaluation dataset for semantic similarity in Modern Hebrew (Avraham and Goldberg, 2016), which we further extend with a collection of rare words. We also create datasets for morphological similarity, for common and rare words. Hebrew's morphology is not concatenative, so unlike most previous work we do not break the words into base and affixes, but instead rely on a morphological analyzer and represent words using their lemmas (corresponding to the base form) and their morphological tags (from which the morphological forms are derived, corresponding to affixes). This allows us to have finer-grained control over the composition, separating inflectional from derivational processes. We also compare to a strong character n-gram based model, which mixes the different components and does not allow finer-grained distinctions.
We observe a clear trade-off between the morphological and semantic performance: models that excel on one metric perform badly on the other. We present the strengths and weaknesses of the different configurations, to help users choose the one that best fits their needs. To the best of our knowledge, this work is the first to make a comprehensive comparison between various configurations of morphology-driven models,¹ as well as the first to evaluate both the semantic and the morphological performance of such models. While our experiments focus on Modern Hebrew due to the availability of a reliable semantic similarity dataset, we believe our conclusions hold more generally.

¹ Among the previous work mentioned above, only a few explored configurations other than (base + affixes) or (surface + base + affixes). Lazaridou et al. (2013) and Luong et al. (2013) trained models which represent a word by its base only, and showed that these models perform worse than the compositional ones (base + affixes). However, the poor results for the base-only models were mainly attributed to undesirable capturing of derivational similarity, e.g. (affected, unaffected). Working with a more linguistically informed morphological analyzer allows us to tease apart inflectional from derivational processes, leading to different results.

Models
Our model form is a generalization of the fastText model (Bojanowski et al., 2016), which in turn extends the skip-gram model of Mikolov et al. (2013). The skip-gram model takes a sequence of words $w_1, ..., w_T$ and a function $s$ assigning scores to (word, context) pairs, and maximizes

$$\sum_{t=1}^{T} \Big[ \sum_{w_c \in C_t} \ell\big(s(w_t, w_c)\big) + \sum_{w_n \in N_t} \ell\big(-s(w_t, w_n)\big) \Big]$$

where $\ell$ is the log-sigmoid loss function, $C_t$ is a set of context words, and $N_t$ is a set of negative examples sampled from the vocabulary. $s(w_t, w_c)$ is defined as $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$ (where $u_{w_t}$ and $v_{w_c}$ are the embeddings of the focus and the context words). Bojanowski et al. (2016) replace the focus-word representation $u_{w_t}$ with a sum over the character n-grams appearing in the word: $u_{w_t} = \sum_{g \in G(w_t)} u_g$, where $G(w_t)$ is the set of n-grams appearing in $w_t$. The n-grams are used to approximate the morphemes in the target word.
We generalize Bojanowski et al. (2016) by replacing the set of n-grams $G(w)$ with a set $P(w)$ of explicit linguistic properties. Each word $w_t$ is then composed as the sum of the vectors of its linguistic properties: $u_{w_t} = \sum_{p \in P(w_t)} u_p$. The linguistic properties we consider are the surface form of the word (W), its lemma (L) and its morphological tag (M).² The lemma corresponds to the base form, and the morphological tag encodes the grammatical properties of the word, from which its inflectional affixes are derived (a similar approach was taken by Cotterell and Schütze (2015)). Moving from a set of n-grams to a set of explicit linguistic properties allows finer control of the kinds of information in the word representation. We train models with different subsets of {W, L, M}.

² The lemma and morphological tag for a word in context are obtained using a morphological analyzer and disambiguator. Then, each value of lemma/tag/surface form is associated with a trainable embedding vector.
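The composition and scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the modified fastText implementation: the function names, the toy dimensionality, the random initialisation, the example analysis of "walking" (including the tag string) and the use of plain per-word context vectors are all hypothetical choices made for the example.

```python
import math
import random


def init_embeddings(keys, dim, seed=0):
    """Assign a small random vector to every key (toy initialisation)."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for key in keys}


def select_properties(analysis, config):
    """Pick the properties named in config, a string over the letters W, L, M."""
    return [(kind, analysis[kind]) for kind in config]


def compose(properties, table, dim):
    """Represent a word as the sum of the vectors of its properties."""
    vec = [0.0] * dim
    for prop in properties:
        for i, x in enumerate(table[prop]):
            vec[i] += x
    return vec


def dot(u, v):
    """Score s(w_t, w_c) as the dot product of focus and context vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def position_objective(focus_vec, context_vecs, negative_vecs):
    """Objective term for one focus position: contexts plus negative samples."""
    pos = sum(log_sigmoid(dot(focus_vec, v)) for v in context_vecs)
    neg = sum(log_sigmoid(-dot(focus_vec, v)) for v in negative_vecs)
    return pos + neg


DIM = 4  # toy value; the experiments in the paper use 200
# Hypothetical analysis of one token: surface form, lemma, morphological tag.
analysis = {"W": "walking", "L": "walk", "M": "VERB+PROGRESSIVE"}
focus_table = init_embeddings(list(analysis.items()), DIM)
context_table = init_embeddings(["in", "park", "spoon"], DIM, seed=1)

# An "LM" model composes the focus word from its lemma and tag only.
u_lm = compose(select_properties(analysis, "LM"), focus_table, DIM)
value = position_objective(
    u_lm,
    [context_table["in"], context_table["park"]],  # observed contexts
    [context_table["spoon"]],                      # one negative sample
)
print(value)
```

Switching the configuration string (for example "L", "WL" or "WLM") changes which property vectors are summed, which is the only difference between the model variants compared in this work.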

Experiments and Results
Our implementation is based on the fastText library (Bojanowski et al., 2016), which we modify as described above. We train the models on the Hebrew Wikipedia (∼4M sentences), using a window size of 2 to each side of the focus word and a dimensionality of 200. We use the morphological disambiguator of Adler (2007) to assign words their morphological tags. We also use the inflection dictionary of MILA (Itai and Wintner, 2008).

Semantic Evaluation Measure
The common datasets for semantic similarity have some notable shortcomings, as noted by Avraham and Goldberg (2016), Faruqui et al. (2016), Batchkarov et al. (2016) and Linzen (2016). We use the evaluation method (and corresponding Hebrew similarity dataset) that we introduced in previous work (Avraham and Goldberg, 2016) (AG). The AG method defines an annotation task which is more natural for human judges, resulting in datasets with improved annotator-agreement scores. Furthermore, the AG evaluation metric takes annotator agreement into account, by putting less weight on similarities that have lower annotator agreement.
An AG dataset is a collection of target-groups, where each group contains a target word (e.g. singer) and three types of candidate words: positives, which are words "similar" to the target (e.g. musician); distractors, which are words "related but dissimilar" to the target (e.g. microphone); and randoms, which are not related to the target at all (e.g. laptop). The human annotators are asked to rank the positive words by their similarity to the target word (distractor and random words are not annotated by humans and are automatically ranked below the positive words). This results in a set of triples of a target word $w$ and two candidate words $c_1$, $c_2$, coupled with a value indicating the annotators' confidence in the ranking $sim(w, c_1) > sim(w, c_2)$. A model is then scored based on its ability to correctly rank each triple, giving more weight to highly-confident triples. The scores range from 0 (all wrong answers) to 1 (perfect match with human annotators).
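As a rough illustration of this kind of scoring, the sketch below computes a confidence-weighted ranking accuracy over triples. It is an assumption-laden approximation: the exact weighting scheme of the AG metric is defined in Avraham and Goldberg (2016) and may differ, and the function name, triple layout and toy similarity values are hypothetical.

```python
def weighted_triple_score(triples, sim):
    """Confidence-weighted share of triples that the model ranks correctly.

    triples: (target, c1, c2, confidence) tuples, where the annotators rank
             sim(target, c1) above sim(target, c2) with the given confidence.
    sim:     callable returning the model's similarity for a pair of words.
    """
    triples = list(triples)
    total = sum(conf for _, _, _, conf in triples)
    if total == 0:
        raise ValueError("the triples carry no confidence weight")
    correct = sum(
        conf for w, c1, c2, conf in triples if sim(w, c1) > sim(w, c2)
    )
    return correct / total


# Toy similarities standing in for a trained model (hypothetical values).
toy_sims = {
    ("singer", "musician"): 0.80,
    ("singer", "microphone"): 0.85,
    ("singer", "laptop"): 0.10,
}


def toy_sim(a, b):
    return toy_sims[(a, b)]


example_triples = [
    ("singer", "musician", "microphone", 0.5),  # positive vs. distractor
    ("singer", "musician", "laptop", 1.0),      # positive vs. random
]
score = weighted_triple_score(example_triples, toy_sim)
print(score)
```

With this weighting, a model that gets every triple wrong scores 0 and one that agrees with all annotator rankings scores 1, matching the range described above.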
We use this method on two datasets: the AG dataset from Avraham and Goldberg (2016) (SemanticSim, containing 1819 triples), and a new dataset we created in order to evaluate the models on rare words (similar to RW (Luong et al., 2013)). The rare-words dataset (SemanticSimRare) follows the structure of SemanticSim, but includes only target words that occur fewer than 100 times in the corpus. It contains a total of 163 triples, all of the type positive vs. random (we find that for rare words, distinguishing similar words from random ones is a hard enough task for the models).

Morphological Evaluation Measure
Cotterell and Schütze (2015) introduced the MorphoDist$_k$ measure, which quantifies the amount of morphological difference between a target word and a list of its $k$ most similar words. We modify the MorphoDist$_k$ measure to derive MorphoSim$_k$, a measure that ranges between 0 and 1, where 1 indicates total morphological compatibility.
The MorphoDist measure is defined as:

$$\mathrm{MorphoDist}_k(w) = \sum_{w' \in K_w} \min_{m_w, m_{w'}} d_h(m_w, m_{w'})$$

where $K_w$ is the set of top-$k$ similarities of $w$, $m_w$ and $m_{w'}$ are possible morphological tags of $w$ and $w'$ respectively (there may be more than one possible morphological interpretation per word), and $d_h$ is the Hamming distance between the morphological tags. MorphoDist counts the total number of incompatible morphological components. MorphoSim$_k$ calculates the average rate of compatible morphological values. More formally,

$$\mathrm{MorphoSim}_k(w) = 1 - \frac{\mathrm{MorphoDist}_k(w)}{k \cdot |m_w|}$$

where $|m_w|$ is the number of grammatical components specified in $w$'s morphological tag.
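The two measures can be worked through on a toy example. The sketch below assumes that every morphological tag is a fixed-length tuple of grammatical components (here gender, number, person) and that all analyses of a word have the same length; the word names and tags are hypothetical placeholders, not entries from the datasets.

```python
def hamming(tag_a, tag_b):
    """Number of grammatical components on which two tags disagree."""
    return sum(1 for a, b in zip(tag_a, tag_b) if a != b)


def morpho_dist(word, neighbours, analyses):
    """Sum, over the neighbours, of the smallest Hamming distance between
    any analysis of the target word and any analysis of the neighbour."""
    return sum(
        min(hamming(m, m2) for m in analyses[word] for m2 in analyses[n])
        for n in neighbours
    )


def morpho_sim(word, neighbours, analyses):
    """1 minus the rate of incompatible components over the k neighbours."""
    k = len(neighbours)
    tag_size = len(analyses[word][0])  # assumes equal-length analyses
    return 1 - morpho_dist(word, neighbours, analyses) / (k * tag_size)


# Each word maps to its possible tags: (gender, number, person).
analyses = {
    "target": [("F", "S", "3")],
    "n1": [("F", "S", "3")],                     # fully compatible
    "n2": [("M", "S", "3")],                     # differs in gender
    "n3": [("M", "P", "3"), ("F", "P", "3")],    # ambiguous; best differs in number
}
neighbours = ["n1", "n2", "n3"]

dist = morpho_dist("target", neighbours, analyses)
sim = morpho_sim("target", neighbours, analyses)
print(dist, sim)
```

Here the three neighbours contribute 0, 1 and 1 incompatible components, so MorphoDist is 2 and MorphoSim is 1 - 2/(3·3).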
We use k=10 and calculate the average MorphoSim score over 100 randomly chosen words.

Table 1 (caption): Each entry is of the form [word:lexical meaning:morphological tag]. Green-colored items share the semantic/inflection of the target word, while red-colored items indicate a divergence. In the morphological tags: M/F/MF indicate masculine/feminine/both, P/S indicate plural/singular, 1/2/3 indicate 1st/2nd/3rd person.

To evaluate the morphological performance on rare words, we run another benchmark (MorphoSimRare) in which we calculate the average MorphoSim score over the 35 target words of the SemanticSimRare dataset.

Qualitative Results
To get an impression of the differences in behavior between the models, we queried each model for the top similarities of several words (calculated by cosine similarity between word vectors), focusing on rare words. Table 1 presents the top-3 similarities for the word ‫הסתכלה‬ ([she] looked [at]), which occurs 17 times in the corpus, under the different models. Unsurprisingly, the lemma component has a positive effect on semantics, while the tag component improves the morphological performance. The table also shows a clear trade-off between the two aspects: the models which perform best on semantics are the worst on morphology. This behavior is representative of the dozens of words we examined.
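The kind of query used here, ranking the vocabulary by cosine similarity to a target word, can be sketched as follows. The vectors are toy two-dimensional values chosen for illustration, and the function names are hypothetical; they do not come from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def top_k(word, vectors, k=3):
    """Return the k words most similar to `word`, excluding the word itself."""
    candidates = [w for w in vectors if w != word]
    candidates.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return candidates[:k]


# Toy vectors (hypothetical values).
vectors = {
    "walked": [1.0, 0.0],
    "hiked": [0.9, 0.1],
    "walking": [0.5, 0.5],
    "table": [0.0, 1.0],
}
nearest = top_k("walked", vectors, k=2)
print(nearest)
```

Running the same query against each model variant and comparing the returned lists gives the kind of side-by-side view that Table 1 summarizes.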

Quantitative Results
We compare the different models on the different measures, and also compare to the state-of-the-art n-gram based fastText model of Bojanowski et al. (2016), which does not require morphological analysis. The results (Table 2) highlight the following:
1. There is a trade-off between semantic and morphological performance. Improving one aspect comes at the expense of the other: the lemma component improves semantics but hurts morphology, while the opposite is true for the tag component. The common practice of using both components together is a kind of compromise: the LM, WLM and n-grams models are neither the best nor the worst on any measure.
3. Simply lemmatizing the words is very effective for capturing semantic similarity. This is especially true for the rare words, on which the L model clearly outperforms all others. For the common words, we see a small drop compared to including the surface form as well (WL, WLM). This is attributed to cases in which some of the semantics lies within the word's morphological template. For example, in the W model, the most similar words for the masculine verb ‫נפל‬ (fell) are associated with a soldier (which is a masculine noun): ‫נהרג‬ (was killed), ‫נפגע‬ (was injured), while the similarities of the feminine form ‫נפלה‬ are associated with a land or a state (both are feminine nouns): ‫סופחה‬ (was annexed), ‫נכבשה‬ (was occupied). In the L model, ‫נפלה‬ and ‫נפל‬ share a single, less accurate representation (somewhat similarly to representations of ambiguous words). This suggests using different compositions for common and rare words.

Conclusions
Our key message is that users of morphology-driven models should consider the trade-off between the different components of their representations. Since the goal of most works on morphology-driven models was to improve semantic similarity, the configurations they used (which combine both semantic and morphological components) were probably not the best choices: we show that using the lemma component (either alone or together with the surface form) is better. Excluding the morphological component does make the morphological similarity drop, but this is not necessarily a problem for every task. One should include the morphological component in the embeddings only for tasks in which morphological similarity is required and cannot be handled by other means. Future work could perform an extrinsic evaluation of the different models in various downstream applications. This may reveal which kinds of tasks benefit from morphological information, and which are better served by a purely semantic model.