MORSE: Semantic-ally Drive-n MORpheme SEgment-er

We present in this paper a novel framework for morpheme segmentation which uses the morpho-syntactic regularities preserved by word representations, in addition to orthographic features, to segment words into morphemes. This framework is the first to consider vocabulary-wide syntactico-semantic information for this task. We also analyze the deficiencies of available benchmarking datasets and introduce our own dataset that was created on the basis of compositionality. We validate our algorithm across datasets and present state-of-the-art results.


Introduction
Morpheme segmentation is a core natural language processing (NLP) task used as an integral component in related fields such as information retrieval (IR) (Zieman and Bleich, 1997; Kurimo et al., 2007), automatic speech recognition (ASR) (Bilmes and Kirchhoff, 2003; Kurimo et al., 2006), and machine translation (MT) (Lee, 2004; Virpioja et al., 2007). Most previous works have relied solely on orthographic features (Harris, 1970; Goldsmith, 2000; Creutz and Lagus, 2002, 2005), neglecting the underlying semantic information. This leads to an oversegmentation of words, because a change in the surface form pattern is a necessary but insufficient indication of a morphological change. For example, the surface form of "freshman" hints that it should be segmented into "fresh-man", although "freshman" does not semantically describe the compositional meaning of "fresh" and "man". To compensate for this lack of semantic knowledge, previous works (Schone and Jurafsky, 2000; Baroni et al., 2002; Narasimhan et al., 2015) have incorporated semantic knowledge locally, by checking the semantic relatedness of possibly morphologically related pairs of words. Narasimhan et al. (2015) check for semantic relatedness using cosine similarity over word representations (Mikolov et al., 2013a; Pennington et al., 2014). A limitation of such an approach is the inherent "sample noise" in individual word representations (exacerbated in the case of rare words). Moreover, restricting comparison to the local level forces morphological relations to be modeled via semantic relatedness, although it has been shown that difference vectors model morphological relations more accurately (Mikolov et al., 2013b). To address this issue, we introduce a new framework (MORSE), the first to bring semantics into morpheme segmentation on both a local and a vocabulary-wide level.
That is, when checking for the morphological relation between two words, we not only check for the semantic relatedness of the pair at hand (local), but also check if the difference vectors of pairs showing similar orthographic change are consistent (vocabulary-wide).
At a high level, MORSE clusters pairs of words which vary only by an affix; for example, pairs such as ("quick", "quickly") and ("hopeful", "hopefully") get clustered together. To validate the cluster of a specific affix from a semantic, corpus-wide standpoint, we check for the consistency of the difference vectors (Mikolov et al., 2013b). To evaluate it from an orthographic, corpus-wide perspective, we check the size of each affix's cluster. To evaluate each pair in a cluster locally from a semantic standpoint, we check whether a pair of words in a valid affix cluster is morphologically related, i.e., whether its difference vector is consistent with those of other members in the cluster and whether the words in the pair are semantically related (close in the vector space). The need for local evaluations is exemplified by ("on", "only"), which belongs to the cluster of a valid affix ("ly") although the two words are obviously not morphologically related. We would expect such a pair to fail the last two, local, evaluation methods.
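The orthographic clustering step above can be sketched in a few lines. `cluster_by_suffix` and the affix-length cap are illustrative choices of ours, not MORSE's actual implementation, and the sketch is restricted to suffixes for brevity:

```python
from collections import defaultdict

def cluster_by_suffix(vocab, max_affix_len=4):
    """Group word pairs (stem, stem + suffix) into one cluster per suffix."""
    vocab = set(vocab)
    clusters = defaultdict(list)
    for w in vocab:
        # try every suffix length that leaves a non-empty stem
        for i in range(1, min(max_affix_len, len(w) - 1) + 1):
            stem, suffix = w[:-i], w[-i:]
            if stem in vocab:
                clusters[suffix].append((stem, w))
    return dict(clusters)

clusters = cluster_by_suffix(
    ["quick", "quickly", "hopeful", "hopefully", "on", "only"]
)
```

On this toy vocabulary, ("quick", "quickly"), ("hopeful", "hopefully"), and the spurious ("on", "only") all land in the same "ly" cluster, which is exactly why the semantic checks described above are needed.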
Our proposed segmentation algorithm is evaluated using benchmarking datasets from the Morpho Challenge (MC) for multiple languages and a newly introduced dataset for English which compensates for lack of discriminating capabilities in the MC dataset. Experiments reveal that our proposed framework not only outperforms the widely used approach, but also performs better than published state-of-the-art results.
The central contribution of this work is a novel framework that performs morpheme segmentation resulting in new state-of-the-art results. To the best of our knowledge this is the first unsupervised approach to consider the vocabulary-wide semantic knowledge of words and their affixes in addition to relying on their surface forms. Moreover we point out the deficiencies in the MC datasets with respect to the compositionality of morphemes and introduce our own dataset free of these deficiencies.

Related Work
Extensive work has been done in morphology learning, with tasks such as morphological analysis (Baayen et al., 1993), morphological reinflection, and morpheme segmentation. Given the less complex nature of morpheme segmentation in comparison to the other tasks, most systems developed for it have been unsupervised or minimally supervised (mostly for parameter tuning).
Unsupervised morpheme segmentation traces back to Harris (1970) and the framework of Letter Successor Variety (LSV), which builds on the hypothesis that the predictability of successor letters is high within morphemes and low otherwise. The most dominant pieces of work on unsupervised morpheme segmentation, Morfessor (Creutz and Lagus, 2002, 2005) and Linguistica (Goldsmith, 2000), adopt the Minimum Description Length (MDL) principle (Rissanen, 1998): they aim to minimize the description length of the lexicon of morphs as well as the description length of the input corpus given that lexicon. Morfessor has a widely used API and has inspired a large body of follow-up work (Kohonen et al., 2010; Grönroos et al., 2014).
The original unsupervised implementation was later adapted (Kohonen et al., 2010; Grönroos et al., 2014) to allow for minimal supervision. Another work on minimally supervised morpheme segmentation is (Sirts and Goldwater, 2013), which relies on Adaptor Grammars (AGs) (Johnson et al., 2006). AGs learn latent tree structures over an input corpus using a nonparametric Bayesian model (Sirts and Goldwater, 2013). Conditional Random Fields (CRFs) (Lafferty et al., 2001) have also been used for morpheme segmentation. In this supervised method, the task is modeled as a sequence labeling problem, whereby the sequence of labels defines the boundaries of morphemes (Ruokolainen et al., 2013, 2014). In contrast to the previously mentioned generative approaches of MDL and AGs, this method takes a discriminative approach and allows for the inclusion of a larger set of features. In this approach, the CRF learns the conditional probability of a segmentation given a word (Ruokolainen et al., 2013, 2014). All these morpheme segmenters rely solely on orthographic features of morphemes. Semantics were first introduced to morpheme segmentation by Schone and Jurafsky (2000), who use LSA to generate word representations and then evaluate whether two words are morphologically related based on semantic relatedness, in addition to deterministic orthographic methods. Similarly, Baroni et al. (2002) use edit distance and mutual information as metrics for the orthographic and semantic validity, respectively, of a morphological relation between two words. Recent work by Narasimhan et al. (2015), inspired by the log-linear model of Poon et al. (2009), incorporates semantic relatedness into the model via word representations. Other systems, such as that of Üstün and Can (2016), rely solely on evaluating two words from a semantic standpoint using a two-layer neural network.
MORSE introduces semantic information into morpheme segmentation via distributed word representations, while also relying on orthographic features. Inspired by the work of Soricut and Och (2015), instead of merely evaluating semantic relatedness, we are the first to evaluate a morphological relationship via the difference vector of morphologically related words. Comparing the difference vectors of multiple pairs across the corpus that follow the same morphological relation gives MORSE a vocabulary-wide evaluation of the morphological relations it learns.

System
The key limitation of previous frameworks that rely solely on orthographic features is the resulting over-segmentation. As an example, MDL-based frameworks segment "sing" into "s-ing" due to the high frequency of the morphemes "s" and "ing". Our framework combines semantic relatedness with orthographic relatedness to eliminate such errors. For the example mentioned, MORSE validates morphemes such as "s" and "ing" from an orthographic perspective, yet invalidates the relation between "s" and "sing" from a local and vocabulary-wide semantic perspective. Hence, MORSE will segment "jumping" as "jump-ing", but perform no segmentation on "sing".
To bring semantic understanding into MORSE, we rely on word representations (Mikolov et al., 2013a; Pennington et al., 2014). These word representations capture the semantics of the vocabulary through statistics over the contexts in which words appear. Moreover, morpho-syntactic regularities have been demonstrated over these word representations, whereby pairs of words sharing the same relationship exhibit equivalent difference vectors (Mikolov et al., 2013b). For example, it is expected in the vector space of word representations that w_jumping − w_jump ≈ w_playing − w_play, but w_sing − w_s ≉ w_playing − w_play.
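This regularity can be checked directly by comparing difference vectors. The sketch below uses hypothetical 3-dimensional toy vectors (real Word2Vec or GloVe vectors would be used in practice), constructed so that the "-ing" pairs share a common offset:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical toy vectors, built so the "-ing" pairs share one offset
vec = {
    "jump":    np.array([1.0, 0.0, 0.0]),
    "jumping": np.array([1.0, 1.0, 0.1]),
    "play":    np.array([0.0, 1.0, 0.0]),
    "playing": np.array([0.0, 2.0, 0.1]),
    "s":       np.array([0.3, 0.3, 0.9]),
    "sing":    np.array([0.9, 0.1, 0.4]),
}

def diff_consistency(pair_a, pair_b):
    """Cosine similarity between the difference vectors of two word pairs."""
    d_a = vec[pair_a[1]] - vec[pair_a[0]]
    d_b = vec[pair_b[1]] - vec[pair_b[0]]
    return cosine(d_a, d_b)
```

Here `diff_consistency(("jump", "jumping"), ("play", "playing"))` is near 1, while the spurious pair ("s", "sing") yields a low value against the same offset.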
As a high-level description, we first learn all possible affix transformations (morphological rules) in the language from pairs of words, from an orthographic standpoint. For example, the pair ("jump", "jumping") corresponds to the valid suffix rule φ → "ing", and the pair ("slow", "slogan") corresponds to the invalid suffix rule "w" → "gan". Then we invalidate the rules, such as "w" → "gan", that do not conform to a linear relation in the vector space. We also invalidate pairs of words which, due to randomness, are orthographically related via a valid rule although they are not morphologically related, such as ("on", "only"). Next we formalize the objects learned by MORSE and the scores (orthographic and semantic) used for validation; this constitutes the training stage. Finally, we formalize the inference stage, where we use these objects and scores to perform morpheme segmentation.

Training Stage
Objects:
• Rule set R, consisting of all possible affix transformations in a language. R is populated with every rule aff_1 → aff_2 (applied as a suffix or as a prefix) that relates at least one pair of words in the vocabulary on a surface level.
• Support set SS_r, for a rule r ∈ R, consisting of all pairs of words related via r on a surface level. An example support set for the suffix rule "ing" → "ed" would be {("playing", "played"), ("crafting", "crafted"), . . . }.
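A minimal sketch of building rules and their support sets from a vocabulary follows. `build_support_sets`, its length caps, and the tuple encoding of rules (with `""` standing for the empty affix φ) are our illustrative choices, restricted to suffix rules for brevity:

```python
from collections import defaultdict

def build_support_sets(vocab, max_affix_len=4, min_stem_len=3):
    """Collect suffix rules aff1 -> aff2 and their support sets SS_r.
    A rule is encoded as a tuple (aff1, aff2); "" stands for the empty affix."""
    vocab = set(vocab)
    by_stem = defaultdict(set)                 # stem -> suffixes observed with it
    for w in vocab:
        for i in range(0, max_affix_len + 1):
            stem, suf = (w, "") if i == 0 else (w[:-i], w[-i:])
            if len(stem) >= min_stem_len:
                by_stem[stem].add(suf)
    support = defaultdict(set)                 # rule r -> SS_r
    for stem, sufs in by_stem.items():
        for a in sufs:
            for b in sufs:
                if a != b:
                    support[(a, b)].add((stem + a, stem + b))
    return dict(support)

ss = build_support_sets(["playing", "played", "crafting", "crafted", "play", "craft"])
```

On this toy vocabulary, the rule ("ing", "ed") collects exactly the support set {("playing", "played"), ("crafting", "crafted")} from the example above, and the addition rule ("", "ing") collects ("play", "playing") and ("craft", "crafting").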

Scores:
• score_orth^r(r) is a vocabulary-wide orthographic confidence score for a rule r ∈ R. It reflects the validity of an affix transformation in a language from an orthographic perspective. This score is evaluated as score_orth^r(r) = |SS_r|.
• score_sem^r(r) is a vocabulary-wide semantic confidence score for a rule r ∈ R. It reflects the validity of an affix transformation in a language from a semantic perspective, measured by how consistently the difference vectors of the pairs in SS_r agree with one another: score_sem^r(r) = |{((w_1, w_2), (w_3, w_4)) : (w_1, w_2), (w_3, w_4) ∈ SS_r, w_1 − w_2 ≈ w_3 − w_4}| / |SS_r|².
• score_sem^w((w_1, w_2) ∈ SS_r) is a vocabulary-wide semantic confidence score for a pair of words (w_1, w_2). The pair of words is related via r on an orthographic level, but the score reflects the validity of the morphological relation via r on a semantic level. This score is evaluated as score_sem^w((w_1, w_2) ∈ SS_r) = |{(w_3, w_4) : (w_3, w_4) ∈ SS_r, w_1 − w_2 ≈ w_3 − w_4}| / |SS_r|. In other words, it is the fraction of pairs of words in the support set that exhibit a linear relation in the vector space similar to that of (w_1, w_2).
• score_sem^loc((w_1, w_2) ∈ SS_r) is a local semantic confidence score for a pair of words (w_1, w_2). The pair of words is related via r on an orthographic level, but the score reflects the semantic relatedness between the pair. The score is evaluated as score_sem^loc((w_1, w_2) ∈ SS_r) = cos(w_1, w_2), the cosine similarity of the two word vectors.
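The scores can be sketched as follows. Here score_sem^r is computed as the average of score_sem^w over the support set (equivalently, the fraction of consistent pairs-of-pairs), one plausible reading of the vocabulary-wide consistency score; the cosine threshold `t` interpreting "≈" between difference vectors, the toy vectors, and all names are our assumptions:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rule_scores(ss_r, vec, t=0.8):
    """Compute score_orth^r, score_sem^r, score_sem^w, score_sem^loc for a
    rule whose support set is ss_r; "w1 - w2 = w3 - w4 (approx.)" is read
    as the cosine of the two difference vectors exceeding t."""
    pairs = list(ss_r)
    diffs = {p: vec[p[1]] - vec[p[0]] for p in pairs}
    score_orth = len(pairs)                                   # |SS_r|
    def score_sem_w(p):
        # fraction of support-set pairs with a consistent difference vector
        return sum(cos(diffs[p], diffs[q]) > t for q in pairs) / len(pairs)
    score_sem_r = sum(score_sem_w(p) for p in pairs) / len(pairs)
    score_sem_loc = {p: cos(vec[p[0]], vec[p[1]]) for p in pairs}
    return score_orth, score_sem_r, score_sem_w, score_sem_loc

# toy vectors: the "-ing" pairs share the offset (0, 0, 1); ("on","only") does not
vec = {
    "play":     np.array([0.0, 1.0, 0.0]),
    "playing":  np.array([0.0, 1.0, 1.0]),
    "craft":    np.array([1.0, 0.0, 0.0]),
    "crafting": np.array([1.0, 0.0, 1.0]),
    "on":       np.array([0.5, 0.5, 0.2]),
    "only":     np.array([0.1, 0.9, 0.3]),
}
ss_ing = [("play", "playing"), ("craft", "crafting"), ("on", "only")]
score_orth, score_sem_r, score_sem_w, score_sem_loc = rule_scores(ss_ing, vec)
```

On this toy support set, the two genuine "-ing" pairs validate each other (score_sem^w = 2/3 each), while ("on", "only") is consistent only with itself (score_sem^w = 1/3), illustrating how the local scores single out spurious pairs inside a valid rule's cluster.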

Inference Stage
In this stage we perform morpheme segmentation using the knowledge gained from the training stage. We begin with some notation: let R_add = {r : r ∈ R, r = φ → aff} be the set of rules that attach an affix to a word, and R_rep = R \ R_add the set of rules that replace one affix with another. In other words, we divide the rules into those where an affix is added (R_add) and those where an affix is replaced (R_rep).
Given a word w to segment, we search for r*, the solution to the following optimization problem.¹ The search space is limited to the rules that include w in their support set, a fairly small space, which makes the corresponding computation readily tractable: among all rules r with (w′, w) ∈ SS_r, choose the one maximizing the semantic scores of the pair, subject to

  r ∈ R_add,
  score_sem^r(r) > t_sem^r,
  score_orth^r(r) > t_orth^r,
  score_sem^w((w′, w) ∈ SS_r) > t_sem^w,
  score_sem^loc((w′, w) ∈ SS_r) > t_sem^loc,

where t_sem^r, t_orth^r, t_sem^w, and t_sem^loc are hyperparameters of the system. Now, given r* = φ → suf (a suffix rule), w′ is defined by w′ →_r* w; thus the algorithm segments w into w′-suf. Prefixes are treated similarly. Next, the algorithm iterates on w′. Figure 1 shows the segmentation process of the word "unhealthy" based on the sequentially retrieved r*.
The reason we restrict our rule set to R_add in the optimization problem is to avoid rules such as "er" → "ing", as in ("player", "playing"), leading to false segmentations such as "playing" → "player-ing". Yet we cannot completely restrict our search to R_add, due to rules such as "y" → "ies" in words like ("sky", "skies"). To be able to segment words such as "skies", we have to consider rules in R_rep. (Note that r and w uniquely identify w′, and thus the search space is defined only over r.)
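A greedy inference loop of this flavor can be sketched as follows. This simplification handles suffixes only and assumes the score thresholds have already been applied, leaving a set of validated suffixes; all names are illustrative, not MORSE's actual implementation:

```python
def segment(word, vocab, valid_suffixes):
    """Greedy, suffix-only inference sketch: repeatedly strip a validated
    suffix whose remaining stem is itself a vocabulary word."""
    morphs, current = [], word
    changed = True
    while changed:
        changed = False
        # prefer the longest applicable suffix (a simplifying assumption)
        for suf in sorted(valid_suffixes, key=len, reverse=True):
            if current.endswith(suf) and len(current) > len(suf):
                stem = current[: -len(suf)]
                if stem in vocab:
                    morphs.append(suf)
                    current = stem
                    changed = True
                    break
    return [current] + morphs[::-1]

vocab = {"health", "healthy", "jump", "jumping", "sing", "play", "player"}
```

For example, `segment("jumping", vocab, {"ing", "er", "s", "y"})` strips "ing" to reach the vocabulary word "jump", whereas "sing" is left intact because "s" is not a vocabulary word, mirroring the behavior described in the System section.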

Experiments
We conduct a variety of experiments to assess the performance of MORSE, and compare it with prior works. First, the performance is assessed intrinsically on the task of morpheme segmentation and against the most widely used morpheme segmenter: Morfessor. We evaluate the performance across three languages of varying morphology levels: English, Turkish, Finnish, with Finnish being the richest in morphology and English being the poorest. Second, we show the inadequacies of benchmarking gold datasets for this task and describe a new dataset that we create to address the inadequacy. Third, in order to highlight the effect of including semantic information, we compare MORSE against Morfessor on a set of words which should not be segmented from a semantic perspective although orthographically they seem to be segmentable (such as "freshman").
In all of our experiments (unless specified otherwise), we report precision and recall (and corresponding F1 scores), with locations of morpheme boundaries considered positives and the rest of the locations considered negatives. It should be noted that we disregard the starting and ending positions of words, since they form trivial boundaries (Virpioja et al., 2011).
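This boundary-level evaluation can be made concrete as follows; the helper names are ours, but the scheme (internal boundary positions as positives, word edges ignored) follows the description above:

```python
def boundaries(morphs):
    """Internal boundary positions of a segmentation, e.g. ["jump", "ing"] -> {4}.
    Word edges are excluded, matching the evaluation described above."""
    pos, cuts = 0, set()
    for m in morphs[:-1]:
        pos += len(m)
        cuts.add(pos)
    return cuts

def boundary_prf(pred, gold):
    """Micro-averaged boundary-level precision, recall, and F1."""
    tp = fp = fn = 0
    for p, g in zip(pred, gold):
        bp, bg = boundaries(p), boundaries(g)
        tp += len(bp & bg)
        fp += len(bp - bg)
        fn += len(bg - bp)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

prec, rec, f1 = boundary_prf(
    [["jump", "ing"], ["s", "ing"]],  # predicted segmentations
    [["jump", "ing"], ["sing"]],      # gold segmentations
)
```

The example scores a system that correctly splits "jump-ing" but oversegments "sing": the spurious boundary costs precision while recall stays perfect.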

Setup
Both systems, Morfessor and MORSE, were trained on the same monolingual corpus, Wikipedia 2 (as of September 20, 2016), to control for confounding factors within the experiment. For each language considered, the respective Wikipedia dump was preprocessed using publicly available code 3 . We use Word2Vec (Mikolov et al., 2013a) to generate the word representations.

Morpho Challenge Dataset
As our first intrinsic experiment, we consider the Morpho Challenge (MC) gold segmentations available online 5 .
For every language, two datasets are supplied: training and development. For the purpose of our experiments, all systems use the development dataset as a test dataset, and the training dataset is used for tuning MORSE's hyperparameters. MC dataset sizes are reported in Table 1.

Semantically Driven Dataset
There are a variety of weaknesses in the MC dataset, specifically related to whether a segmentation is semantically appropriate or not. We introduce a new semantically driven dataset (SD17) for morpheme segmentation along with the methodology used for its creation; this new dataset is publicly available in canonical 6 and non-canonical 7 versions (Cotterell and Vieira, 2016).

Non-compositional segmentation: One of the key requirements of morpheme segmentation is the compositionality of the meaning of the word from the meaning of its morphemes. This requirement is violated on multiple occasions in the MC dataset. One example from Table 2 is segmenting the word "business" into "busi-ness", which falsely assumes that "business" means the act of being busy. Such a segmentation might be consistent with the historic origin of the word, but with radical semantic changes over time, the segmentation no longer semantically represents the composition of the word's meaning.

Word        Gold Segmentation
freshman    fresh man
airline     air line
business'   busi ness '
ahead       a head
adultery    adult ery

Not only does such a weakness contribute to false segmentations, but it also favors segmentation methods following the MDL principle.
Trivial instances: The second weakness in the MC dataset is the abundance of trivial instances. These instances lack discriminating capability since all methods can easily predict them (Baker, 2001). They comprise genitive cases (such as "teacher's") as well as hyphenated words (such as "turning-point"). For genitive cases, segmenting at the apostrophe leads to perfect precision and recall, and thus such instances are deemed trivial. In the case of hyphenated words, segmenting at the hyphen is a correct segmentation with very high probability: in the MC tuning dataset, the hyphen was a correct indication of a segmentation in 43 out of 46 cases.
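The triviality of these instances is easy to see in code; `trivial_boundaries` is a hypothetical baseline of ours that marks a boundary before an apostrophe and on both sides of a hyphen:

```python
def trivial_boundaries(word):
    """Boundaries predicted by a baseline that splits only at apostrophes
    (before the apostrophe) and at hyphens (on both sides)."""
    cuts = set()
    for i, ch in enumerate(word):
        if ch == "'":
            cuts.add(i)
        elif ch == "-":
            cuts.update({i, i + 1})
    # drop word edges, which are trivial boundaries
    return {c for c in cuts if 0 < c < len(word)}
```

Such a baseline already recovers the gold boundaries of "teacher's" and "turning-point" while predicting nothing for ordinary words, which is why these instances cannot discriminate between segmenters.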
Other issues exist in the Morpho Challenge dataset, although less abundantly. There are instances of wrong segmentations, possibly due to human error; one example is "turning-point" segmented as "turning -point" instead of "turn ing -point". Another issue, which is hard to avoid, is the ambiguity of segmentation boundaries. Take for example the word "strafed": the segmentations "straf-ed" and "strafe-d" are equally justified. In such situations, the MC dataset favors complete affixes rather than complete lemmas, which also favors MDL-based segmenters. We note that the MC dataset also provides segmentations in a canonical version, such as "strafe-ed", yet for the sake of a fair comparison with Morfessor and all systems previously evaluated on the MC dataset, we consider only the former version of segmentations. For these reasons, we create SD17, a new dataset of English gold morpheme segmentations with compositionality guiding the annotations. We select 2000 words randomly from the 10K most frequent words in the English Wikipedia dump and have them annotated by two proficient English speakers. The segmentation criterion was to segment each word to the largest extent possible while preserving its compositionality from the segments. The inter-annotator agreement reached 91% on a word level. After post-annotation discussions, the annotators agreed on 99% of the words; words not agreed on were eliminated, along with words containing non-alpha characters, to avoid trivial instances.
SD17 is used to evaluate the performance of both Morfessor and MORSE. We claim that the performance on SD17 is a better indication of the performance of a morpheme segmenter. By the use of SD17 we expect to gain insights on the extent to which morpheme segmentation is a function of semantics in addition to orthography.

Handling Compositionality
We have hypothesized that following the MDL principle (as Morfessor does) leads to over-segmentation. This over-segmentation happens specifically when the meaning of the word does not follow from the meaning of its morphemes. Examples include words such as "redhead", "duckface", "however", and "sing", wrongly split into "red head", "duck face", "how ever", and "s ing". A subset of these words is defined by linguists as exocentric compounds (Bauer, 2008). MORSE does not suffer from this issue owing to its use of a semantic model.
We use a collection of 100 English words which appear to be segmentable but actually are not (example: "however"). Such a collection will highlight a system's capability of distinguishing frequent letter sequences from the semantic contribution of this letter sequence in a word. We make this collection publicly available 8 .
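One simple way to probe such a collection is an additive compositionality heuristic: compare a word's vector with the sum of its candidate parts' vectors. This is not MORSE's actual criterion (MORSE relies on difference-vector consistency and relatedness); the toy vectors and threshold below are our assumptions for illustration only:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_compositional(word, parts, vec, thresh=0.6):
    """Heuristic: a split is compositional if the word's vector stays close
    to the sum of its parts' vectors (toy threshold, toy vectors)."""
    return cos(vec[word], sum(vec[p] for p in parts)) > thresh

# hypothetical vectors: "airline" lies near air + line, "freshman" does not
vec = {
    "air":      np.array([1.0, 0.0, 0.0]),
    "line":     np.array([0.0, 1.0, 0.0]),
    "airline":  np.array([0.55, 0.5, 0.1]),
    "fresh":    np.array([0.0, 0.0, 1.0]),
    "man":      np.array([0.0, 1.0, 0.2]),
    "freshman": np.array([1.0, 0.1, 0.0]),
}
```

Under these toy vectors the split "air-line" passes the check while "fresh-man" fails it, mirroring the intuition behind the collection: orthographic segmentability alone does not imply semantic compositionality.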

Results
We compare MORSE with Morfessor, and place the performance alongside the state-of-the-art published results.

Morpho Challenge Dataset
As demonstrated in Table 3, MORSE performs better than Morfessor on English and Turkish, and worse on Finnish. Considering English first, using MORSE instead of Morfessor results in a 6% absolute increase in F1 score. This supports our claim for the need for semantic cues in morpheme segmentation, and also validates the method used in this paper. Since English is a less systematic language in terms of the orthographic structure of words, semantic cues are of greater need, and hence a system which relies on semantic cues is expected to perform better; indeed this is the case. Similarly, MORSE performs better on Turkish, with a 7% absolute margin in terms of F1 score. On the other hand, Morfessor surpasses MORSE in performance on Finnish by a large margin, especially in terms of recall.

Discussion
We hypothesize that the richness of morphology in Finnish led to the suboptimal performance of MORSE. This is because richness in morphology leads to word-level sparsity, which directly leads to: (1) degradation of the quality of word representations, and (2) an increased vocabulary size, exacerbating the issue of limited vocabulary (recall that MORSE was limited to a vocabulary of 1M words). In a language with productive morphology, limiting its vocabulary results in a lower chance of finding morphologically related word pairs. This negatively impacts the training stage of MORSE, which relies on the availability of such pairs. In order to detect the suffix "ly" from the word "cheerfully", MORSE needs to come across "cheerful" as well, and coming across "cheerful" is now a lower-probability event due to high sparsity. This is not as much of an issue for Morfessor under the MDL principle, since it might detect "ly" just by coming across multiple words ending with "ly", even without encountering the base forms of those words. We show how the detection of rules is affected by considering the number of candidate rules detected as well as the number of candidate morphologically related word pairs detected. As shown in Table 4, the number of detected candidate rules and candidate related words decreases with the increase in morphological richness of a language. This confirms our hypothesis; we note that this issue can be directly attributed to the limited vocabulary size in MORSE. With an increase in processing power, and thus larger vocabulary coverage, MORSE is expected to perform better.

Semantically Driven Dataset
The performance of MORSE and Morfessor on SD17 is shown in Table 5. Using MC data (which does not adhere to the compositionality principle) to tune MORSE before evaluating on SD17 (which does adhere to it) is not optimal. Thus, we evaluate MORSE on SD17 using 5-fold cross-validation, where 80% of the dataset is used to tune and 20% to evaluate. Precision, recall, and F1 scores are averaged and reported in Table 5 under the label MORSE-CV.
Based on the results in Table 5, we make the following observations. Comparing MORSE-CV to MORSE reflects the fundamental difference between the SD17 and MC datasets. Knowing the basis of construction of SD17 and the fundamental weaknesses of the MC datasets, we attribute the performance increase to the lack of compositionality in the MC dataset. Comparing MORSE-CV to Morfessor, we observe a significant jump in performance (an increase of 24%). In comparison, the smaller increase on the MC dataset (6%) shows that the Morpho Challenge dataset underestimates the performance gap between Morfessor and MORSE due to its inherent weaknesses.
Since MORSE is equipped with the capability to retrieve full morphemes even when they are not fully present orthographically, a capability that Morfessor lacks, we also evaluate both systems on the canonical version of SD17. The results are reported in Table 6. We notice that evaluating on the canonical form of SD17 gives a further edge to MORSE over Morfessor. For evaluation on the canonical version of SD17, we switch from boundary-level to morpheme-level evaluation, as a method more suitable for Morfessor.
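Morpheme-level evaluation can be sketched as comparing multisets of predicted and gold morphs per word; this is our illustrative implementation, not necessarily the exact scoring used in the experiments:

```python
from collections import Counter

def morpheme_prf(pred, gold):
    """Morpheme-level P/R/F1: compare multisets of predicted vs gold morphs."""
    tp = fp = fn = 0
    for p, g in zip(pred, gold):
        cp, cg = Counter(p), Counter(g)
        overlap = sum((cp & cg).values())   # morphs credited in both
        tp += overlap
        fp += sum(cp.values()) - overlap
        fn += sum(cg.values()) - overlap
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# surface "strafe-d" vs canonical "strafe-ed": only the shared "strafe" is credited
prec, rec, f1 = morpheme_prf([["strafe", "d"]], [["strafe", "ed"]])
```

Unlike boundary-level scoring, this variant can credit a canonical morph such as "ed" even when its surface realization differs, which is why it suits the canonical version of SD17.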
We next compare MORSE against published state-of-the-art results. As one can see in Table 7, MORSE performs significantly better than published state-of-the-art results, most notably Narasimhan et al. (2015), referred to as LLSM in the table. Comparison is also made against the top results in the latest Morpho Challenge: Morfessor S+W and Morfessor S+W+L (Kohonen et al., 2010), and Base Inference (Lignos, 2010).

Handling Compositionality
We compare the performance of MORSE and Morfessor on a set of words made up of morphemes which do not compose the meaning of the word. Since all the boundaries in this dataset are negative, to evaluate both systems (with MORSE tuned on SD17) we only report the number of segmentations made by each system; any segmentation here is an error, so fewer is better.

Discussion
One of the benefits of MORSE over other frameworks such as MDL is its ability to identify the lemma within the segmentation: the lemma is the last non-segmented word in the iterative segmentation process. Hence, an advantage of our framework is that it is easily adaptable into a lemmatizer, or even a stemmer.
Another key aspect, absent in some of the competing systems, is the need for a small tuning dataset. This is a point in favor of completely unsupervised systems such as Morfessor. On the other hand, these hyperparameters allow for flexibility. Figure 2 shows how precision and recall change as a function of the hyperparameter selection (only a subset of the hyperparameters is shown, for display purposes). As one would expect, increasing the hyperparameters generally leads to a stricter search space, and thus increases precision and decreases recall. Putting these results in perspective, the user of MORSE is given the capability of controlling the precision-recall tradeoff based on the needs of the downstream task.
Moreover, to check the level of dependency of MORSE on a set of gold morpheme segmentations for tuning, we check the variation in performance with respect to the size of the tuning data. For the purpose of this experiment, we take an 80-20 split of SD17 and vary the size of the tuning set. We notice that the performance (81.90% F1) reaches a steady state at 20% (≈ 300 gold segmentations) of the tuning data. This reflects a minimal dependency on the tuning dataset.
As for the inference stage of MORSE, the greedy inference approach limits its performance: a wrong segmentation early on will propagate and result in subsequent wrong segmentations. Also, MORSE's limitation to concatenative morphology decreases its efficacy on languages that include non-concatenative morphology. This leaves room for further research on a more optimal inference stage and a more global modeling of orthographic morphological transformations.

Conclusions and Future Work
In this paper, we have presented MORSE, the first morpheme segmenter to consider semantic structure at this scale (local and vocabulary-wide). We showed its superiority over state-of-the-art algorithms using intrinsic evaluations on a variety of languages. We also pinpointed weaknesses in current benchmarking datasets and presented a new dataset free of them. With gains reaching a 24% absolute increase in F1 over Morfessor, this work demonstrates the significance of semantic cues and establishes a new state-of-the-art morpheme segmenter. For future work, we plan to address the limitations of MORSE: its minimal supervision, greedy inference, and concatenative orthographic model. Moreover, we plan to computationally optimize the training stage for the sake of wider adoption by the community.