Automatically Tailoring Unsupervised Morphological Segmentation to the Language

Morphological segmentation is beneficial for several natural language processing tasks dealing with large vocabularies. Unsupervised methods for morphological segmentation are essential for handling a diverse set of languages, including low-resource languages. Eskander et al. (2016) introduced a Language-Independent Morphological Segmenter (LIMS) using Adaptor Grammars (AG) based on the best-on-average performing AG configuration. However, while LIMS works best on average and outperforms other state-of-the-art unsupervised morphological segmentation approaches, it does not provide the optimal AG configuration for five out of the six development languages. We propose two language-independent classifiers that enable the selection of the optimal or nearly optimal configuration for the morphological segmentation of unseen languages.


Introduction
As natural language processing turns its attention to a growing number of languages, including low-resource languages, unsupervised morphological segmentation remains an important area of study. For most of the languages of the world, we do not have morphologically annotated resources. However, many human language technologies profit from morphological segmentation, for example machine translation (Nguyen et al., 2010; Ataman et al., 2017) and speech recognition (Narasimhan et al., 2014).
In this paper, we build on previous work on unsupervised morphological segmentation using Adaptor Grammars (AGs) (Johnson, 2008; Sirts and Goldwater, 2013; Eskander et al., 2016), a class of nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs) (Johnson et al., 2007), where the PCFG is typically a morphological grammar that specifies word structure. Specifically, we extend the work of Eskander et al. (2016), who investigate a large space of parameters for Adaptor Grammars related to (i) the underlying context-free grammar and (ii) the use of a "Cascaded" system in which one grammar chooses affixes to be seeded into another, simulating the situation where scholar knowledge is available. Their results on a development set of six languages (English, German, Finnish, Turkish, Estonian and Zulu) show that the best-performing AG-based configuration (grammar and learning setting) differs from language to language. For processing unseen languages, Eskander et al. (2016) proposed the Language-Independent Morphological Segmenter (LIMS) based on the best-on-average performing configuration in leave-one-out cross validation on the development languages.
However, while LIMS works best on average and has been shown to outperform other state-of-the-art unsupervised morphological segmentation systems (Eskander et al., 2016), it is not the optimal configuration for any of the development languages except Zulu. Thus, in this paper we propose an approach to automatically select the optimal or nearly optimal language-independent configuration for the morphological segmentation of unseen languages. We train two classifiers on the development languages used by Eskander et al. (2016) to make choices for unseen languages (Section 3). We show that we can choose the best parameter settings for the six development languages in leave-one-out cross validation, as well as for an unseen test language (Arabic).

Problem Definition and Dataset
Adaptor Grammars (AGs) have been used successfully for unsupervised morphological segmentation (Johnson, 2008; Sirts and Goldwater, 2013; Eskander et al., 2016), which is the task of breaking down words in a language into a sequence of morphs. An AG model typically has two main components: a PCFG and an adaptor that adapts the probabilities assigned to individual subtrees in the grammar. For the task of morphological segmentation, the PCFG is typically a morphological grammar that specifies word structure. Given a list of input strings, AGs can learn latent tree structures. Eskander et al. (2016) developed several AG models based on different underlying context-free grammars and learning settings, which we briefly introduce below.
Grammars. Eskander et al. (2016) introduce a set of nine grammars (see Table 1) designed along three dimensions: 1) how the grammar generates the prefix, stem and suffix (morph vs. tripartite), 2) the levels that are represented as nonterminals (e.g., compounds, morphs and sub-morphs) and 3) the levels at which the segmentation into output morphs is produced. For example, in the PrStSu+SM grammar a word is modeled as a prefix, a stem and a suffix, where the prefix and suffix are sequences of zero or more morphs, a morph is a sequence of sub-morphs, and the segmentation is based on the prefix, suffix and stem level. The PrStSu2a+SM grammar is similar, but a word is modeled as a prefix and a stem-suffix sequence, where the prefix is optional and stem-suffix is either a stem or a stem and a suffix (see Eskander et al. (2016) for more details). Figure 1 shows the trees for segmenting the word replayings using the PrStSu+SM and PrStSu2a+SM grammars.
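For concreteness, the word structure encoded by PrStSu+SM can be sketched as context-free rules roughly like the following (our own simplified rendering based on the description above, not the exact grammar of Eskander et al. (2016); adapted nonterminals and the character level are abbreviated):

```python
# An illustrative, simplified skeleton of the PrStSu+SM grammar.
# This is a sketch, not the grammar actually used by Eskander et al. (2016).
PRSTSU_SM = """
Word         -> Prefix Stem Suffix
Prefix       -> PrefixMorphs?         # zero or more prefix morphs
Suffix       -> SuffixMorphs?         # zero or more suffix morphs
PrefixMorphs -> Morph | PrefixMorphs Morph
SuffixMorphs -> Morph | SuffixMorphs Morph
Stem         -> SubMorphs
Morph        -> SubMorphs             # each morph is a run of sub-morphs
SubMorphs    -> SubMorph | SubMorphs SubMorph
SubMorph     -> Char+                 # a sub-morph is a character sequence
"""
```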
Learning Settings. Eskander et al. (2016) consider three learning settings: Standard (Std), Scholar-Seeded Knowledge (Sch) and Cascaded (Casc). In the Standard setting, no scholar knowledge is introduced into the grammars, while in the Scholar-Seeded Knowledge setting the grammars are augmented, before learning happens, with scholar knowledge in the form of information about affixes gathered from grammar books. The Cascaded setting approximates the effect of scholar-seeded knowledge by first using a high-precision AG to derive a set of affixes and then inserting those affixes into the grammars used in a second learning step. Eskander et al. (2016) show that segmentation performance differs significantly across grammars, learning settings and languages. For instance, the best performance for German is obtained by running the Standard PrStSu+SM configuration, while the Cascaded PrStSu2a+SM configuration produces the best segmentation for Finnish. That is, there is no setup that yields the optimal segmentation for all languages. For the processing of an unseen language (i.e., one not part of the development), Eskander et al. (2016) recommend using the Cascaded PrStSu+SM configuration (referred to as LIMS: Language-Independent Morphological Segmenter), as it performs best on average in leave-one-out cross validation on the development languages.
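The two-step Cascaded procedure can be summarized in a minimal sketch; the three helper functions (run_ag, extract_affixes, seed_affixes) are hypothetical names standing in for the corresponding AG training and affix-harvesting steps:

```python
def cascaded_segmentation(words, grammar, run_ag, extract_affixes, seed_affixes):
    """Sketch of the Cascaded learning setting.
    run_ag: trains an AG model and returns segmentations (hypothetical).
    extract_affixes: harvests high-confidence affixes from them (hypothetical).
    seed_affixes: augments the grammar with those affixes (hypothetical)."""
    first_pass = run_ag(words, grammar)       # step 1: high-precision AG run
    affixes = extract_affixes(first_pass)     # harvested affix set
    seeded = seed_affixes(grammar, affixes)   # mimic scholar-seeded knowledge
    return run_ag(words, seeded)              # step 2: final segmentation
```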
Problem definition. While LIMS works best on average, it is not the optimal configuration for any of the development languages except Zulu. Thus, in this paper, we address the problem of automatically selecting the optimal or nearly optimal language-independent (Standard or Cascaded) configuration for the morphological segmentation of unseen languages.
We use the six development languages used by Eskander et al. (2016) as well as Arabic as a fully unseen language. The data for English, German, Finnish, Turkish and Estonian is from Morpho Challenge, and the data for Zulu is from the Ukwabelana corpus (Spiegler et al., 2010). For the unseen language we choose Arabic, as it belongs to the Semitic family while none of the development languages does. We obtain the Arabic data by randomly selecting 50K words from the PATB corpus (Maamouri et al., 2004). Table 2 lists the sources and sizes of our corpora.

Method
Since we have nine grammars to choose from (see Table 1) and three learning settings, we use a supervised machine learning approach to select the best configuration. Since we only have six development languages, we split the classification task into two binary classification tasks: Approach Classification (Standard (Std) vs. Cascaded (Casc)) and Grammar Classification (PrStSu+SM vs. PrStSu2a+SM), and run leave-one-out cross validation on the development languages for both tasks.

Feature Generation
In order to generate morphological features for the classification tasks, we run a phase of AG segmentation using the Standard PrStSu+SM configuration with only 50 optimization iterations (i.e., one tenth of the number of iterations in a complete segmentation process as reported by Eskander et al. (2016)), as the purpose is to quickly generate morphological clues that help the classification rather than to obtain a highly optimized segmentation. We choose this particular configuration due to its high efficiency across all languages in addition to its relatively small execution time. Upon generating the initial segmentation, we extract 14 morphological features for classification. The features are listed in Table 4. We only consider affixes that appear more than 10 times in the segmentation output, where a simple affix contains only one morpheme, while a complex affix contains one or more simple affixes.
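The sketch below illustrates this kind of affix-based feature extraction; the specific features shown are hypothetical stand-ins, not the actual 14 features of Table 4:

```python
from collections import Counter

def affix_features(segmentations, min_count=10):
    """Illustrative affix-based features from an initial AG segmentation.
    segmentations: list of (prefix_morphs, stem, suffix_morphs) tuples.
    The returned features are examples only; the real feature set is in
    Table 4 of the paper."""
    prefixes, suffixes = Counter(), Counter()
    for pre, _stem, suf in segmentations:
        if pre:
            prefixes["".join(pre)] += 1   # complex affix = one or more simple affixes
        if suf:
            suffixes["".join(suf)] += 1
    # Keep only affixes seen more than min_count times, as in the paper.
    frequent_pre = {a for a, c in prefixes.items() if c > min_count}
    frequent_suf = {a for a, c in suffixes.items() if c > min_count}
    return {
        "num_frequent_prefixes": len(frequent_pre),
        "num_frequent_suffixes": len(frequent_suf),
        "avg_suffix_len": (sum(map(len, frequent_suf)) / len(frequent_suf))
                          if frequent_suf else 0.0,
    }
```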

Classification
We experiment with three classification methods, K-Nearest Neighbors (KNN), Naive Bayes (NB) and Random Forest (RF), for both the Approach (Std vs. Casc) and Grammar (PrStSu+SM vs. PrStSu2a+SM) classification tasks. We conduct the two classification tasks separately and then combine the outcomes to obtain the best configuration.
In the training phase, we perform leave-one-out cross validation on the six development languages. In each of the six folds of the cross validation, we choose one language in turn as the test language. We use the training and development corpora listed in Table 2 for training the models and evaluating the classifiers, respectively.
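A minimal scikit-learn sketch of this evaluation setup follows; the feature matrix and labels below are random placeholders, not the paper's actual values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# X: 14 morphological features per development language (6 x 14).
# y: binary gold labels for one task, e.g., Approach (0 = Std, 1 = Casc).
# Placeholder values for illustration only.
X = np.random.rand(6, 14)
y = np.array([0, 1, 1, 0, 1, 1])

for clf in (KNeighborsClassifier(n_neighbors=3), GaussianNB(),
            RandomForestClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print(type(clf).__name__, scores.mean())
```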
Table 5 shows the final system output after combining the outcomes of the Approach classification and the Grammar classification. KNN predicts the right configuration consistently, while NB picks the wrong grammars for Finnish and Estonian, and RF predicts the wrong approach and grammar for Estonian. Thus, the overall accuracies of KNN, NB and RF are 100%, 66.7% and 83.3%, respectively, which suggests using KNN for classification. For an unseen language, we thus first run the Standard PrStSu+SM configuration for 50 optimization iterations to obtain the morphological features, and then run the KNN classifier on those features to obtain the final AG configuration.
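Combining the two binary predictions into a final configuration amounts to a simple lookup (a sketch, assuming the two trained classifier objects from above and a 14-dimensional feature vector):

```python
def select_configuration(features, approach_clf, grammar_clf):
    """Combine the two binary predictions into one AG configuration.
    Sketch only; approach_clf and grammar_clf are assumed to be trained
    as in the cross-validation snippet above."""
    approach = "Cascaded" if approach_clf.predict([features])[0] else "Standard"
    grammar = "PrStSu2a+SM" if grammar_clf.predict([features])[0] else "PrStSu+SM"
    return approach, grammar
```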
Studying the correlation between the morphological features and the output shows that features F14, F07, F11 and F03 in Table 4 are the most significant ones for the selection of the best configuration. This highlights the strong reliance on information about suffixes, as three of the four features, namely F14, F07 and F03, are suffix-related.

Evaluation
We report results using the EMMA F-measure (Spiegler and Monson, 2010), a metric that has been shown to be particularly adequate for evaluating unsupervised methods for morphological segmentation and superior to the metric used in the Morpho Challenge competition series.
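At its core, EMMA finds a one-to-one mapping between predicted and gold morpheme labels that maximizes agreement before scoring. The following toy sketch illustrates only that matching step (our own simplification; see Spiegler and Monson (2010) for the actual metric):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_label_mapping(pred_labels, gold_labels, analyses):
    """Toy illustration of EMMA's matching idea: map each predicted
    morpheme label to at most one gold label, maximizing co-occurrence.
    analyses: per-word pairs (set of predicted labels, set of gold labels).
    Not the full EMMA metric."""
    cooc = np.zeros((len(pred_labels), len(gold_labels)))
    for pred_set, gold_set in analyses:
        for p in pred_set:
            for g in gold_set:
                cooc[pred_labels.index(p), gold_labels.index(g)] += 1
    rows, cols = linear_sum_assignment(-cooc)  # maximize total co-occurrence
    return {pred_labels[r]: gold_labels[c] for r, c in zip(rows, cols)}
```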
Results on an unseen language. We evaluate our system on Arabic, a language that was not part of the development of the system and that, unlike all the development languages, belongs to the Semitic family. Following the procedure described above, we first run the Standard PrStSu+SM configuration for 50 optimization iterations to obtain the morphological features and then run the KNN classifier on those features. Table 6 lists the EMMA F-scores for Arabic for all grammars in both the Standard and Cascaded setups. Our KNN classifier picks the Standard PrStSu+SM configuration, which yields the best segmentation among all the configurations, with an EMMA F-score of 0.701.
Comparison with existing unsupervised approaches. Table 7 compares the performance of the selected configurations of our system (Table 5) to three other systems: Morfessor (Creutz and Lagus, 2007), MorphoChain (Narasimhan et al., 2015) and LIMS (Eskander et al., 2016) (where the Cascaded PrStSu+SM configuration is chosen). Our system achieves EMMA F-score error reductions of 17.1%, 29.2% and 6.3% over Morfessor, MorphoChain and LIMS, respectively, on average across the development languages and Arabic. It is also only 0.003 average EMMA F-score behind an oracle system that always selects the best configuration (indicated as Best). We are not able to compare against the system presented by Wang et al. (2016), as neither their system nor their data is currently available.
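The error-reduction figures follow from the usual convention error = 1 - F (our assumption; the paper does not spell out the formula). A one-line check with illustrative values:

```python
def error_reduction(f_base, f_ours):
    """Relative error reduction, assuming error = 1 - F."""
    return (f_ours - f_base) / (1.0 - f_base)

# Illustrative values only: 0.65 -> 0.70 is a ~14.3% error reduction.
print(f"{error_reduction(0.65, 0.70):.1%}")
```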

Related Work
The first work that utilizes AGs for unsupervised morphological segmentation is introduced by Johnson (2008), while Sirts and Goldwater (2013) propose minimally supervised AG models of different tree structures for morphological segmentation. The most recent work on using AGs for morphological segmentation is by Eskander et al. (2016), who experiment with several AG models based on different underlying grammars and learning settings. They also investigate the use of scholar knowledge seeded into the grammar trees; this knowledge can be gathered from grammar books or automatically generated via bootstrapping. This paper extends their work by proposing a machine learning approach to select the best language-independent model for each language.
In addition to AGs, several other models have been successfully used for unsupervised morphological segmentation, such as generative probabilistic models (utilized by Morfessor (Creutz and Lagus, 2007)) and log-linear models using contextual and global features (Poon et al., 2009). Narasimhan et al. (2015) use a discriminative model for unsupervised morphological segmentation that integrates orthographic and semantic properties of words. The model learns morphological chains, where a chain extends a base form available in the lexicon.
Another notable recent work is introduced by Wang et al. (2016), who use neural networks for unsupervised segmentation, building LSTM (Hochreiter and Schmidhuber, 1997) architectures that learn word structures in order to predict morphological boundaries. A variation of this approach is presented by Yang et al. (2017), who use partial-word information in the form of character bigram embeddings and evaluate their work on Chinese.

Conclusion and Future Work
We have shown that our language-independent classifiers improve on the state-of-the-art unsupervised morphological segmentation approach proposed by Eskander et al. (2016) by making choices that are optimized for a given language, rather than choosing parameters for all languages based on averages over the development languages.
In future work, we plan to conduct an extrinsic evaluation on tasks that could benefit from morphological segmentation, such as machine translation, information retrieval and summarization. We also plan to optimize the segmentation models for those specific tasks.

Figure 1 :
Grammar trees for the word replayings: (a) PrStSu+SM, (b) PrStSu2a+SM.

Table 1 :
Grammar representations. Compound = upper-level representation of the word as a sequence of compounds; Morph = affix/morph representation as a sequence of morphs; SubMorph (SM) = lower-level representation of characters as a sequence of sub-morphs. "+" denotes one or more and "?" denotes optional.

Table 2 :
Data source and size information. TRAIN = training corpus, DEV = development corpus and TEST = test corpus.

Table 3 lists the best configurations and the gold class labels (for both Approach and Grammar) for the six development languages.

Table 3 :
The best configurations and the gold class labels for both the Approach classification and Grammar classification for the six development languages.

Table 6 :
Adaptor-grammar results (EMMA F-scores) for the Standard and Cascaded setups for Arabic. Boldface indicates the best configuration, which is also the choice of our system.

Table 7 :
The performance of our system (Ours) compared to Morfessor, MorphoChain, LIMS and an upper-bound system (Best), using EMMA F-scores.