Neural Morphological Analysis: Encoding-Decoding Canonical Segments

Canonical morphological segmentation aims to divide words into a sequence of standardized segments. In this work, we propose a character-based neural encoder-decoder model for this task. Additionally, we extend our model to include morpheme-level and lexical information through a neural reranker. We set a new state of the art for the task, improving on previous results by up to 21% accuracy. Our experiments cover three languages: English, German and Indonesian.


Introduction
Morphological segmentation aims to divide words into morphemes, meaning-bearing sub-word units. Segmentations have found use in a diverse set of NLP applications, e.g., automatic speech recognition (Afify et al., 2006), keyword spotting (Narasimhan et al., 2014), machine translation (Clifton and Sarkar, 2011) and parsing (Seeker and Çetinoğlu, 2015). In the literature, most research has traditionally focused on surface segmentation, whereby a word w is segmented into a sequence of substrings whose concatenation is the entire word; see Ruokolainen et al. (2016) for a survey. In contrast, we consider canonical segmentation: w is divided into a sequence of standardized segments. To make the difference concrete, consider the following example: the surface segmentation of the complex English word achievability is achiev+abil+ity, whereas its canonical segmentation is achieve+able+ity, i.e., we restore the alterations made during word formation.
Canonical versions of morphological segmentation have been introduced multiple times in the literature (Kay, 1977; Naradowsky and Goldwater, 2009; Cotterell et al., 2016). Canonical segmentation has several representational advantages over surface segmentation, e.g., whether two words share a morpheme is no longer obfuscated by orthography. However, it also introduces a hard algorithmic challenge: in addition to segmenting a word, we must reverse orthographic changes, e.g., mapping achievability → achieveableity.
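The distinction between the two segmentation types can be checked mechanically. The following minimal sketch (the function name is ours and purely illustrative) tests the defining property of a surface segmentation, namely that its segments concatenate back to the original word, using the achievability example from above.

```python
def is_surface_segmentation(word, segments):
    """Return True if the segments concatenate back to the word exactly."""
    return "".join(segments) == word


word = "achievability"
surface = ["achiev", "abil", "ity"]     # substrings of the word
canonical = ["achieve", "able", "ity"]  # standardized segments

print(is_surface_segmentation(word, surface))    # True
print(is_surface_segmentation(word, canonical))  # False
```

The canonical segments fail the check because the orthographic changes made during word formation have been reversed.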
Computationally, canonical segmentation can be seen as a sequence-to-sequence problem: we must map a word form to a canonicalized version with segmentation boundaries.
Inspired by the recent success of neural encoder-decoder models (Sutskever et al., 2014) for sequence-to-sequence problems in NLP, we design a neural architecture for the task. However, a naïve application of the encoder-decoder model ignores much of the linguistic structure of canonical segmentation: it cannot directly model the individual canonical segments, e.g., it cannot easily produce segment-level embeddings. To solve this, we use a neural reranker on top of the encoder-decoder, allowing us to embed both characters and entire segments. The combined approach outperforms the state of the art by a wide margin (up to 21% accuracy) in three languages: English, German and Indonesian.

Neural Canonical Segmentation
We begin by formally describing the canonical segmentation task.
We take a probabilistic approach and, thus, attempt to learn a distribution p(c | w). Our model consists of two parts. First, we apply an encoder-decoder recurrent neural network (RNN) (Bahdanau et al., 2014) to the sequence of characters of the input word to obtain candidate canonical segmentations. Second, we define a neural reranker that allows us to embed individual morphemes and chooses the final answer from within a set of candidates generated by the encoder-decoder.

Neural Encoder-Decoder
Our encoder-decoder is based on Bahdanau et al. (2014)'s neural machine translation model (github.com/mila-udem/blocks-examples/tree/master/machine_translation). The encoder is a bidirectional gated RNN (GRU) (Cho et al., 2014b). Given a word w ∈ Σ*, the input to the encoder is the sequence of characters of w, represented as one-hot vectors. The decoder defines a conditional probability distribution over c ∈ Ω* given w:
\[ p_{ED}(c \mid w) = \prod_{t=1}^{|c|} p(c_t \mid c_1, \ldots, c_{t-1}, w) = \prod_{t=1}^{|c|} g(c_{t-1}, s_t, a_t) \]
where g is a nonlinear activation function, s_t is the state of the decoder at t and a_t is a weighted sum of the |w| states of the encoder. The state of the encoder for w_i is the concatenation of the forward and backward hidden states $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ for w_i. An overview of how the attention weights and the weighted sum a_t are included in the architecture can be seen in Figure 1. The attention weights α_{t,i} at each timestep t are computed based on the respective encoder state and the decoder state s_t. See Bahdanau et al. (2014) for further details.
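To illustrate how the weighted sum a_t is formed, the following sketch computes attention weights and the resulting context vector in plain Python. It is an illustration under stated assumptions, not the implementation used in our experiments: the scoring function is a simple dot product (Bahdanau et al. (2014) use a small feed-forward network instead), and all vectors are hand-picked toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, decoder_state):
    """Return the attention weights and the weighted sum of encoder states."""
    # Assumption: dot-product scoring stands in for the learned scoring network.
    scores = [sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
              for h in encoder_states]
    alpha = softmax(scores)
    dim = len(encoder_states[0])
    a_t = [sum(alpha[i] * encoder_states[i][d] for i in range(len(encoder_states)))
           for d in range(dim)]
    return alpha, a_t


# Toy forward and backward hidden states for a three-character word.
forward = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
backward = [[0.2, 0.0], [-0.1, 0.6], [0.3, 0.3]]
# Each encoder state is the concatenation of forward and backward states.
encoder_states = [f + b for f, b in zip(forward, backward)]
decoder_state = [0.5, -0.5, 0.25, 1.0]

alpha, a_t = attention_context(encoder_states, decoder_state)
print(alpha)  # one weight per input character, summing to 1
print(a_t)    # context vector with the same dimension as an encoder state
```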

Neural Reranker
The encoder-decoder, while effective, predicts each output character in Ω sequentially. It does not use explicit representations for entire segments and is incapable of incorporating simple lexical information, e.g., does this canonical segment occur as an independent word in the lexicon? Therefore, we extend our model with a reranker.
The reranker rescores canonical segmentations from a candidate set, which in our setting is sampled from p_ED. Let the sample set be S_w = {k^(i)}_{i=1}^N, where k^(i) ∼ p_ED(c | w). We define the neural reranker as
\[ p_\theta(c \mid w) = \frac{\exp\left(u^\top \tanh(W v_c) + \tau \log p_{ED}(c \mid w)\right)}{Z_\theta(w)} \]
where v_c = Σ_{i=1}^n v_{σ_i} (with c = σ_1+σ_2+...+σ_n) and v_{σ_i} is a one-hot morpheme embedding of σ_i with an additional binary dimension marking if σ_i occurs independently as a word in the language. The partition function is Z_θ(w) and the parameters are θ = {u, W, τ}. The parameters W and u are projection and hidden layers, respectively, of a multi-layered perceptron and τ can be seen as a temperature parameter that anneals the encoder-decoder model p_ED (Kirkpatrick, 1984). We define the partition function over the sample set S_w:
\[ Z_\theta(w) = \sum_{k \in S_w} \exp\left(u^\top \tanh(W v_k) + \tau \log p_{ED}(k \mid w)\right) \]
The reranking model's ability to embed morphemes is important for morphological segmentation since we often have strong corpus-level signals. The reranker also takes into account the character-level information through the score of the encoder-decoder model. Due to this combination we expect stronger performance.
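Assuming the reranker scores a candidate c with u^T tanh(W v_c) + τ log p_ED(c | w) and normalizes over the sample set S_w, the computation can be sketched as follows. All function names, the toy vocabulary, the weights W and u, and the encoder-decoder probabilities are hypothetical values chosen by hand for illustration; in practice the parameters are learned.

```python
import math


def segment_vector(segment, vocab, lexicon):
    """One-hot morpheme vector plus a binary 'occurs as a word' dimension."""
    vec = [0.0] * (len(vocab) + 1)
    if segment in vocab:
        vec[vocab[segment]] = 1.0
    vec[-1] = 1.0 if segment in lexicon else 0.0
    return vec


def candidate_vector(candidate, vocab, lexicon):
    """v_c: the sum of the segment vectors of a candidate such as 'zicken+ig'."""
    vecs = [segment_vector(s, vocab, lexicon) for s in candidate.split("+")]
    return [sum(col) for col in zip(*vecs)]


def rerank(log_p_ed, vocab, lexicon, W, u, tau):
    """Return reranker probabilities normalized over the sample set."""
    scores = {}
    for cand, logp in log_p_ed.items():
        v_c = candidate_vector(cand, vocab, lexicon)
        hidden = [math.tanh(sum(w * x for w, x in zip(row, v_c))) for row in W]
        scores[cand] = sum(u_i * h_i for u_i, h_i in zip(u, hidden)) + tau * logp
    m = max(scores.values())  # subtract the maximum for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return {cand: math.exp(s - m) / z for cand, s in scores.items()}


vocab = {"zicken": 0, "zick": 1, "ig": 2}
lexicon = {"zicken"}  # zicken is a German word, zick is not
# Toy encoder-decoder log-probabilities that prefer the wrong candidate.
log_p_ed = {"zicken+ig": math.log(0.3), "zick+ig": math.log(0.5)}
W = [[0.0, 0.0, 0.0, 2.0], [0.5, -0.5, 0.0, 0.0]]  # hand-picked, not learned
u = [1.5, 1.0]

probs = rerank(log_p_ed, vocab, lexicon, W, u, tau=1.0)
print(max(probs, key=probs.get))  # zicken+ig
```

With these toy weights, the lexical dimension outweighs the encoder-decoder's preference, so the candidate containing an attested word is ranked first.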

Related Work
Various approaches to morphological segmentation have been proposed in the literature. In the unsupervised realm, most work has been based on the principle of minimum description length (Cover and Thomas, 2012), e.g., LINGUISTICA (Goldsmith, 2001; Lee and Goldsmith, 2016) or MORFESSOR (Creutz and Lagus, 2002; Creutz et al., 2007; Poon et al., 2009). MORFESSOR was later extended to a semi-supervised version by Kohonen et al. (2010). Supervised approaches have also been considered. Most notably, Ruokolainen et al. (2013) developed a supervised approach for morphological segmentation based on conditional random fields (CRFs), which they later extended to work in a semi-supervised way (Ruokolainen et al., 2014) using letter successor variety features (Hafer and Weiss, 1974). Similarly, Cotterell et al. (2015) improved performance with a semi-Markov CRF.
More recently, Wang et al. (2016) achieved state-of-the-art results on surface morphological segmentation using a window LSTM. Even though Wang et al. (2016) also employ a recurrent neural network, our approach differs in that we focus on canonical morphological segmentation rather than surface morphological segmentation.
Naturally, our approach is also relevant to other applications of recurrent neural network transduction models (Sutskever et al., 2014; Cho et al., 2014a). In addition to machine translation (Bahdanau et al., 2014), these models have been successfully applied to many areas of NLP, including parsing (Vinyals et al., 2015), morphological reinflection (Kann and Schütze, 2016) and automatic speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013).

Experiments
To enable comparison to earlier work, we use a dataset that was prepared by Cotterell et al. (2016) for canonical segmentation.

Languages
The dataset we work on covers three languages: English, German and Indonesian. English and German are West Germanic languages, with the former being an official language in nearly 60 different states and the latter being mainly spoken in Western Europe. Indonesian (or Bahasa Indonesia) is the official language of Indonesia. Cotterell et al. (2016) report the best experimental results for Indonesian, followed by English and finally German. The high error rate for German might be caused by its richness in orthographic changes. In contrast, Indonesian morphology is comparatively simple.

Corpora
The data for the English language was extracted from segmentations derived from the CELEX database (Baayen et al., 1993). The German data was extracted from DerivBase (Zeller et al., 2013), which provides a collection of derived forms together with the transformation rules, which were used to create the canonical segmentations. Finally, the data for Bahasa Indonesia was collected by using the output of the MORPHIND analyzer (Larasati et al., 2011), together with an open-source corpus of Indonesian. For each language we used the 10,000 forms that were selected at random by Cotterell et al. (2016) from a uniform distribution over types to form the corpus. Following their setup, we perform our experiments on 5 splits of the data into 8,000 training forms, 1,000 development forms and 1,000 test forms and report averages.

Training
We train an ensemble of five encoder-decoder models. The encoder and decoder RNNs each have 100 hidden units. Embedding size is 300. We use ADADELTA (Zeiler, 2012) with a minibatch size of 20. We initialize all weights (encoder, decoder, embeddings) to the identity matrix and the biases to zero (Le et al., 2015). All models are trained for 20 epochs. The hyperparameter values are taken from Kann and Schütze (2016) and kept unchanged for the application to canonical segmentation described here.
To train the reranking model, we first gather the sample set S_w on the training data. We take 500 individual samples but, because the same form is often sampled multiple times, |S_w| ≈ 5.
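The effect of repeated samples on the size of the sample set can be illustrated with a short sketch. The sampler below is a hypothetical stand-in for the encoder-decoder: it draws from a fixed, hand-picked distribution over three candidates rather than from a trained model, so the resulting counts carry no empirical meaning.

```python
import random
from collections import Counter


def make_toy_sampler(candidates, weights, seed=0):
    """Return a function that draws one candidate per call (stand-in for p_ED)."""
    rng = random.Random(seed)

    def sample(word):
        # The word is ignored here; a real model would condition on it.
        return rng.choices(candidates, weights=weights, k=1)[0]

    return sample


def build_sample_set(sample, word, num_samples=500):
    """Draw num_samples candidates and collapse duplicates into counts."""
    return Counter(sample(word) for _ in range(num_samples))


candidates = ["cutter", "cutt+er", "cut+er"]
sample = make_toy_sampler(candidates, weights=[0.45, 0.45, 0.10])
sample_set = build_sample_set(sample, "cutter")

print(len(sample_set))            # number of distinct candidates, at most 3 here
print(sample_set.most_common(1))  # the most frequently sampled candidate
```

Keeping the counts alongside the distinct candidates is a design choice of this sketch; the reranker itself only needs the distinct candidates and their encoder-decoder scores.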
We optimize the log-likelihood of the training data using ADADELTA. For generalization, we employ L2 regularization and we perform grid search to determine the coefficient λ ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}. To decode the model, we again take 500 samples to populate S_w and select the best segmentation.
Baselines. Our first baseline is the joint transduction and segmentation model (JOINT) of Cotterell et al. (2016). It is the current state of the art on the datasets we use and the task of canonical segmentation in general. This model uses a jointly trained, separate transduction and segmentation component. Importantly, the joint model of Cotterell et al. (2016) already contains segment-level features. Thus, reranking this baseline would not provide a similar boost.
Our second baseline is a weighted finite-state transducer (WFST) (Mohri et al., 2002) with a log-linear parameterization (Dreyer et al., 2008), again taken from Cotterell et al. (2016). The WFST baseline is particularly relevant because, like our encoder-decoder, it formulates the problem directly as a string-to-string transduction.
Evaluation Metrics. We follow Cotterell et al. (2016) and use the following evaluation measures: error rate, edit distance and morpheme F1. Error rate is defined as 1 minus the proportion of guesses that are completely correct. Edit distance is the Levenshtein distance between guess and gold standard. For this, guess and gold are each represented as one string with a distinguished character denoting the segment boundaries. Morpheme F1 compares the morphemes in guess and gold. Precision (resp. recall) is the proportion of morphemes in guess (resp. gold) that occur in gold (resp. guess).
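The three measures can be sketched in a few lines of Python. This is an illustrative implementation of the definitions above, not the evaluation script of Cotterell et al. (2016); in particular, computing F1 per word and using "+" as the boundary character are assumptions of this sketch.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(guesses, golds):
    """1 minus the proportion of guesses that are completely correct."""
    correct = sum(g == t for g, t in zip(guesses, golds))
    return 1.0 - correct / len(golds)


def morpheme_f1(guess, gold, boundary="+"):
    """F1 over the morphemes of a single guess/gold pair."""
    guess_m = guess.split(boundary)
    gold_m = gold.split(boundary)
    precision = sum(m in gold_m for m in guess_m) / len(guess_m)
    recall = sum(m in guess_m for m in gold_m) / len(gold_m)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


guess, gold = "fügsam+keit", "fügen+sam+keit"
print(levenshtein(guess, gold))  # 3
print(morpheme_f1(guess, gold))  # approximately 0.4
print(error_rate([guess, "cut+er"], [gold, "cut+er"]))  # 0.5
```

In the example, only keit is shared between guess and gold, giving a precision of 1/2 and a recall of 1/3.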

Results
The results of the canonical segmentation experiment in Table 1 show that both of our models improve over all baselines. The encoder-decoder alone has a .02 (English), .15 (German) and .01 (Indonesian) lower error rate than the best baseline. The encoder-decoder improves most for the language for which the baselines did worst. This suggests that, for more complex languages, a neural network model might be a good choice. The reranker achieves an additional improvement of .04 to .06 in error rate. This is likely due to the additional information the reranker has access to: morpheme embeddings and existing words.
The upper bound we report is also important. It shows the maximum performance the reranker could achieve, i.e., it evaluates the best solution that appears in the set of candidate answers for the reranker. The right answer is contained in ≥ 94% of the sample sets. Note that, even though the upper bound goes up with the number of samples we take, there is no guarantee for any finite number of samples that they will contain the true answer. Thus, we would need to take an infinite number of samples to get a perfect upper bound. However, as the current upper bound is quite high, the encoder-decoder proves to be an appropriate model for the task. Due to the large gap between the performance of the encoder-decoder and the upper bound, a better reranker could further increase performance. We will investigate ways to improve the reranker in future work.
Error analysis. For representative samples, we give the erroneous segmentation produced by our method (E) and the correct analysis (G for gold).
We first analyze cases in which the right answer does not appear at all in the samples drawn from the encoder-decoder.
These include problems with umlauts in German (G: verflüchtigen → ver+flüchten+ig, E: verflucht+ig) and orthographic changes at morpheme boundaries (G: cutter → cut+er, E: cutter or cutt+er, sampled with similar frequency). There are also errors that are due to problems with the annotation, e.g., the following two gold segmentations are arguably incorrect: tec → detective and syrerin → syr+er+in (syr is neither a word nor an affix in German).
In other cases, the encoder-decoder does find the right solution (G), but gives a higher probability to an incorrect analysis (E). Examples are a wrong split into adjectives or nouns instead of verbs (G: fügsamkeit → fügen+sam+keit, E: fügsam+keit), the other way around (G: zähler → zahl+er, E: zählen+er), cases where the wrong morphemes are chosen (G: precognition → pre+cognition, E: precognit+ion), difficult cases where letters have to be inserted (G: redolence → redolent+ence, E: re+dolence) or words the model does not split up, even though they should be (G: additive → addition+ive, E: additive).
Based on its access to lexical information and morpheme embeddings, the reranker is able to correct some of the errors made by the encoder-decoder. Examples are G: geschwisterpärchen → geschwisterpaar+chen, E: geschwisterpar+chen (geschwisterpaar is a word in German but geschwisterpar is not) or G: zickig → zicken+ig, E: zick+ig (with zicken, but not zick, being a German word).
Finally, we want to know if segments that appear in the test set without being present in the training set are a source of errors. To investigate this, we split the test samples into two groups: the first group contains the samples for which our system finds the right answer, and the second contains all other samples. We compare the percentage of segments that do not appear in the training data for both groups. We use the German data as an example; the results are shown in Table 2. First, roughly a third of all segments do not appear in the training data. This is mainly due to unseen lemmas, as their stems are naturally unknown to the system. However, the correctly solved samples contain nearly 10 percentage points more unseen segments. As the average number of segments per word for wrong and right solutions (2.44 and 2.11, respectively) does not differ by much, it seems unlikely that many errors are caused by unknown segments.
Table 2: Percentage of segments that do not appear in the training data (German).
wrong samples    27.33 (.02)
right samples    36.60 (.01)
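The comparison of unseen segments between the two groups can be sketched as follows. The segmentations below are toy examples chosen for illustration and are not taken from the dataset; the function name is likewise hypothetical.

```python
def unseen_segment_rate(segmentations, train_segments):
    """Percentage of segments in a group that never occur in the training data."""
    segments = [s for seg in segmentations for s in seg.split("+")]
    unseen = sum(s not in train_segments for s in segments)
    return 100.0 * unseen / len(segments)


# Toy training data: collect every segment seen during training.
train = ["zahl+er", "fügen+sam+keit"]
train_segments = {s for seg in train for s in seg.split("+")}

# Toy test groups (gold segmentations), split by whether the system was right.
right_group = ["zahl+ung", "spiel+er"]
wrong_group = ["fügen+keit"]

print(unseen_segment_rate(right_group, train_segments))  # 50.0
print(unseen_segment_rate(wrong_group, train_segments))  # 0.0
```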

Conclusion and Future Work
We developed a model consisting of an encoder-decoder and a neural reranker for the task of canonical morphological segmentation. Our model combines character-level information with features on the morpheme level and external information about words. It defines a new state of the art, improving over baseline models by up to .21 accuracy, 16 points F1 and .77 Levenshtein distance. We found that ≥ 94% of correct segmentations are in the sample set drawn from the encoder-decoder model, demonstrating that the upper bound on the performance of our reranker is quite high; in future work, we hope to develop models to exploit this.