From Characters to Words to in Between: Do We Capture Morphology?

Words can be represented by composing the representations of subword units such as word segments, characters, and/or character n-grams. While such representations are effective and may capture the morphological regularities of words, they have not been systematically compared, and it is not understood how they interact with different morphological typologies. On a language modeling task, we present experiments that systematically vary (1) the basic unit of representation, (2) the composition of these representations, and (3) the morphological typology of the language modeled. Our results extend previous findings that character representations are effective across typologies, and we find that a previously unstudied combination of character trigram representations composed with bi-LSTMs outperforms most others. But we also find room for improvement: none of the character-level models match the predictive accuracy of a model with access to true morphological analyses, even when learned from an order of magnitude more data.


Introduction
Continuous representations of words learned by neural networks are central to many NLP tasks (Cho et al., 2014; Chen and Manning, 2014). However, directly mapping a finite set of word types to a continuous representation has well-known limitations. First, it makes a closed vocabulary assumption, enabling only generic out-of-vocabulary handling. Second, it cannot exploit systematic functional relationships in learning. For example, cat and cats stand in the same relationship as dog and dogs. While this relationship might be discovered for these specific frequent words, it does not help us learn that the same relationship also holds for the much rarer words sloth and sloths. These functional relationships reflect the fact that words are composed from smaller units of meaning, or morphemes. For instance, cats consists of two morphemes, cat and -s, with the latter shared by the words dogs and tarsiers. Modeling this effect is crucial for languages with rich morphology, where vocabulary sizes are larger, many more words are rare, and many more such functional relationships exist. Hence, some models produce word representations as a function of subword units obtained from morphological segmentation or analysis (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell and Schütze, 2015). A downside of these models is that they depend on morphological segmenters or analyzers.
Morphemes typically have similar orthographic representations across words. For example, the morpheme -s is realized as -es in finches. Since this variation is limited, the general relationship between morphology and orthography can be exploited by composing the representations of characters (Kim et al., 2016), character n-grams (Sperr et al., 2013; Wieting et al., 2016; Bojanowski et al., 2016; Botha and Blunsom, 2014), bytes (Plank et al., 2016; Gillick et al., 2016), or combinations thereof (Santos and Zadrozny, 2014; Qiu et al., 2014). These models are compact, can represent rare and unknown words, and do not require morphological analyzers. They raise a provocative question: Does NLP benefit from models of morphology, or can they be replaced entirely by models of characters?
The relative merits of word, subword, and character-level models are not fully understood because each new model has been compared on different tasks and datasets, and often compared against word-level models. A number of questions remain open:
1. How do representations based on morphemes compare with those based on characters?
2. What is the best way to compose subword representations?
3. Do character-level models capture morphology in terms of predictive utility?
4. How do different representations interact with languages of different morphological typologies?
The last question is raised by Bender (2013): languages are typologically diverse, and the behavior of a model on one language may not generalize to others. Character-level models implicitly assume concatenative morphology, but many widely-spoken languages feature nonconcatenative morphology, and it is unclear how such models will behave on these languages.
To answer these questions, we performed a systematic comparison across different models for the simple and ubiquitous task of language modeling. We present experiments that vary (1) the type of subword unit; (2) the composition function; and (3) morphological typology. To understand the extent to which character-level models capture true morphological regularities, we present oracle experiments using human morphological annotations instead of automatic morphological segments. Our results show that:
1. For most languages, character-level representations outperform the standard word representations. Most interestingly, a previously unstudied combination of character trigrams composed with bi-LSTMs performs best on the majority of languages.
2. Bi-LSTMs and CNNs are more effective composition functions than addition.
3. Character-level models learn functional relationships between orthographically similar words, but don't (yet) match the predictive accuracy of models with access to true morphological analyses.
4. Character-level models are effective across a range of morphological typologies, but orthography influences their effectiveness.

Morphological Typology
A morpheme is the smallest unit of meaning in a word. Some morphemes express core meaning (roots), while others express one or more dependent features of the core meaning, such as person, gender, or aspect. A morphological analysis identifies the lemma and features of a word. A morph is the surface realization of a morpheme (Morley, 2000), which may vary from word to word. These distinctions are shown in Table 1.
Morphological typology classifies languages based on the processes by which morphemes are composed to form words. While most languages will exhibit a variety of such processes, for any given language, some processes are much more frequent than others, and we will broadly identify our experimental languages with these processes.
When morphemes are combined sequentially, the morphology is concatenative. However, morphemes can also be composed by nonconcatenative processes.
We consider four broad categories of both concatenative and nonconcatenative processes in our experiments.
Fusional languages realize multiple features in a single concatenated morpheme. For example, English verbs can express number, person, and tense in a single morpheme:
    wanted (English)
    want + ed
    want + VB+1st+SG+Past
Agglutinative languages assign one feature per morpheme. Morphemes are concatenated to form a word and the morpheme boundaries are clear. For example (Haspelmath, 2010):
    okursam (Turkish)
    oku+r+sa+m
    "read"+AOR+COND+1SG
Root and Pattern Morphology forms words by inserting consonants and vowels of dependent morphemes into a consonantal root based on a given pattern. For example, the Arabic root ktb ("write") produces (Roark and Sproat, 2007):
    katab "wrote" (Arabic)
    takaatab "wrote to each other" (Arabic)
Reduplication is a process where a word form is produced by repeating part or all of the root to express new features. For example:
    anak "child" (Indonesian)
    anak-anak "children" (Indonesian)
    buah "fruit" (Indonesian)
    buah-buahan "various fruits" (Indonesian)

Representation Models
We compare ten different models, varying subword units and composition functions that have commonly been used in recent work, but evaluated on various different tasks (Table 2). Given word $w$, we compute its representation $\mathbf{w}$ as:

$$\mathbf{w} = f(\mathbf{W}_s, \sigma(w)) \tag{1}$$

where $\sigma$ is a deterministic function that returns a sequence of subword units; $\mathbf{W}_s$ is a parameter matrix of representations for the vocabulary of subword units; and $f$ is a composition function which takes $\sigma(w)$ and $\mathbf{W}_s$ as input and returns $\mathbf{w}$. All of the representations that we consider take this form, varying only in $f$ and $\sigma$.

Subword Units
We consider four variants of σ in Equation 1, each returning a different type of subword unit: character, character trigram, or one of two types of morph. Morphs are obtained from Morfessor (Smit et al., 2014) or a word segmentation based on Byte Pair Encoding (BPE; Gage (1994)), which has been shown to be effective for handling rare words in neural machine translation (Sennrich et al., 2016). BPE works by iteratively replacing frequent pairs of characters with a single unused character. For Morfessor, we use default parameters while for BPE we set the number of merge operations to 10,000 (see footnote 1). When we segment into character trigrams, we consider all trigrams in the word, including those covering notional beginning and end of word characters, as in Sperr et al. (2013). Example output of σ is shown in Table 3.
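The segmentation functions described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the boundary symbols `^` and `$`, the function names, and the toy word counts are all assumptions, and the BPE learner is a simplified version of the merge procedure (Morfessor is not shown).

```python
from collections import Counter

BOW, EOW = "^", "$"  # hypothetical boundary symbols, chosen for illustration


def char_units(word):
    """Return the word as a sequence of single characters."""
    return list(word)


def char_trigrams(word):
    """Return all character trigrams, including those covering the
    notional beginning-of-word and end-of-word characters."""
    padded = BOW + word + EOW
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dict."""
    # Every word starts as its characters plus an end-of-word marker.
    vocab = {tuple(w) + (EOW,): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so that
        # the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + (EOW,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy data
toy_merges = learn_bpe(toy_counts, num_merges=10)
toy_segments = segment_bpe("lowest", toy_merges)
```

The number of merges plays the role of the single BPE parameter mentioned above; with very few merges the segments stay close to characters, and with many merges frequent words become single units.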

Composition Functions
We use three variants of f in Eq. 1. The first constructs the representation $\mathbf{w}$ of word $w$ by adding the representations of its subwords $s_1, \ldots, s_n = \sigma(w)$, where the representation of $s_i$ is vector $\mathbf{s}_i$:

$$\mathbf{w} = \sum_{i=1}^{n} \mathbf{s}_i \tag{2}$$

Footnote 1: BPE takes a single parameter: the number of merge operations. We tried different parameter values (1k, 10k, 100k) and manually examined the resulting segmentation on the English dataset. Qualitatively, 10k gave the most plausible segmentation and we used this setting across all languages.
The only subword unit that we don't compose by addition is characters: addition ignores order, so it would produce the same representation for many different words. Our second composition function is a bidirectional long short-term memory (bi-LSTM), which we adapt based on its use in the character-level model of Ling et al. (2015) and its widespread use in NLP generally. Given $\mathbf{s}_i$ and the previous LSTM hidden state $\mathbf{h}_{i-1}$, an LSTM (Hochreiter and Schmidhuber, 1997) computes the following outputs for the subword at position $i$:

$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{s}_i, \mathbf{h}_{i-1}) \tag{3}$$

$$\hat{s}_{i+1} = g(\mathbf{V}^{\top} \mathbf{h}_i) \tag{4}$$

where $\hat{s}_{i+1}$ is the predicted target subword, $g$ is the softmax function and $\mathbf{V}$ is a weight matrix. A bi-LSTM (Graves et al., 2005) combines the final state of an LSTM over the input sequence with one over the reversed input sequence. Given the hidden state produced from the final input of the forward LSTM, $\mathbf{h}^{fw}_n$, and the hidden state produced from the final input of the backward LSTM, $\mathbf{h}^{bw}_0$, we compute the word representation as:

$$\mathbf{w} = \mathbf{W}_f \mathbf{h}^{fw}_n + \mathbf{W}_b \mathbf{h}^{bw}_0 + \mathbf{b} \tag{5}$$

where $\mathbf{W}_f$, $\mathbf{W}_b$, and $\mathbf{b}$ are parameter matrices and $\mathbf{h}^{fw}_n$ and $\mathbf{h}^{bw}_0$ are forward and backward LSTM states, respectively.
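The reason characters are not composed by addition can be checked directly: a sum does not depend on the order of its terms, so any two anagrams receive the same vector. The sketch below illustrates this with random toy embeddings; the dimension, the boundary symbols, and all names are assumptions made for the example, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy dimension for illustration


def build_table(units):
    """Assign a random vector to every distinct subword unit
    (a stand-in for the learned subword embedding matrix)."""
    return {u: rng.normal(size=DIM) for u in sorted(set(units))}


def compose_add(units, table):
    """Word vector as the sum of its subword vectors."""
    return np.sum([table[u] for u in units], axis=0)


def trigrams(word):
    """Character trigrams with hypothetical boundary symbols ^ and $."""
    padded = "^" + word + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# "listen" and "silent" contain exactly the same characters.
char_table = build_table("listen" + "silent")
tri_table = build_table(trigrams("listen") + trigrams("silent"))

same_with_chars = np.allclose(
    compose_add(list("listen"), char_table),
    compose_add(list("silent"), char_table),
)
same_with_trigrams = np.allclose(
    compose_add(trigrams("listen"), tri_table),
    compose_add(trigrams("silent"), tri_table),
)
```

With character units the two words collapse to one vector, whereas trigram units retain some local order information and keep them apart. An order-sensitive composition function such as a bi-LSTM avoids the collision even for single characters.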
The third composition function is a convolutional neural network (CNN) with highway layers, as in Kim et al. (2016). Let $c_1, \ldots, c_k$ be the sequence of characters of word $w$. The character embedding matrix is $\mathbf{C} \in \mathbb{R}^{d \times k}$, where the $i$-th column corresponds to the embedding of $c_i$. We first apply a narrow convolution between $\mathbf{C}$ and a filter $\mathbf{F} \in \mathbb{R}^{d \times n}$ of width $n$ to obtain a feature map $f \in \mathbb{R}^{k-n+1}$. In particular, the computation of the $j$-th element of $f$ is defined as

$$f[j] = \tanh(\langle \mathbf{C}[*, j:j+n-1], \mathbf{F} \rangle + b) \tag{6}$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^{\top})$ is the Frobenius inner product and $b$ is a bias. The CNN model applies filters of varying width, representing features of character n-grams. We then calculate the max-over-time of each feature map,

$$y_i = \max_j f_i[j] \tag{7}$$

where $f_i$ is the feature map of the $i$-th filter, and concatenate them to derive the word representation $\mathbf{w}_t = [y_1, \ldots, y_m]$, where $m$ is the number of filters applied. Highway layers allow some dimensions of $\mathbf{w}_t$ to be carried or transformed. Since it can learn character n-grams directly, we only use the CNN with character input.

Table 2
Models | Subword Unit(s) | Composition Function
Sperr et al. (2013) | words, character n-grams | addition
Luong et al. (2013) | morphs (Morfessor) | recursive NN
Botha and Blunsom (2014) | words, morphs (Morfessor) | addition
Qiu et al. (2014) | words, morphs (Morfessor) | addition
Santos and Zadrozny (2014) | words, characters | CNN
Cotterell and Schütze (2015) | words, morphological analyses | addition
Sennrich et al. (2016) | morphs (BPE) | none
Kim et al. (2016) | characters | CNN
Ling et al. (2015) | characters | bi-LSTM
Wieting et al. (2016) | character n-grams | addition
Bojanowski et al. (2016) | character n-grams | addition
Vylomova et al. (2016) | characters, morphs (Morfessor) | bi-LSTM, CNN
Miyamoto and Cho (2016) | words, characters | bi-LSTM
Rei et al. (2016) | words, characters | bi-LSTM
Lee et al. (2016) | characters | CNN
Kann and Schütze (2016) | characters, morphological analyses | none
Heigold et al. (2017) | words, characters | bi-LSTM, CNN
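The narrow convolution and max-over-time pooling described above can be sketched with NumPy. This is an illustrative sketch only: the tanh nonlinearity is an assumption following Kim et al. (2016), the highway layers are omitted, and the sizes and function names are made up for the example.

```python
import numpy as np


def narrow_conv(C, F, b=0.0):
    """Feature map of one filter F (d x n) over a character matrix C (d x k).

    Each entry is the Frobenius inner product of F with a width-n window of C,
    plus a bias, passed through tanh (assumed nonlinearity).
    The result has length k - n + 1, so the word needs at least n characters.
    """
    k = C.shape[1]
    n = F.shape[1]
    return np.array([
        np.tanh(np.sum(C[:, j:j + n] * F) + b)
        for j in range(k - n + 1)
    ])


def cnn_word_vector(C, filters, biases):
    """Take the max-over-time of each feature map and concatenate the results."""
    return np.array([narrow_conv(C, F, b).max() for F, b in zip(filters, biases)])


rng = np.random.default_rng(0)
d, k = 4, 6                                   # toy sizes: 4-dim embeddings, 6 characters
C = rng.normal(size=(d, k))                   # one column per character
filters = [rng.normal(size=(d, n)) for n in (1, 2, 3)]  # filters of varying width
biases = [0.0, 0.0, 0.0]
word_vec = cnn_word_vector(C, filters, biases)  # one value per filter
```

Because each filter spans n adjacent characters, a width-n filter acts as a soft detector for character n-grams, which is why the CNN is applied to character input only.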

Language Model
We use language models (LM) because they are simple and fundamental to many NLP applications. Given a sequence of text $s = w_1, \ldots, w_T$, our LM computes the probability of $s$ as:

$$P(s) = \prod_{t=1}^{T} P(y_t \mid w_1, \ldots, w_{t-1}) \tag{8}$$

where $y_t = w_t$ if $w_t$ is in the output vocabulary and $y_t = \mathrm{UNK}$ otherwise. Our language model is an LSTM variant of the recurrent neural network (RNN) LM (Mikolov et al., 2010). At time step $t$, it receives input $w_t$ and predicts $y_{t+1}$. Using Eq. 1, it first computes representation $\mathbf{w}_t$ of $w_t$. Given this representation and previous state $\mathbf{h}_{t-1}$, it produces a new state $\mathbf{h}_t$ and predicts $y_{t+1}$:

$$\mathbf{h}_t = \mathrm{LSTM}(\mathbf{w}_t, \mathbf{h}_{t-1}) \tag{9}$$

$$\hat{y}_{t+1} = g(\mathbf{V}^{\top} \mathbf{h}_t) \tag{10}$$

where $g$ is a softmax function over the vocabulary yielding the probability in Equation 8. Note that this design means that we can predict only words from a finite output vocabulary, so our models differ only in their representation of context words. This design makes it possible to compare language models using perplexity, since they have the same event space, though open vocabulary word prediction is an interesting direction for future work. The complete architecture of our system is shown in Figure 1, showing segmentation function σ and composition function f from Equation 1.
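To make the shared event space concrete, the sketch below maps target words onto a finite output vocabulary and computes perplexity from the probabilities a model assigns to those targets. It assumes the standard definition of perplexity as the exponential of the average negative log-probability; the token name `UNK`, the function names, and the toy numbers are hypothetical.

```python
import math

UNK = "UNK"  # hypothetical name for the unknown-word token


def to_targets(words, output_vocab):
    """Map each word to itself if it is in the output vocabulary, else to UNK."""
    return [w if w in output_vocab else UNK for w in words]


def sequence_log_prob(target_probs):
    """Log-probability of a sequence as the sum of per-step log-probabilities."""
    return sum(math.log(p) for p in target_probs)


def perplexity(target_probs):
    """Exponential of the average negative log-probability per predicted token."""
    return math.exp(-sequence_log_prob(target_probs) / len(target_probs))


# Toy example: the rare word is predicted as UNK, whatever the input representation.
targets = to_targets(["the", "sloth", "sleeps"], output_vocab={"the", "sleeps"})
toy_probs = [0.25, 0.25, 0.25]      # made-up model probabilities for each target
toy_ppl = perplexity(toy_probs)
```

Since every model predicts over the same output vocabulary, perplexities computed this way are directly comparable even though the models differ in how they represent the input words.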

Experiments
We perform experiments on ten languages (Table 4). We use datasets from Ling et al. (2015) for English and Turkish. For Czech and Russian we use Universal Dependencies (UD) v1.3 (Nivre et al., 2015). For other languages, we use preprocessed Wikipedia data (Al-Rfou et al., 2013; see footnote 2). For each dataset, we use approximately 1.2M tokens to train, and approximately 150K tokens each for development and testing. Preprocessing involves lowercasing (except for character models) and removing hyperlinks.
To ensure that we compared models and not implementations, we reimplemented all models in a single framework using TensorFlow (Abadi et al., 2015; see footnote 3). We use a common setup for all experiments based on that of Ling et al. (2015), Kim et al. (2016), and Miyamoto and Cho (2016). In preliminary experiments, we confirmed that our models produced similar patterns of perplexities for the reimplemented word and character LSTM models of Ling et al. (2015). Even following detailed discussion with Ling (p.c.), we were unable to reproduce their perplexities exactly (our English reimplementation gives lower perplexities; our Turkish higher), but we do reproduce their general result that character bi-LSTMs outperform word models. We suspect that different preprocessing and the stochastic learning explain the differences in perplexities. Our final model with bi-LSTM composition follows Miyamoto and Cho (2016) as it gives us the same perplexity results for our preliminary experiments on the Penn Treebank dataset (Marcus et al., 1993), preprocessed by Mikolov et al. (2010).

Footnote 2: The Arabic and Hebrew datasets are unvocalized. Japanese mixes Kanji, Katakana, Hiragana, and Latin characters (for foreign words). Hence, a Japanese character can correspond to a character, syllable, or word. The preprocessed dataset is already word-segmented.

Footnote 3: Our implementation of these models can be found at https://github.com/claravania/subword-lstm-lm

Training and Evaluation
Our LSTM-LM uses two hidden layers with 200 hidden units, and representation vectors for words, characters, and morphs all have dimension 200. All parameters are initialized uniformly at random from -0.1 to 0.1 and trained by stochastic gradient descent with a mini-batch size of 32 and 20 time steps, for 50 epochs. To avoid overfitting, we apply dropout with probability 0.5 on the input-to-hidden layer and all of the LSTM cells (including those in the bi-LSTM, if used). For all models which do not use bi-LSTM composition, we start with a learning rate of 1.0 and decrease it by half if the validation perplexity does not decrease by 0.1 after 3 epochs. For models with bi-LSTM composition, we use a constant learning rate of 0.2 and stop training when validation perplexity does not improve after 3 epochs. For the character CNN model, we use the same settings as the small model of Kim et al. (2016).
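The two learning-rate regimes above can be sketched as small functions over a list of per-epoch validation perplexities. The text leaves some details open, so this sketch makes assumptions that should be read as one plausible interpretation: improvement is measured against the best perplexity seen so far, and "after 3 epochs" is treated as three consecutive epochs without sufficient improvement. All names are hypothetical.

```python
def halve_on_plateau(val_perplexities, lr=1.0, min_delta=0.1, patience=3):
    """Return the learning rate used in each epoch.

    The rate is halved whenever validation perplexity has failed to drop by at
    least `min_delta` below the best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    rates = []
    for ppl in val_perplexities:
        rates.append(lr)            # rate in effect during this epoch
        if best - ppl >= min_delta:
            best, stale = ppl, 0    # sufficient improvement: reset the counter
        else:
            stale += 1
            if stale == patience:
                lr, stale = lr / 2, 0
    return rates


def early_stop_epoch(val_perplexities, patience=3):
    """Return the index of the epoch after which training would stop.

    Training stops once validation perplexity has not improved for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    stale = 0
    for epoch, ppl in enumerate(val_perplexities):
        if ppl < best:
            best, stale = ppl, 0
        else:
            stale += 1
            if stale == patience:
                return epoch
    return len(val_perplexities) - 1


toy_rates = halve_on_plateau([100, 90, 90, 90, 90, 80])
toy_stop = early_stop_epoch([50, 40, 41, 42, 43, 30])
```

Under these assumptions the first schedule applies to models without bi-LSTM composition (starting rate 1.0), while the second corresponds to the constant-rate setting with early stopping.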
To make our results comparable to Ling et al. (2015), for each language we limit the output vocabulary to the most frequent 5,000 training words plus an unknown word token. To learn to predict unknown words, we follow Ling et al. (2015): in training, words that occur only once are stochastically replaced with the unknown token with probability 0.5. To evaluate the models, we compute perplexity on the test data.

Table 5 presents our main results. In six of ten languages, character-trigram representations composed with bi-LSTMs achieve the lowest perplexities. As far as we know, this particular model has not been tested before, though it is similar to (but more general than) the model of Sperr et al. (2013). Models based on characters, character trigrams, and BPE are all very competitive. Composition by bi-LSTMs or CNN is more effective than addition, except for Turkish. We also observe that BPE always outperforms Morfessor, even for the agglutinative languages.
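The output-vocabulary limit and the stochastic replacement of singleton words described in the training setup can be sketched as follows. This is a minimal sketch under assumptions: the token name `UNK`, the function names, and the fixed random seed are hypothetical, and the text does not say whether replacement is resampled every epoch, so a single pass is shown.

```python
import random
from collections import Counter

UNK = "UNK"  # hypothetical name for the unknown-word token


def build_output_vocab(train_tokens, size=5000):
    """Keep the `size` most frequent training words, plus the unknown token."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(size)} | {UNK}


def replace_singletons(train_tokens, p=0.5, seed=0):
    """Replace each word that occurs exactly once with UNK with probability p."""
    rng = random.Random(seed)
    counts = Counter(train_tokens)
    return [
        UNK if counts[w] == 1 and rng.random() < p else w
        for w in train_tokens
    ]


toy_tokens = ["a", "a", "b", "c", "a", "b"]          # toy training data
toy_vocab = build_output_vocab(toy_tokens, size=2)    # "c" falls outside the vocabulary
toy_all_replaced = replace_singletons(toy_tokens, p=1.0)
toy_none_replaced = replace_singletons(toy_tokens, p=0.0)
```

Replacing some singletons during training gives the model examples of the unknown token in realistic contexts, so that it can assign it sensible probability at test time.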

Results and Analysis
We now turn to a more detailed analysis by morphological typology.
Fusional languages. For these languages, character trigrams composed with bi-LSTMs outperformed all other models, particularly for Czech and Russian (up to 20%), which is unsurprising since both are morphologically richer than English.
Agglutinative languages. We observe different results for each language. For Finnish, character trigrams composed with bi-LSTMs achieve the best perplexity. Surprisingly, for Turkish character trigrams composed via addition is best and addition also performs quite well for other representations, which is potentially useful since the addition function is simpler and faster than bi-LSTMs. We suspect that this is due to the fact that Turkish morphemes are reasonably short, hence well approximated by character trigrams. For Japanese, improvements from character models are more modest than in other languages.
Root and Pattern. For these languages, character trigrams composed with bi-LSTMs also achieve the best perplexity.
We had wondered whether CNNs would be more effective for root-and-pattern morphology, but since these data are unvocalized, it is more likely that nonconcatenative effects are minimized, though we do still find morphological variants with consonantal inflections that behave more like concatenation. For example, maktab (root:ktb) is written as mktb. We suspect this makes character trigrams quite effective since they match the tri-consonantal root patterns among words which share the same root.
Reduplication. For Indonesian, BPE morphs composed with bi-LSTMs obtain the best perplexity. For Malay, the character CNN outperforms other models. However, these improvements are small compared to other languages. This likely reflects that Indonesian and Malay are only moderately inflected, where inflection involves both concatenative and non-concatenative processes.

Effects of Morphological Analysis
In the experiments above, we used unsupervised morphological segmentation as a proxy for morphological analysis (Table 3). However, as discussed in Section 2, this is quite approximate, so it is natural to wonder what would happen if we had the true morphological analysis. If character-level models are powerful enough to capture the effects of morphology, then they should have the predictive accuracy of a model with access to this analysis. To find out, we conducted an oracle experiment using the human-annotated morphological analyses provided in the UD datasets for Czech and Russian, the only languages in our set for which these analyses were available. In these experiments we treat the lemma and each morphological feature as a subword unit.
Models using these morphological analyses outperform all other models for both languages. These results demonstrate that neither character representations nor unsupervised segmentation is a perfect replacement for manual morphological analysis, at least in terms of predictive accuracy. In light of character-level results, they imply that current unsupervised morphological analyzers are poor substitutes for real morphological analysis. However, we can obtain much more unannotated than annotated data, and we might guess that the character-level models would outperform those based on morphological analyses if trained on larger data. To test this, we ran experiments that varied the training data size on three representation models: word, character-trigram bi-LSTM, and character CNN. Since we want to see how much training data is needed to reach the perplexity obtained using annotated data, we use the same output vocabulary derived from the original training data. While this makes it possible to compare perplexities across models, it is unfavorable to the models trained on larger data, which may focus on other words. This is a limitation of our experimental setup, but does allow us to draw some tentative conclusions. As shown in Table 7, a character-level model trained on an order of magnitude more data still does not match the predictive accuracy of a model with access to morphological analysis.

Automatic Morphological Analysis
The oracle experiments show promising results if we have annotated data. But these annotations are expensive, so we also investigated the use of automatic morphological analysis. We obtained analyses for Arabic with MADAMIRA (Pasha et al., 2014). As in the experiment using annotations, we treated each morphological feature as a subword unit. The resulting perplexities of 71.94 and 42.85 for addition and bi-LSTMs, respectively, are worse than those obtained with character trigrams (39.87), though the bi-LSTM result approaches the best perplexities.

Table 7: Perplexity results on the Czech development data, varying training data size. Perplexity using ~1M tokens of annotated data is 28.83.

Targeted Perplexity Results
A difficulty in interpreting the results of Table 5 with respect to specific morphological processes is that perplexity is measured for all words. But these processes do not apply to all words, so it may be that the effects of specific morphological processes are washed out. To get a clearer picture, we measured perplexity for only specific subsets of words in our test data: specifically, given target word w i , we measure perplexity of word w i+1 .
In other words, we analyze the perplexities when the inflected words of interest are in the most recent history, exploiting the recency bias of our LSTM-LM. This is the perplexity most likely to be strongly affected by different representations, since we do not vary representations of the predicted word itself. We look at several cases: nouns and verbs in Czech and Russian, where word classes can be identified from annotations, and reduplication in Indonesian, which we can identify mostly automatically. For each analysis, we also distinguish between frequent cases, where the inflected word occurs more than ten times in the training data, and rare cases, where it occurs fewer than ten times. We compare only bi-LSTM models.
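The targeted measurement described above can be sketched as a filter over positions in the test data: whenever the word at position i is a word of interest, the probability the model assigned to the word at position i+1 is collected, split by the training frequency of the word at position i. The sketch below is illustrative only; the function and variable names are hypothetical, the probabilities are made up, and words occurring exactly ten times are placed in the rare bucket as an assumption.

```python
import math
from collections import Counter


def targeted_perplexity(tokens, next_word_probs, is_target, train_counts, threshold=10):
    """Perplexity of the word following each target word, by frequency bucket.

    tokens[i] is the history word and next_word_probs[i] is the probability
    the model assigned to tokens[i + 1].
    """
    buckets = {"frequent": [], "rare": []}
    for i in range(len(tokens) - 1):
        w = tokens[i]
        if not is_target(w):
            continue
        key = "frequent" if train_counts[w] > threshold else "rare"
        buckets[key].append(math.log(next_word_probs[i]))
    # Average negative log-probability per bucket, exponentiated.
    return {
        k: math.exp(-sum(v) / len(v)) if v else None
        for k, v in buckets.items()
    }


# Toy example in which the words of interest are plural nouns ending in "s".
toy_test = ["cats", "sleep", "sloths", "sleep"]
toy_next_probs = [0.5, 0.1, 0.25]                  # made-up model probabilities
toy_train_counts = Counter({"cats": 50, "sloths": 2})
toy_result = targeted_perplexity(
    toy_test, toy_next_probs, lambda w: w.endswith("s"), toy_train_counts
)
```

In the experiments the predicate would come from annotations (for nouns and verbs) or from a surface pattern (for reduplication), rather than from a suffix test as in this toy example.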
For Czech and Russian, we again use the UD annotation to identify words of interest. The results (Table 8) show that manual morphological analysis uniformly outperforms other subword models, with an especially strong effect for Czech nouns, suggesting that other models do not capture useful predictive properties of a morphological analysis. We do however note that character trigrams achieve low perplexities in most cases, similar to overall results (Table 5). We also observe that the subword models are more effective for rare words.

Table 8: Average perplexities of words that occur after nouns and verbs. Frequent words occur more than ten times in the training data; rare words occur fewer times than this. The best perplexity is in bold while the second best is underlined.
For Indonesian, we exploit the fact that the hyphen symbol '-' typically separates the first and second occurrence of a reduplicated morpheme, as in the examples of Section 2. We use the presence of word tokens containing hyphens to estimate the percentage of those exhibiting reduplication. As shown in Table 9, the numbers are quite low. Table 10 shows results for reduplication. In contrast with the overall results, the BPE bi-LSTM model has the worst perplexities, while character bi-LSTM has the best, suggesting that these models are more effective for reduplication.
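The hyphen-based estimate described above can be sketched as a simple count over word types and word tokens. The function names and the toy token list are hypothetical, and, as the text notes, a hyphen is only a proxy: the second helper, which checks for exact repetition, is an extra illustration and not part of the described procedure.

```python
from collections import Counter


def hyphen_rates(tokens):
    """Return (type-level %, token-level %) of words that contain a hyphen."""
    counts = Counter(tokens)
    hyphenated = [w for w in counts if "-" in w]
    type_pct = 100.0 * len(hyphenated) / len(counts)
    token_pct = 100.0 * sum(counts[w] for w in hyphenated) / len(tokens)
    return type_pct, token_pct


def is_full_reduplication(word):
    """True if the word is exactly two identical, non-empty parts joined by a hyphen."""
    parts = word.split("-")
    return len(parts) == 2 and parts[0] != "" and parts[0] == parts[1]


toy_corpus = ["anak-anak", "anak", "buah", "anak-anak"]   # toy data
toy_type_pct, toy_token_pct = hyphen_rates(toy_corpus)
```

Forms such as buah-buahan or berlembah-lembah contain a hyphen but are not exact repetitions, so the stricter check would miss them while the hyphen proxy still counts them.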
Looking more closely at BPE segmentation of reduplicated words, we found that only 6 of 252 reduplicated words have a correct word segmentation, with the reduplicated morpheme often combining differently with the notional start-of-word or hyphen character. On the other hand, BPE correctly learns 8 out of 9 Indonesian prefixes and 4 out of 7 Indonesian suffixes (see footnote 5). This analysis supports our intuition that the improvement from BPE might come from its modeling of concatenative morphology.

Footnote 5: We use Indonesian affixes listed in Larasati et al. (2011).

Table 9
Language | type-level (%) | token-level (%)
Indonesian | 1.10 | 2.60
Malay | 1.29 | 2.89

Table 11 presents nearest neighbors under cosine similarity for in-vocabulary, rare, and out-of-vocabulary (OOV) words. For frequent words, standard word embeddings are clearly superior for lexical meaning. Character and morph representations tend to find words that are orthographically similar, suggesting that they are better at modeling dependent than root morphemes. The same pattern holds for rare and OOV words. We suspect that the subword models outperform words on language modeling because they exploit affixes to signal word class. We also noticed similar patterns in Japanese.

We analyze reduplication by querying reduplicated words to find their nearest neighbors using the BPE bi-LSTM model. If the model were sensitive to reduplication, we would expect to see morphological variants of the query word among its nearest neighbors. However, from Table 12, this is not so. With the partially reduplicated query berlembah-lembah, we do not find the lemma lembah.
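The nearest-neighbor queries used in this analysis can be sketched as a cosine-similarity ranking over word vectors. The sketch assumes the vectors have already been computed by whichever representation model is being inspected; the function name, the toy words, and the toy vectors are hypothetical.

```python
import numpy as np


def nearest_neighbors(query, words, vectors, k=5):
    """Return the k words whose vectors are most cosine-similar to the query's.

    `words` is a list and `vectors` a (len(words) x dim) array with one row per
    word; rows are assumed to be non-zero so that they can be normalized.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(query)]   # cosine similarity to every word
    order = np.argsort(-sims)                # most similar first
    return [words[i] for i in order if words[i] != query][:k]


toy_words = ["kota", "kota-kota", "lembah"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_neighbors = nearest_neighbors("kota", toy_words, toy_vectors, k=2)
```

For rare and OOV queries, a subword model can still produce a vector by composing the query's subword units, which is what makes this kind of comparison possible beyond the word vocabulary.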

Conclusion
We presented a systematic comparison of word representation models with different levels of morphological awareness, across languages with different morphological typologies. Our results confirm previous findings that character-level models are effective for many languages, but these models do not match the predictive accuracy of a model with explicit knowledge of morphology, even after we increase the training data size tenfold. Moreover, our qualitative analysis suggests that they learn orthographic similarity of affixes, and lose the meaning of root morphemes.
Although morphological analyses are available in limited quantities, our results suggest that there might be utility in semi-supervised learning from partially annotated data. Across languages with different typologies, our experiments show that the subword unit models are most effective on agglutinative languages. However, these results do not generalize to all languages, since factors such as morphology and orthography affect the utility of these representations. We plan to explore these effects in future work.

Table 11: Nearest neighbours of semantically and syntactically similar words.

Table 12
Query | Top nearest neighbours
kota-kota (cities) | wilayah-wilayah (areas), pulau-pulau (islands), negara-negara (countries), bahasa-bahasa (languages), koloni-koloni (colonies)
berlembah-lembah (have many valleys) | berargumentasi (argue), bercakap-cakap (converse), berkemauan (will), berimplikasi (imply), berketebalan (have a thickness)