Character-Aware Neural Morphological Disambiguation

We develop a language-independent, deep learning-based approach to the task of morphological disambiguation. Guided by the intuition that the correct analysis should be “most similar” to the context, we propose dense representations for morphological analyses and surface context and a simple yet effective way of combining the two to perform disambiguation. Our approach improves on the language-dependent state of the art for two agglutinative languages (Turkish and Kazakh) and can be potentially applied to other morphologically complex languages.


Introduction
Morphological disambiguation (MD) is a long-standing problem in processing morphologically complex languages (MCL). POS tagging is a somewhat related problem; in MD, however, in addition to POS tags one typically has to predict the lemmata (roots hereinafter) that surface forms stem from and the morphemes¹ they bear. For example, depending on the context, a Turkish word adam can be analyzed as: (i) a man - [adam] (Hakkani-Tür et al., 2002). Thus, if one counts analyses as tags, MD can be cast as a tagging problem with an extremely large tagset. This fact discourages direct application of state-of-the-art approaches designed for small fixed tagsets.
To develop a language-independent dense representation of the analyses, we segment² an analysis into (i) the root, (ii) its POS and (iii) the morpheme chain (MC). We then proceed to jointly learn the embeddings for the root and the POS segments and to combine them and the MC segment representation into a single dense representation. MC segments are represented as binary vectors that, for a given analysis, encode the presence or absence of each morpheme found in the training set. This ensures language independence and contrasts with previous work (at least on Turkish and Kazakh), where only certain morphemes are chosen as features depending on their position (Assylbekov et al., 2016; Hakkani-Tür et al., 2002) or presence (Makhambetov et al., 2015) in an analysis, or on the authors' intuition (Yildiz et al., 2016; Tolegen et al., 2016; Sak et al., 2007).
¹ We use the term morpheme for its universal recognition within the community. A more appropriate term might be grammeme, i.e. a value of a grammatical category.
² Such a segmentation is denoted by the square brackets numbered in the respective order (cf. the Turkish example).
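The binary morpheme-chain encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the tag names in the example are hypothetical, and ignoring tags unseen in training is an assumption of the sketch.

```python
from typing import Dict, Iterable, List


def build_morpheme_index(train_chains: Iterable[List[str]]) -> Dict[str, int]:
    """Map every morpheme tag seen in the training chains to a vector position."""
    index: Dict[str, int] = {}
    for chain in train_chains:
        for tag in chain:
            index.setdefault(tag, len(index))
    return index


def encode_chain(chain: List[str], index: Dict[str, int]) -> List[int]:
    """Return a {0,1} vector marking which known morphemes occur in the chain."""
    vec = [0] * len(index)
    for tag in chain:
        if tag in index:  # tags unseen in training are ignored (assumption)
            vec[index[tag]] = 1
    return vec


# Hypothetical tag names, for illustration only.
train_chains = [["A3sg", "Pnon", "Nom"], ["A3sg", "P1sg", "Nom"]]
index = build_morpheme_index(train_chains)
m_k = encode_chain(["A3sg", "P1sg", "Acc"], index)
```

Because the vector length equals the number of distinct tags in the training data, no language-specific choice of tags is needed.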
Apart from the sparseness of the distribution of analyses, MCL notoriously raise free word order and long-range dependency issues. Thus, decoding analysis sequences using only the leftmost context may not be enough. To address this we leverage the rightmost context as well. We model the left- and rightmost surface context in two ways: (i) using a BiLSTM (Greff et al., 2015) with a character-based sub-layer (Ling et al., 2015) and (ii) using a feed-forward network on word embeddings. We then entertain the idea that, given a word with multiple analyses and its surface context, the correct analysis might be "closer" to the context. Following this intuition, we experimented with computing the distance between the analysis and the context representations, and a simple dot product (as in unnormalized cosine similarity) yielded the best performance.
We evaluate our approach on Turkish and Kazakh data sets, using several baselines (including the state of the art methods for both languages) and a variety of settings and metrics. In terms of general accuracy our approach has achieved a nearly 1% improvement over the state of the art for Turkish and a marginal improvement for Kazakh.
Our contribution amounts to the following: (i) a general MD framework for MCL that can be analyzed in <root, POS, MC> triplets; (ii) improvement on language-dependent state of the art for Turkish and Kazakh.

Models
In this section we describe our approach to encoding morphological analyses and the context into the embeddings and combining them to perform morphological disambiguation.

Morphological Representation
We treat a morphological analysis as a combination of three main constituents: the root, its POS and the morpheme chain. These constituents are represented as $d_r$-, $d_p$- and $d_m$-dimensional vectors respectively. The former two vectors correspond to dense word embeddings (Collobert et al., 2011), and the latter is a binary vector which encodes the presence of a certain morpheme in the chain. The size of the binary vector, $d_m$, is equal to the size of the morpheme dictionary obtained from the data.
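Given a sentence and the j-th surface word form with N analyses, we represent the k-th analysis as a $d_h$-dimensional embedding $A_j^k$ computed by equation (1), where $A_j \in \mathbb{R}^{d_h \times |N|}$ collects the embeddings of all N analyses, $d_h$ is the dimension of each analysis embedding, $r_k \in \mathbb{R}^{d_r \times 1}$, $p_k \in \mathbb{R}^{d_p \times 1}$, $m_k \in \{0, 1\}^{d_m \times 1}$ are the constituent vectors of the k-th analysis, and $W_r \in \mathbb{R}^{d_h \times d_r}$, $W_p \in \mathbb{R}^{d_h \times d_p}$, $W_m \in \mathbb{R}^{d_h \times d_m}$ are the model parameters. The bias term was left out for clarity. This representation is shown in Figure 1 (bottom).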
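As a hedged sketch, the parameter shapes given above are consistent with a projection of the following form. The tanh nonlinearity is an assumption made here for illustration; only the three projection matrices and the omitted bias are stated in the text.

```latex
A_j^k = \tanh\left( W_r r_k + W_p p_k + W_m m_k \right)
```

Each term maps its constituent into $\mathbb{R}^{d_h}$, so the three can be summed regardless of the individual sizes $d_r$, $d_p$ and $d_m$.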

Recurrent Neural Disambiguation
The model architecture is shown in Figure 1. It consists of two main blocks that learn the surface context (top) and the morphological analysis representations (bottom).
Figure 1: Model architecture.
When it comes to modeling context via word embeddings for morphologically complex languages, it is impractical to store vectors for all words, since most words in such languages have a large number of surface realizations. Our solution to this problem is to construct a surface word representation from characters, which not only reduces data sparseness but also helps in dealing with out-of-vocabulary (OOV) words (Ling et al., 2015). We represent each character of each word as a vector $x_i \in \mathbb{R}^{d_c}$ and the entire embedding matrix as $E_c \in \mathbb{R}^{d_c \times |C|}$, where C is the character vocabulary extracted from the training set, including alphanumeric characters and other possible symbols. Given an input surface word $w_i$ with its character embeddings $x_1, ..., x_n$, the hidden state $h_t$ at time step t is computed via the standard vanilla LSTM calculations, in which σ(·) and tanh(·) are non-linear functions and $i_t$, $f_t$, $o_t$ are the three gates (input, forget and output) that control the information flow of the inputs. The parameters of the LSTM are $W_*$, $U_*$, $b_*$, where * can be any of {i, f, o, g}. The peephole connections were left out for clarity. We use both a forward and a backward LSTM to learn word representations, obtained by concatenating the last states in both directions, $h_f$ and $h_b$, i.e. $h_w = [h_f; h_b]$. Character-based word embeddings obtained in this manner do not yet contain context information at the sentence level. Thus, we adopt another LSTM to learn context-sensitive information for each word in both directions.
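For reference, the textbook peephole-free LSTM update that matches the parameter names above ($W_*$, $U_*$, $b_*$ with * in {i, f, o, g}) can be written as below. It is given as a sketch of the standard formulation; the cell state $c_t$ and the element-wise product $\odot$ are notation assumed here.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```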
We denote the concatenation of the embeddings learned from the forward and backward LSTM states as $h_c(j) \in \mathbb{R}^{2hs \times 1}$ (where hs is the output size for the j-th word), and represent the surface context as $S_j$ (equation (8)). For the final prediction, we score each analysis by computing the inner product between its representation and the context representation, where $S_j$ and $A_j^k$ are computed by equations (8) and (1) respectively. We normalize the obtained scores using softmax and choose the analysis with the maximum score as the correct one. In what follows we refer to this model as BiLSTM.
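The prediction step described above (inner product, softmax, argmax) can be sketched in a few lines. The function names and the toy vectors are invented for illustration; in the model, the context vector and the analysis embeddings would come from the trained networks.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def disambiguate(context: Sequence[float],
                 analyses: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Score each candidate analysis by its inner product with the context,
    normalize with softmax, and return (best index, probabilities)."""
    probs = softmax([dot(context, a) for a in analyses])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy vectors (d_h = 3, two candidate analyses), for illustration only.
S_j = [0.5, -0.2, 0.1]
A_j = [[0.4, -0.1, 0.0],   # candidate 0
       [-0.3, 0.6, 0.2]]   # candidate 1
best, probs = disambiguate(S_j, A_j)
```

Since softmax is monotonic, the argmax over probabilities equals the argmax over raw inner products; the normalization matters for training rather than for the final choice.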
Finally, in a separate setting, in addition to the surface context we also incorporate in the hidden layer the immediate (left and right) morphological context, in the form of the average of the analysis representations, denoted $L_j$. Here $L_j \in \mathbb{R}^{2d_h \times 1}$ is the concatenation of the averaged representations of the leftmost and rightmost analyses, and $D_c \in \mathbb{R}^{d_h \times 2hs}$ and $D_a \in \mathbb{R}^{d_h \times 2d_h}$ are the model parameters. This advanced variation is referred to as BiLSTM*.
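A minimal sketch of how the morphological context vector could be assembled from the neighbours' analysis embeddings is given below. The names are hypothetical, and padding a missing neighbour (at a sentence boundary) with zeros is an assumption of this sketch, not something stated in the text.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def morph_context(left: Optional[Sequence[Vector]],
                  right: Optional[Sequence[Vector]],
                  d_h: int) -> Vector:
    """Concatenate the averaged analysis embeddings of the left and right
    neighbours into one vector of length 2 * d_h. A missing neighbour is
    replaced by zeros (assumption)."""
    zeros = [0.0] * d_h
    left_avg = average(left) if left else zeros
    right_avg = average(right) if right else zeros
    return left_avg + right_avg


left_analyses = [[1.0, 0.0], [0.0, 1.0]]   # two candidate analyses on the left
right_analyses = [[2.0, 4.0]]              # one candidate analysis on the right
L_j = morph_context(left_analyses, right_analyses, d_h=2)
L_first = morph_context(None, right_analyses, d_h=2)
```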

Alternative Context Representation
We also experiment with an alternative context model that uses a feed-forward NN architecture (Collobert et al., 2011; Zheng et al., 2013). In this model the word embeddings within a fixed-size window are fed to the hidden layer, and the output represents the context. The remaining parts of the architecture stay the same: we use the same morphological representation and choose the correct analysis exactly as we did for the BiLSTM model. As in the case of BiLSTM, we leverage morphological context, by performing Viterbi decoding conditioned on the leftmost analysis. We refer to this model as DNN; its advanced variation, which also uses the averaged rightmost morphological context, is referred to as DNN*.
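The fixed-window input of such a feed-forward context model can be illustrated as follows. This is a sketch under assumptions: the window is taken to be symmetric around the target word, and the padding symbol and function name are hypothetical.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding symbol for positions beyond the sentence


def context_windows(words: List[str], size: int) -> List[List[str]]:
    """For each word, return the window of `size` words on either side
    (2 * size + 1 tokens in total), padded at the sentence edges."""
    padded = [PAD] * size + words + [PAD] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(words))]


windows = context_windows(["a", "b", "c"], size=1)
```

The embeddings of the tokens in each window would then be concatenated and passed to the hidden layer.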

Training
In all models, the top layer of the networks has a softmax that computes the normalized scores over morphological candidates given the input word. The networks are trained to minimize the cross entropy of the predicted and true morphological analyses. Back-propagation is employed to compute the gradient of the corresponding objective function with respect to the model parameters.
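For a single word, the cross-entropy objective over its candidate analyses reduces to the negative log-probability of the true candidate. A minimal sketch, with a hypothetical function name and toy scores:

```python
import math
from typing import Sequence


def cross_entropy(scores: Sequence[float], gold: int) -> float:
    """Negative log-probability of the gold candidate under a softmax over
    the unnormalized scores, computed with the log-sum-exp trick."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[gold]


loss_confident = cross_entropy([4.0, 0.0, 0.0], gold=0)  # gold has the top score
loss_wrong = cross_entropy([4.0, 0.0, 0.0], gold=1)      # gold has a low score
loss_uniform = cross_entropy([1.0, 1.0], gold=0)         # two equally scored candidates
```

The loss is small when the gold analysis already has the highest score and grows as competing candidates outscore it.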

Data Sets
We conduct our experiments on Kazakh (Assylbekov et al., 2016) and Turkish (Yuret and Türe, 2006) data sets³. Table 1 shows the corpora statistics. The Kazakh data set is almost 50 times smaller than the Turkish one, with four times the OOV rate and almost twice as many analyses per word on average. Given such a drastic difference in resources, it would be interesting to see how our models perform on otherwise similar languages (both Turkic). Lastly, while the corpora provide train and test splits, there are no tuning sets, so we withhold small portions of the training sets for tuning hyper-parameters⁴.

Baselines
We compare our models to three other approaches. For Kazakh we use an HMM-based tagger and its version extended with a rule-based constraint grammar (Assylbekov et al., 2016), which is considered the state of the art for the language. We refer to these baselines as HMM and HMMCG. Another baseline is a voted perceptron (Collins, 2002) based tagger. We use our implementation of this baseline for Kazakh and the model developed by Sak et al. (2007) for Turkish. Lastly, we use a neural network model proposed by Yildiz et al. (2016), which is considered the state of the art for Turkish. For this baseline too we use our own implementation (for both languages) and refer to it as MANN⁵.

Experimental Setup
As described in the previous section, each of our models has two settings: the one that does not incorporate surrounding morphological context and the one that does (the starred one). In addition to that we use pre-trained embeddings, obtained by training a word2vec (Mikolov et al., 2013) skip-gram model on Wikipedia texts. This setting is denoted by a double dagger (‡). We perform a single-run evaluation in terms of token- and sentence-based accuracy. We consider four types of tokens: (i) all tokens; (ii) ambiguous tokens (the ones with at least two analyses); (iii) OOV tokens; (iv) ambiguous OOV tokens. Thus, we use a total of five metrics. In terms of strictness, we deem correct only the predictions that match the gold standard completely, i.e. in root, POS and MC (up to a single morpheme tag).
⁵ Note that all of the baselines are language dependent to a certain degree, with MANN being the least dependent and HMMCG the most. The latter baseline employs hand-engineered constraint grammar rules to perform initial disambiguation, followed by application of the HMM tagger, which cherry-picks the most informative grammatical features.
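The five metrics above can be computed as in the following sketch. The data layout and names are hypothetical. Two details are assumptions of the sketch: a sentence counts as correct only if all of its tokens are correct, and a token is OOV if its surface form was not seen in training.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    gold: str        # gold analysis as a single string (root, POS and MC)
    pred: str        # predicted analysis in the same format
    n_analyses: int  # number of candidate analyses for the surface form
    oov: bool        # True if the surface form is unseen in training


def accuracy(tokens: List[Token]) -> float:
    """Share of tokens whose prediction matches gold exactly (NaN if empty)."""
    if not tokens:
        return float("nan")
    return sum(t.gold == t.pred for t in tokens) / len(tokens)


def evaluate(sentences: List[List[Token]]) -> Dict[str, float]:
    """Compute the four token-level accuracies and the sentence accuracy."""
    tokens = [t for s in sentences for t in s]
    ambiguous = [t for t in tokens if t.n_analyses >= 2]
    correct_sentences = sum(all(t.gold == t.pred for t in s) for s in sentences)
    return {
        "all": accuracy(tokens),
        "ambiguous": accuracy(ambiguous),
        "oov": accuracy([t for t in tokens if t.oov]),
        "ambiguous_oov": accuracy([t for t in ambiguous if t.oov]),
        "sentence": correct_sentences / len(sentences),
    }


# Toy data with placeholder analysis strings.
sentences = [
    [Token("a", "a", 1, False), Token("b", "b", 2, True)],
    [Token("c", "x", 3, True), Token("d", "d", 2, False)],
]
result = evaluate(sentences)
```

Exact string comparison mirrors the strict criterion: a prediction with the right root and POS but one wrong morpheme tag is counted as an error.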

Results and Discussion
The results are given in Table 2. Unless stated otherwise we refer to the general (all tokens) accuracy when comparing model performances.
For Kazakh, DNN conditioned on the leftmost analysis yields 86.33% accuracy. DNN*, which additionally uses the rightmost analysis embeddings, improves on that result by almost 1% (87.25%). On the other hand, BiLSTM, whose context representation uses surface forms only, performs even better (87.49%). When this model incorporates the immediate morphological context (BiLSTM*), it performs at 90.92% and beats the HMMCG baseline. However, the latter, being a very strong language-dependent baseline, still outperforms our model in ambiguous OOV and sentence accuracy. When we evaluate our model under equal conditions (BiLSTM* ‡+CG), it beats HMMCG on all of the metrics. We separate this comparison from the rest because of its language-dependent setup.
In contrast, for Turkish the DNN models outperform BiLSTM on seen tokens and yield an almost equal 92.2% accuracy regardless of whether the rightmost morphological context is used. This performance is also higher than that of all baselines, including the state-of-the-art MANN. However, BiLSTM* is still better than DNN* in OOV token accuracy, both overall and ambiguous.
As can be seen, pre-training boosts the performance of DNN* and BiLSTM* across all metrics. For Kazakh, pre-training results in a 0.14% improvement in general token accuracy for BiLSTM*, which amounts to a 0.67% improvement over the state of the art. For Turkish this results in an almost 1% net improvement in overall token accuracy over MANN, the state of the art⁶.
A cross-linguistic comparison reveals that although the Kazakh data set is much smaller than the Turkish one and has more analyses per word on average and a higher OOV rate, on certain metrics the models perform on par or even better for Kazakh⁷. To investigate this further we have made the data sets comparable in size by randomly choosing 20.6K+ and 3.4K from the Turkish training and test sets. On this data BiLSTM* ‡ yields 91.18% and 82.0% general and ambiguous token accuracy, and the respective scores for OOV are 87.0% and 74.6%. This result follows the pattern where for Turkish only the general accuracy is higher than that for Kazakh. It turns out that the Turkish data contains many unambiguous tokens: 49% and 48% for the full and small data sets (train + test average), against 36% for Kazakh. This suggests that the higher general accuracy on Turkish data can be explained by the higher rate of unambiguous tokens. Also, Turkish has a more complex derivational morphology, which "lengthens" the analyses, e.g. the average number of morphemes per analysis is higher for Turkish (5.25) than for Kazakh (4.6). This adds sparseness to the morpheme chains and certainly further complicates disambiguation, especially in an OOV scenario.
We also observe that BiLSTM* ‡ works best on all metrics for Kazakh, but for Turkish it beats DNN* ‡ only on the OOV part. Because BiLSTM* ‡ is computationally expensive, we ran it for significantly fewer epochs than DNN. Since it is also a character-based model, we speculate that it was able to learn character-aware context embeddings and is hence better on OOV tokens.

Related Work
A morphology-aware NN (MANN) for MD was proposed by Yildiz et al. (2016), and has been reported to achieve ambiguous token accuracies of 84.12, 88.35 and 93.78% for Turkish, Finnish and Hungarian respectively. This approach differs from ours in a number of ways. (i) Our analysis representation treats morpheme tags in a language-independent manner, considering every tag found in the training set, whereas in MANN certain tags are chosen with a specific language in mind. (ii) MANN is a feed-forward NN that, unlike our approach, does not account for the surface context. (iii) As far as we understand, at the decoding step MANN makes use of the gold standard, whereas our models have no need for that.
Although several statistical models have been proposed for Kazakh MD, such as HMM- (Makazhanov et al., 2014;Makhambetov et al., 2015;Assylbekov et al., 2016), voted perceptron- (Tolegen et al., 2016) and transformation-based (Kessikbayeva and Cicekli, 2016) taggers, to our knowledge ours is the first deep learning-based approach to the problem that is also purely language independent.
It is becoming increasingly popular to use richer architectures to learn better embeddings from characters/words (Yessenbayev and Makazhanov, 2016;Ling et al., 2015;Wieting et al., 2016). Ling et al. (2015) used a BiLSTM to learn word vectors, showing strong performance on language modeling and POS tagging. Melamud et al. (2016) proposed context2vec, a BiLSTM based model to learn context embedding of target words and achieved state-of-the-art results on sentence completion and word sense disambiguation.

Conclusion
We have proposed a general MD framework for MCL that can be analyzed in <root, POS, MC> triplets. We have shown that the surface context can be useful for MD, especially if combined with morphological context. Our next step would be to assess our claims on a larger number of typologically distant languages.