Context-Aware Prediction of Derivational Word-forms

Derivational morphology is a fundamental and complex characteristic of language. In this paper we propose a new task of predicting the derivational form of a given base-form lemma that is appropriate for a given context. We present an encoder-decoder style neural network that produces a derived form character by character, based on a character-level representation of the base form together with its sentential context. We demonstrate that our model is able to generate valid context-sensitive derivations from known base forms, but is less accurate under a lexicon-agnostic setting.


Introduction
Understanding how new words are formed is a fundamental task in linguistics and language modelling, with significant implications for tasks with a generation component, such as abstractive summarisation and machine translation. In this paper we focus on modelling derivational morphology, to learn, e.g., that the appropriate derivational form of the verb succeed is succession given the context As third in the line of . . . , but is success in The play was a great . . . . English is broadly considered to be a morphologically impoverished language, and there are certainly many regularities in morphological patterns, e.g., the common usage of -able to transform a verb into an adjective, or -ly to form an adverb from an adjective. However, there is considerable subtlety in English derivational morphology, in the form of: (a) idiosyncratic derivations, e.g. picturesque vs. beautiful vs. splendid as adjectival forms of the nouns picture, beauty and splendour, respectively; (b) derivational generation in context, which requires the automatic determination of the part-of-speech (POS) of the stem, the likely POS of the word in context, and POS-specific derivational rules; and (c) multiple derivational forms for a given stem, which must be selected between based on the context (e.g. success and succession as nominal forms of succeed, as seen above). As such, there are many aspects that affect the choice of derivational transformation, including morphotactics, phonology, semantics and even etymological characteristics. Earlier work (Thorndike, 1941) analysed the ambiguity of derivational suffixes themselves, where the same suffix can express different semantics depending on the base form it attaches to (cf. beautiful vs. cupful).
Furthermore, as Richardson (1977) previously noted, even words with quite similar semantics and orthography such as horror and terror might have non-overlapping patterns: although we observe regularity in some common forms, for example, horrify and terrify, and horrible and terrible, nothing tells us why we observe terrorize and no instances of horrorize, or horrid, but not terrid.
In this paper, we propose the new task of predicting a derived form from its context and a base form. Our motivation in this research is primarily linguistic, i.e. we measure the degree to which it is possible to predict particular derivational forms from context. A similar task has been proposed in the context of studying how children master derivations (Singson et al., 2000). In their work, children were asked to complete a sentence by choosing one of four possible derivations, each corresponding to a noun, verb, adjective, or adverbial form. Singson et al. (2000) showed that children's ability to recognize the correct form correlates with their reading ability. This observation confirms an earlier idea that orthographic regularities provide clearer clues to morphological transformations than phonological rules do (Templeton, 1980; Moskowitz, 1973), especially in languages such as English where grapheme-phoneme correspondences are opaque. For this reason we consider orthographic rather than phonological representations.
In our approach, we test how well models incorporating distributional semantics can capture derivational transformations. Deep learning models capable of learning real-valued word embeddings have been shown to perform well on a range of tasks, from language modelling (Mikolov et al., 2013a) to parsing (Dyer et al., 2015) and machine translation (Bahdanau et al., 2015). Recently, these models have also been successfully applied to morphological reinflection tasks (Kann and Schütze, 2016;Cotterell et al., 2016a).

Derivational Morphology
Morphology, the linguistic study of the internal structure of words, has two main goals: (1) to describe the relation between different words in the lexicon; and (2) to decompose words into morphemes, the smallest linguistic units bearing meaning. Morphology can be divided into two types: inflectional and derivational. Inflectional morphology is the set of processes through which the word form outwardly displays syntactic information, e.g., verb tense. It follows that an inflectional affix typically neither changes the part-of-speech (POS) nor the semantics of the word. For example, the English verb to run takes various forms: run, runs and ran, all of which convey the concept "moving by foot quickly", but appear in complementary syntactic contexts.
Derivation, on the other hand, deals with the formation of new words that have semantic shifts in meaning (often including POS) and is tightly intertwined with lexical semantics (Light, 1996). Consider the example of the English noun discontentedness, which is derived from the adjective discontented. It is true that both words share a close semantic relationship, but the transformation is clearly more than a simple inflectional marking of syntax. Indeed, we can go one step further and define a chain of words content → contented → discontented → discontentedness.
In this work, we deal with the formation of deverbal nouns, i.e., nouns that are formed from verbs. Common examples of this in English include agentives (e.g., explain → explainer), gerunds (e.g., explain → explaining), as well as other nominalisations (e.g., explain → explanation). Nominalisations differ in meaning from their base verbs to varying degrees, and a key focus of this study is the prediction of which form is most appropriate depending on the context, in terms of syntactic and semantic concordance. Our model is highly flexible and easily applicable to other related lexical problems.

Related Work
Although many neural morphological models have been proposed in the last few years, most of them focus on inflectional morphology (e.g., see Cotterell et al. (2016a)). Focusing on derivational processes, there are three main directions of research. The first deals with the evaluation of word embeddings, either using a word analogy task (Gladkova et al., 2016) or binary relation type classification (Vylomova et al., 2016). In this context, it has been shown that, unlike inflectional morphology, most derivational relations cannot be easily captured using distributional methods. The second line of work attempts to predict a derived form from the embedding of its base form and a vector encoding a "derivational" shift. Guevara (2011) notes that derivational affixes can be modelled as a geometrical function over the vectors of the base forms. On the other hand, Lazaridou et al. (2013) and Cotterell and Schütze (2017) represent derivational affixes as vectors and investigate various functions to combine them with base forms. Kisselew et al. (2015) and Padó et al. (2016) extend this line of research to model derivational morphology in German, demonstrating that various factors such as part of speech, semantic regularity and argument structure (Grimshaw, 1990) influence the predictability of a derived word. The third area of research focuses on the analysis of derivationally complex forms, which differs from this study in that we focus on generation. The goal of this line of work is to produce a canonicalised segmentation of an input word into its constituent morphs, e.g., unhappiness → un + happy + ness (Cotterell et al., 2015; Cotterell et al., 2016b); note that the orthographic change y → i has been reversed.

Dataset
As the starting point for the construction of our dataset, we used the CELEX English dataset (Baayen et al., 1993). We extracted verb-noun lemma pairs from CELEX, covering 24 different nominalisational suffixes and 1,456 base lemmas. Suffixes occurring in 5 or fewer lemma pairs mainly corresponded to loan words and were consequently filtered out. We augmented this dataset with verb-verb pairs, one for each verb present in the verb-noun pairs, to capture the case of a verbal form being appropriate for the given context. 1 For each noun and verb lemma, we generated all of their inflections, and searched for sentential contexts of each inflected token in a pre-tokenised dump of English Wikipedia. 2 To dampen the effect of high-frequency words, we applied a heuristic threshold based on a weighted logarithm of the number of contexts for each lemma pair. The final dataset contains 3,079 unique lemma pairs represented in 107,041 contextual instances. 3
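The frequency-dampening step can be sketched as follows. This is an illustrative stand-in for the heuristic described above: the weighting constant is an assumption, not the value used to build the dataset.

```python
import math

def dampened_context_count(n_contexts, weight=5.0):
    """Cap the number of sampled contexts for a lemma pair at a
    weighted logarithm of its total context count.  `weight` is an
    illustrative assumption, not the paper's actual setting."""
    if n_contexts <= 1:
        return n_contexts
    return min(n_contexts, int(weight * math.log(n_contexts)))

# A rare pair keeps all of its contexts; a very frequent one is capped.
assert dampened_context_count(3) == 3
assert dampened_context_count(100000) < 100000
```

The effect is that every lemma pair is represented, but no single high-frequency pair dominates the contextual instances.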

Experiments
In this paper we model derivational morphology as a prediction task, formulated as follows. We take sentences containing a derivational form of a given lemma, then obscure the derivational form by replacing it with its base form lemma. The system must then predict the original (derivational) form, which may make use of the sentential context. System predictions are judged correct if they exactly match the original derived form.
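The construction of a single prediction instance can be sketched as follows; the field names and helper are hypothetical, purely to make the task setup concrete.

```python
def make_instance(sentence, derived, base):
    """Build one prediction instance: obscure the derived form by
    removing it from the sentence; the original derived form becomes
    the gold target that the system must regenerate exactly."""
    tokens = sentence.split()
    i = tokens.index(derived)
    left, right = tokens[:i], tokens[i + 1:]
    return {"left": left, "right": right, "base": base, "target": derived}

inst = make_instance("the play was a great success", "success", "succeed")
# the system sees `base` plus the left/right context, and is judged
# correct only if it outputs `target` ("success") exactly
```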

Baseline
As a baseline, we trained a trigram language model with modified Kneser-Ney smoothing on the training dataset. Each sentence in the test data was augmented with a set of confabulated sentences, in which the target word was replaced with one of its other derivations or with its base form. Unlike the general task, where we generate word forms as character sequences, here we use the set of known inflected forms for each lemma (from the training data). We then used the language model to score each collection of test sentences, selected the variant with the highest language model score, and evaluated the accuracy of selecting the original word form.
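The candidate-selection procedure can be sketched as below. The scorer here is a tiny add-one trigram model standing in for the modified Kneser-Ney model actually used; the class and function names are illustrative.

```python
import math
from collections import Counter

class TrigramLM:
    """Tiny add-one-smoothed trigram model (a stand-in for the
    modified Kneser-Ney model used as the baseline)."""
    def __init__(self, sentences):
        self.tri, self.bi = Counter(), Counter()
        self.vocab = set()
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            self.vocab.update(toks)
            for a, b, c in zip(toks, toks[1:], toks[2:]):
                self.tri[(a, b, c)] += 1
                self.bi[(a, b)] += 1

    def logscore(self, sentence):
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        v = len(self.vocab)
        return sum(
            math.log((self.tri[(a, b, c)] + 1) / (self.bi[(a, b)] + v))
            for a, b, c in zip(toks, toks[1:], toks[2:]))

def predict(lm, template, candidates):
    """Fill each known word form into the context and keep the one
    the language model scores highest."""
    return max(candidates, key=lambda w: lm.logscore(template.format(w)))

lm = TrigramLM(["the play was a great success",
                "she will succeed her father"])
best = predict(lm, "it was a great {}", ["succeed", "success", "succession"])
# → "success"
```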

Encoder-Decoder Model
We propose an encoder-decoder model. The encoder combines the left and the right contexts as well as a character-level base form representation: t = tanh(H · [h→left; h←left; h→right; h←right; h→base; h←base] + b_h), where h→left, h←left, h→right, h←right, h→base and h←base correspond to the last hidden states of an LSTM (Hochreiter and Schmidhuber, 1997) over the left context, the right context and the character-level representation of the base form (in each case, applied forwards and backwards), respectively; H ∈ R^(1.5hl × 6hl) is a weight matrix, and b_h ∈ R^(1.5hl) is a bias term.
[; ] denotes a vector concatenation operation, h is the hidden state dimensionality, and l is the number of layers.
Next we add an extra affine transformation, o = T · t + b_o, where T ∈ R^(hl × 1.5hl) and b_o ∈ R^(hl); o is then fed into the decoder: g_j = tanh(B · [c_j; l_{j+1}] + S · g_{j−1} + R · o + b_d), where g_j is the decoder hidden state, c_j is an embedding of the j-th character of the derivation, l_{j+1} is an embedding of the corresponding base character, B, S, R are weight matrices, and b_d is a bias term.
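To make the dimensionality bookkeeping concrete, the following numpy sketch mirrors the encoder combination of the six LSTM states and the affine transformation o = T · t + b_o. The sizes are toy values, the weights are random, and the tanh nonlinearity is an illustrative assumption; this only checks that the shapes line up.

```python
import numpy as np

h, l = 4, 3          # hidden size and layer count (toy values)
hl = h * l
rng = np.random.default_rng(0)

# last hidden states of the bidirectional LSTMs over the left context,
# right context and base-form characters: six vectors of size h*l
states = [rng.standard_normal(hl) for _ in range(6)]

H   = rng.standard_normal((int(1.5 * hl), 6 * hl))  # R^(1.5hl x 6hl)
b_h = rng.standard_normal(int(1.5 * hl))
T   = rng.standard_normal((hl, int(1.5 * hl)))      # R^(hl x 1.5hl)
b_o = rng.standard_normal(hl)

t = np.tanh(H @ np.concatenate(states) + b_h)  # combined representation
o = T @ t + b_o                                # fed to the decoder each step

assert t.shape == (int(1.5 * hl),) and o.shape == (hl,)
```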
We now elaborate on the design choices behind the model architecture, which have been tailored to our task. We supply the model with the l_{j+1} character prefix of the base word to enable a copying mechanism, biasing the model to generate a derived form that is morphologically related to the base verb. In most cases, the derived form is longer than its stem; accordingly, when we reach the end of the base form, we continue to input an end-of-word symbol. We provide the model with the context vector o at each decoding step, as it has been previously shown (Hoang et al., 2016) that this yields better results than other means of incorporation. 4 Finally, we use max pooling to enable the model to switch between copying the stem and producing a new character.
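The input construction behind the copying mechanism can be sketched as follows: at each decoding step the model receives the previously generated character alongside the next base-form character, padded with an end-of-word symbol once the base form is exhausted. The helper and the <w>/</w> symbols are illustrative, not taken from the paper's implementation.

```python
def decoder_inputs(base, generated, bow="<w>", eow="</w>"):
    """Pair each decoding step with the corresponding base-form
    character, padding with the end-of-word symbol once the base is
    exhausted, so the model can learn to copy the stem (a sketch)."""
    chars = list(base)
    steps = []
    for j, prev in enumerate([bow] + list(generated)):
        nxt = chars[j] if j < len(chars) else eow
        steps.append((prev, nxt))
    return steps

# deriving "explanation" from "explain": the first steps track the
# shared prefix, then the base side is padded with </w>
steps = decoder_inputs("explain", "explanation")
```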

Settings
We used a 3-layer bidirectional LSTM network, with a hidden dimensionality h of 100 for both the context and base-form stem states, and character embeddings c_j of dimensionality 100. 5 We used pre-trained 300-dimensional Google News word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b). During training, we kept the word embeddings fixed, for greater applicability to unseen test instances. All tokens that did not appear in this set were replaced with UNK sentinel tokens. The network was trained using SGD with momentum until convergence.

Results
With the encoder-decoder model, we experimented with the full encoder-decoder as described in Section 5.2 ("biLSTM+CTX+BS"), as well as several variations, namely: excluding the context information ("biLSTM+BS"), and excluding the bidirectional stem ("biLSTM+CTX"). We also investigated how much improvement we can obtain from knowing the POS tag of the derived form, by presenting it explicitly to the model as extra conditioning context ("biLSTM+CTX+BS+POS"); the main motivation for this relates to gerunds, which are difficult to predict reliably without POS information. We then tried a single-directional context representation, using only the last hidden states h→left and h←right, corresponding to the words to the immediate left and right of the word form to be predicted ("LSTM+CTX+BS+POS").
We ran two experiments: first, a shared lexicon experiment, in which every stem in the test data was also present in the training data; and second, a split lexicon experiment, in which every stem in the test data was unseen in the training data. The results are presented in Table 1, and show that: (1) context has a strong impact on results, particularly in the shared lexicon case; (2) there is strong complementarity between the context and character representations, particularly in the split lexicon case; and (3) POS information is particularly helpful in the split lexicon case. Note that most of the models significantly outperform our baseline under the shared lexicon setting. The baseline does not support the split lexicon setting (as the derivational forms of interest, by definition, do not occur in the training data), so we cannot generate results for it in this setting.

Error Analysis
We carried out an error analysis of the forms produced by the LSTM+CTX+BS+POS model. First, the model sometimes struggles to differentiate between nominal suffixes: in some cases it produces an agentive suffix (-er or -or) in contexts where a non-agentive nominalisation (e.g. -ation or -ment) is appropriate. As an illustration of this, Figure 2 shows a t-SNE projection of the context representations for simulate vs. simulator vs. simulation, demonstrating that the different nominal forms have strong overlap. Secondly, although the model learns well whether to copy or produce a new symbol, some forms are spelled incorrectly. Examples of this are studint, studion or even studyant rather than student as the agentive nominalisation of study. Here, the issue is opaqueness in the etymology, with student being borrowed from the Old French estudiant. For transformations which are native to English, for example -ate → -ation, the model is much more accurate. Table 2 shows the recall achieved for various suffix types. We do not present precision, since it cannot be reliably estimated without extensive manual analysis.
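The per-suffix recall reported in Table 2 can be computed by grouping gold derived forms by their suffix and measuring the fraction predicted exactly; the helper and suffix list below are illustrative, not the paper's evaluation code.

```python
from collections import defaultdict

def recall_by_suffix(pairs, suffixes):
    """Group (gold, predicted) pairs by the gold form's suffix and
    report the fraction predicted exactly right (a sketch; the first
    matching suffix in `suffixes` wins)."""
    hit, total = defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        for suf in suffixes:
            if gold.endswith(suf):
                total[suf] += 1
                hit[suf] += (gold == pred)
                break
    return {s: hit[s] / total[s] for s in total}

r = recall_by_suffix(
    [("simulation", "simulation"), ("simulation", "simulator"),
     ("student", "studint")],
    ["ation", "ent", "er"])
# → {"ation": 0.5, "ent": 0.0}
```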
In the split lexicon setting, the model sometimes misses double consonants at the end of words, producing wraper and winer, and is biased towards generating the most productive suffixes; an example of the latter is stoption in place of stoppage. We also studied how the training set size affects the model's accuracy, by varying it between 1,000 and 60,000 instances (maintaining a balance over lemmas). Interestingly, we did not observe a significant reduction in accuracy. Finally, note that under the split lexicon setting, the model is agnostic of existing derivations, sometimes overgenerating possible forms. A nice illustration of this is trailation, trailment and trailer all being produced in contexts of trailer. In other cases, the model misses some of the derivations, for instance predicting only government in the contexts of both governance and government. We hypothesize that this is due either to very subtle differences in their contexts, or to the higher productivity of -ment.
Finally, we experimented with some nonsense stems, overwriting sentential instances of transcribe to generate context-sensitive derivational forms. Table 3 presents the nonsense stems, the correct form of transcribe for a given context, and the predicted derivational form of the nonsense word. Note that the base form is used correctly (top row) for three of the four nonsense words, and that despite the wide variety of output forms, they resemble plausible words in English. By looking at a larger slice of the data, we observed some regularities. For instance, fapery was mainly produced in the contexts of transcript, whereas fapication was more related to transcription. Table 3 also shows that some of the stems appear to be more productive than others.

Table 3: An experiment with nonsense "target" base forms generated in sentence contexts of the "original" word transcribe

Conclusions and Future Work
We investigated the novel task of context-sensitive derivation prediction for English, and proposed an encoder-decoder model to generate nominalisations. Our best model achieved an accuracy of 90% on a shared lexicon, and 66% on a split lexicon. This suggests that there is regularity in derivational processes and that, in many cases, the context is indeed indicative. As mentioned earlier, many open questions remain, which we leave for future work. We also plan to scale to other languages and to augment our dataset with Wiktionary data, to achieve much greater coverage and variety of derivational forms.