IPS-WASEDA system at CoNLL–SIGMORPHON 2018 Shared Task on morphological inflection

This paper presents the system submitted by IPS-WASEDA University for the CoNLL–SIGMORPHON 2018 Shared Task 1: type-level inflection. We develop a system based on a holistic approach, which treats the whole word form as a unit instead of breaking it into smaller pieces (e.g. morphemes) as the baseline system does. We also implement an encoder-decoder model, an architecture which has recently become the new standard in many natural language processing (NLP) tasks. The results show that the neural approach outperforms the baseline and our holistic approach in the higher-resource settings. Data augmentation improves the performance of the neural model in the lower-resource settings, although it still cannot beat the other systems. In the low-resource setting, our holistic approach performs best, ahead of the baseline and the neural approach (even with data augmentation).

We address the inflection task: given a lemma (i.e. the dictionary form of a word) and the morphosyntactic descriptions (MSD) of the target form, generate the inflected target form. Figure 1 shows an example of the inflection task in English.
Many NLP tasks, like machine translation, require the analysis and generation of morphological word forms, even previously unseen ones. Different languages exhibit different degrees of morphological richness, which makes the task an interesting problem. Dreyer and Eisner (2011) show that data sparsity is a common issue for languages with rich morphology, which usually leads to poor generalisation in machine learning.
There are three main approaches to the problem:
• The hand-engineered rule-based approach offers high accuracy but is time-consuming to construct. It usually faces the word-coverage problem and is language-dependent.
• The supervised approach automatically induces rules from the given training data and applies the best rules to generate the target forms by using some classification technique (Ahlberg et al., 2015). It is practically language-independent and relatively easy to build. However, data sparsity is an issue.
• The neural approach has recently achieved the best results on the task, especially the RNN encoder-decoder model (Kann and Schütze, 2016; Makarov et al., 2017). Its drawbacks are very long training times and the need for a large amount of training data.
This paper describes the systems we developed for the CoNLL-SIGMORPHON 2018 Shared Task 1 (Cotterell et al., 2018). The recent success of the neural approach encouraged us to implement a sequence-to-sequence (seq2seq) model to solve the task. Knowing that the neural approach tends to need a large amount of training data, we also consider a holistic approach as a back-off: we treat the task of generating target forms as the task of solving analogical equations between words.

Languages and data
• dev: this dataset is given to evaluate the performance of our system during the development phase. It consists of 1,000 word forms.
• test: this dataset is given at the test phase. It does not contain the target forms. It consists of 1,000 word forms, like the dev dataset.
For some languages, the size of the dataset is smaller than the one mentioned above.
Let us now look at some statistics on the given datasets, shown in Table 1. Overall, the number of pieces of information (features) found in the training dataset is non-decreasing from the low to the high setting. Conversely, the amount of unseen information in the dev dataset relative to the training dataset is non-increasing. This shows that the bigger resources gradually cover the data left unseen by the smaller ones.
Norman, Telugu, Cornish, and Uzbek are languages with a smaller number of lemmata in the training dataset. However, these languages tend to have fewer unseen lemmata in the dev dataset, even zero for some of them. They also have a smaller number of unseen characters. On the other hand, languages like Finnish, Russian, English, French, and German have the largest number of unseen lemmata despite having the largest number of lemmata in the training dataset.
Let us now turn to the number of MSDs and MSD patterns. These numbers can be interpreted as an indication of how large or complex the paradigm of a particular language is. Basque, Quechua, Turkish, and Zulu are languages with a higher variety of unique MSD patterns. Basque, in particular, has more than 1,600 patterns, compared to an average of around 126 patterns per language in the high datasets. The same can be seen for the low and medium data. In the low training dataset, almost every line is associated with a different MSD pattern. Furthermore, Basque is also the language with the highest number of unseen MSD patterns for all dataset sizes.
We also count the number of rules found in the dataset (see the last two rows in Table 1). These rules are not the morphological rules defined by linguists but the ones extracted by the method explained in Section 5.3.1. For all languages and all datasets, we count how many unique rules can be extracted and how many of the rules in the respective dev dataset are unseen. Telugu, Tatar, and Swahili are the languages with the lowest number of unseen rules. We expect good performance on these languages because most of the transformations from lemma to target form are present in the training data.

Baseline system: morpheme-based
The CoNLL-SIGMORPHON 2018 organizers provide a baseline system which follows a morpheme-based approach. For each language, it determines whether the language is biased towards prefixing or suffixing. The strings are reversed if the language is biased towards prefixing.
For each instance in the training data, it aligns the lemma and the target form using Levenshtein distance to cut each word into three candidate parts: prefix, stem, and suffix. Prefixing and suffixing rules are then extracted and grouped according to the given MSD pattern. The rules are stored as a list of triplets: substring to replace, replacement string, and number of occurrences. Figure 2 illustrates how the baseline system stores the suffixing rules for the English present participle.
In the generation step, it filters the candidate rules by the given target MSD pattern. First, the longest common suffixing rule with the highest number of occurrences is applied. Then the most frequent prefixing rule is applied in turn to generate the predicted target form.
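The suffixing step of the generation procedure can be sketched as follows. This is a minimal illustration and not the organizers' code: the function name `apply_suffix_rule`, the rule table, and the counts are all made up for the example, the ";"-separated MSD key is an assumption, and the sketch reads "longest rule with the highest number of occurrences" as "prefer the longest matching suffix, then the higher count". Prefixing rules and string reversal are left out.

```python
from typing import Dict, List, Tuple

# A rule is a triplet: (substring to replace, replacement, number of occurrences).
Rule = Tuple[str, str, int]


def apply_suffix_rule(lemma: str, msd: str, rules: Dict[str, List[Rule]]) -> str:
    """Apply the longest matching suffix rule stored for the given MSD pattern.

    Ties on length are broken by the number of occurrences (an assumption).
    If no rule matches, the lemma is returned unchanged.
    """
    candidates = [r for r in rules.get(msd, []) if lemma.endswith(r[0])]
    if not candidates:
        return lemma
    old, new, _ = max(candidates, key=lambda r: (len(r[0]), r[2]))
    return lemma[: len(lemma) - len(old)] + new


# Toy rule table with invented counts, for illustration only.
toy_rules = {
    "V;V.PTCP;PRS": [("", "ing", 120), ("e", "ing", 45), ("te", "ting", 10)],
}

print(apply_suffix_rule("illustrate", "V;V.PTCP;PRS", toy_rules))
print(apply_suffix_rule("walk", "V;V.PTCP;PRS", toy_rules))
```

Slicing with `len(lemma) - len(old)` rather than a negative index keeps the empty-suffix rule working, since `lemma[:-0]` would return an empty string.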

Holistic approach
Another view on the problem is to see that word forms are connected with other word forms systematically. Based on this observation, we can treat the inflection task as the task of solving analogical equation on words 1 :

Proportional analogy
Analogy is a relationship between four objects A, B, C, and D, usually written A : B :: C : D. It states that A is to B as C is to D, where the ratio between A and B is the same as the ratio between C and D. Here, we consider analogy as a possible way to explain derivation between words, as it has been used from the ancient Greek and Latin grammatical tradition up to recent work in computational linguistics (Hathout, 2008, 2009). Various formalisations of analogy have been proposed (Yvon, 2003; Lepage, 2004; Stroppa and Yvon, 2005). In this work, we select the following definition:
(1)
We can construct analogical grids (Fam and Lepage, 2017b, 2018) to give a compact view of the different analogies that emerge from a set of words contained in a corpus. An analogical grid is an M × N matrix of words. The special property of this matrix is that any four words taken from two rows and two columns form an analogy (see Formula 2).
\[
\begin{array}{ccccccc}
P_1^1 & : & P_1^2 & : & \cdots & : & P_1^m \\
P_2^1 & : & P_2^2 & : & \cdots & : & P_2^m \\
\vdots & & \vdots & & & & \vdots \\
P_n^1 & : & P_n^2 & : & \cdots & : & P_n^m
\end{array}
\qquad (2)
\]

Solving analogical equation to generate word form
In contrast to the baseline system, which uses a morpheme-based approach, our holistic approach does not break words into pieces (Singh, 2000; Singh and Ford, 2000; Neuvel and Singh, 2001). We generate the target form by solving analogical equations based on the evidence observed in the given training data. First, the relevant analogical grid is selected according to the given target MSD pattern. If several candidate analogical equations exist, we use heuristic features to select one. These heuristics are the edit distance, the longest common subsequence, the longest common suffix, and the longest common prefix between the given lemma and the lemmata that exist in the training dataset. If several candidates still remain after applying the heuristic features, we solve all of the possible analogical equations to generate all the possible predicted target forms. The most frequent answer is chosen as the predicted target form.
For example, Figure 3 illustrates how to generate the target form for the example given in Figure 1. Let us say that we obtain two analogical grids for the given MSD pattern. We construct the analogical equation as follows: lemma_t : form_t :: illustrate : form_q, where lemma_t and form_t are taken from the first and second columns of the analogical grids selected according to the given MSD pattern. Based on the longest common suffix, we choose the top grid, which produces the word form illustrating, instead of the bottom one, which produces illustrateing.
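The selection step in this example can be sketched in a few lines. The sketch below is a deliberately simplified stand-in, not the analogy solver used in the paper: it only handles equations in which the two known forms differ in their endings, it uses the longest common suffix as the single heuristic, and the training pairs (walk, walking) and (create, creating) are hypothetical, since the actual content of Figure 3 is not reproduced here.

```python
from typing import List, Optional, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b."""
    return common_prefix_len(a[::-1], b[::-1])


def solve_suffix_analogy(a: str, b: str, c: str) -> Optional[str]:
    """Solve a : b :: c : x, assuming a and b differ only in their endings."""
    k = common_prefix_len(a, b)
    old, new = a[k:], b[k:]
    if not c.endswith(old):
        return None  # this simplified solver cannot handle the equation
    return c[: len(c) - len(old)] + new


def inflect(lemma: str, pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Pick the (lemma_t, form_t) pair whose lemma shares the longest suffix
    with the query lemma, then solve lemma_t : form_t :: lemma : x."""
    best = max(pairs, key=lambda p: common_suffix_len(p[0], lemma))
    return solve_suffix_analogy(best[0], best[1], lemma)


# Hypothetical training evidence for the English present participle.
pairs = [("walk", "walking"), ("create", "creating")]
print(solve_suffix_analogy("walk", "walking", "illustrate"))
print(inflect("illustrate", pairs))
```

With these toy pairs, the equation built from walk yields the ill-formed illustrateing, while the pair selected by the longest common suffix (create) yields illustrating.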

Neural approach
Following the recent success of the neural approach in previous evaluation campaigns, we implement a common seq2seq architecture. We treat the inflection task as the problem of translating the given target MSDs and lemma into the target form. Thus, the input string for the example given in Figure 1 is as follows.
V V.PTCP PRS i l l u s t r a t e
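A minimal sketch of how such an input string could be built is given below. The helper name `encode_input` is hypothetical, and the sketch assumes that the MSD is supplied as a single ";"-separated string; each MSD tag becomes one token and each character of the lemma becomes one token.

```python
def encode_input(lemma: str, msd: str) -> str:
    """Build the source sequence: MSD tags first, then the lemma's characters.

    Assumes the MSD tags are separated by ';' (an assumption about the
    data format, not something stated in this section).
    """
    tags = msd.split(";")
    return " ".join(tags + list(lemma))


print(encode_input("illustrate", "V;V.PTCP;PRS"))
```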

Seq2seq model
Our model is a standard seq2seq model with an attention mechanism, inspired by the one used for machine translation (Luong et al., 2015). The difference is that we consider a character or an MSD as one token, instead of a word. Each token is represented by a continuous vector learned in the embedding layer.
We use a bi-directional Gated Recurrent Unit (GRU) cell (Cho et al., 2014), which is a variation of the Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber, 1997) that tries to solve the vanishing gradient problem. Our decoder consists of two layers of uni-directional GRU cells with an attention mechanism. There are various implementations of the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). In this work, we use the one with weight normalization (Salimans and Kingma, 2016) to help the model converge faster.
To handle unseen tokens, we store them in a First-In-First-Out (FIFO) list and replace them with a special token <UNK> before feeding the input into our model. After the decoding phase, these special tokens are replaced by the characters stored in the list.
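The FIFO mechanism can be sketched as below. The function names `mask_unknown` and `restore_unknown` are illustrative, and one detail is an assumption: when the decoder emits more <UNK> tokens than were stored, the extra ones are left in place.

```python
from collections import deque
from typing import Deque, Iterable, List, Set, Tuple

UNK = "<UNK>"


def mask_unknown(tokens: Iterable[str], vocab: Set[str]) -> Tuple[List[str], Deque[str]]:
    """Replace out-of-vocabulary tokens with <UNK>, remembering them in order."""
    queue: Deque[str] = deque()
    masked: List[str] = []
    for token in tokens:
        if token in vocab:
            masked.append(token)
        else:
            queue.append(token)
            masked.append(UNK)
    return masked, queue


def restore_unknown(tokens: Iterable[str], queue: Iterable[str]) -> List[str]:
    """Put the remembered tokens back, first in, first out."""
    pending = deque(queue)
    restored: List[str] = []
    for token in tokens:
        if token == UNK and pending:
            restored.append(pending.popleft())
        else:
            restored.append(token)
    return restored


vocab = set("abc")
masked, queue = mask_unknown(list("abxcy"), vocab)
print(masked, list(queue))
print(restore_unknown(["<UNK>", "a", "<UNK>"], queue))
```

Restoring in first-in-first-out order implicitly assumes that unseen characters appear in the target form in the same order as in the lemma.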

Hyperparameters
We fixed our hyperparameters for all languages and amounts of resources after some preliminary experiments. The number of hidden units is fixed to 100 for each layer in the encoder and decoder. The size of the embedding is 300. We optimize the model using Adam (Kingma and Ba, 2015) with a learning rate of 5 × 10^-4 during training. To make the training process faster, we use a mini-batch size of 20.
We trained the model with early stopping: training stops after 30 epochs without improvement on the validation data, which is a set of lines randomly selected from the original training data.
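The early-stopping criterion can be sketched independently of any neural network library. The function below is a hypothetical helper that takes the validation scores of successive epochs (higher is better, an assumption) and reports when training would stop under a patience of 30 epochs.

```python
from typing import Sequence, Tuple


def stopping_epoch(val_scores: Sequence[float], patience: int = 30) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch of the best score).

    Epochs are numbered from 1. Training stops once `patience` epochs
    have passed without any improvement over the best validation score.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores), best_epoch


# Toy validation curve with a small patience, for illustration only.
print(stopping_epoch([0.1, 0.5, 0.4, 0.3, 0.2], patience=2))
```

In practice the parameters saved at the best epoch would be the ones kept, which is why the sketch returns both values.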

Simple data augmentation
Preliminary results show that the neural approach suffers from the lack of data. To tackle this problem, we perform a simple data augmentation which artificially creates additional training data from evidence seen in the original training data. Additional training data is expected to improve the performance of our model, especially in the low data situation (Bergmanis et al., 2017; Silfverberg et al., 2017; Zhou and Neubig, 2017; Nicolai et al., 2017).

Rule extraction
We find the longest common substring between the lemma and the target form. The part to its left is taken as the prefix candidate, while the part to its right is taken as the suffix candidate. Figure 4 shows several examples of rules extracted from the training data in three different languages.
To capture situational affixing, where the next or previous character influences the change, we add the first character of the longest common substring to the extracted prefix candidate and its last character to the suffix candidate. This happens, for example, for the regular past form in English, where only -d is added as a suffix to lemmata ending in e, instead of -ed.
At a glance, this looks similar to how the baseline system extracts the affix rules. However, we only memorize the left part (prefix candidate) and the right part (suffix candidate), not all of the possible affix combinations with the stem as the baseline system does. This simplifies the rule extraction and thus gives us a smaller number of extracted rules in comparison to the baseline system.
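The extraction described above can be sketched with the standard library. This is an illustration under assumptions: the function name `extract_rule` and the pair-of-pairs rule format are invented for the example, and when several common substrings have the same maximal length the choice is whatever `difflib` returns (the earliest one in the lemma).

```python
from difflib import SequenceMatcher
from typing import Tuple

# A change is a pair: (part seen in the lemma, part seen in the target form).
Change = Tuple[str, str]


def extract_rule(lemma: str, form: str) -> Tuple[Change, Change]:
    """Extract a (prefix change, suffix change) rule from one training pair.

    The longest common substring splits both words into a left and a right
    part. Its first character is appended to the prefix candidate and its
    last character is prepended to the suffix candidate as context.
    """
    matcher = SequenceMatcher(None, lemma, form, autojunk=False)
    m = matcher.find_longest_match(0, len(lemma), 0, len(form))
    common = lemma[m.a : m.a + m.size]
    first, last = common[:1], common[-1:]
    prefix = (lemma[: m.a] + first, form[: m.b] + first)
    suffix = (last + lemma[m.a + m.size :], last + form[m.b + m.size :])
    return prefix, suffix


print(extract_rule("illustrate", "illustrated"))
print(extract_rule("walk", "walked"))
```

The two printed rules differ in their suffix part: the context character keeps "e becomes ed" apart from "k becomes ked", which is the situational affixing the paragraph describes.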

Creating additional training data
For each rule which appears fewer than 10 times in the training data, we artificially create 5 additional training instances. Each additional instance is constructed from a random string whose length is a random integer between 1 and 4. Here, we do not employ any language model to assess the probability of the character sequence, unlike the method described in (Silfverberg et al., 2017). For example, we can create the following additional training instances for the examples given in Figure 4. Characters written in boldface are patterns from the extracted rules.
Irish: fbsód ⟹ na fbsódí
French: aifrir ⟹ aifrît
German: einsraftließen ⟹ sraftlössest ein
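The augmentation step can be sketched as follows. The rule format (an MSD string plus a prefix change and a suffix change, each a pair of lemma-side and form-side strings), the function name `augment`, and the toy alphabet are assumptions made for this illustration; the thresholds 10, 5, and 1 to 4 are the ones stated above.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Change = Tuple[str, str]            # (part in lemma, part in target form)
Rule = Tuple[str, Change, Change]   # (MSD, prefix change, suffix change)


def augment(rules: Sequence[Rule], alphabet: str, rng: random.Random,
            max_count: int = 10, per_rule: int = 5) -> List[Tuple[str, str, str]]:
    """Create artificial (lemma, form, MSD) triples for infrequent rules.

    `rules` holds one extracted rule per training instance, so repeated
    rules appear several times. A random string of length 1 to 4 stands in
    for the stem; no language model scores the result.
    """
    created = []
    for (msd, prefix, suffix), count in Counter(rules).items():
        if count >= max_count:
            continue  # frequent rules are not augmented
        for _ in range(per_rule):
            length = rng.randint(1, 4)
            stem = "".join(rng.choice(alphabet) for _ in range(length))
            lemma = prefix[0] + stem + suffix[0]
            form = prefix[1] + stem + suffix[1]
            created.append((lemma, form, msd))
    return created


rare = ("V;PST", ("w", "w"), ("k", "ked"))
frequent = ("V;PST", ("i", "i"), ("e", "ed"))
extra = augment([rare] * 3 + [frequent] * 10, "abcdefghij", random.Random(0))
print(len(extra))
```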

Experiment Protocol
We evaluate the performance of the systems using average accuracy. Accuracy is the ratio of correctly predicted target forms to the total number of questions. Please refer to Formula 3 for the exact definition (see footnote 3).
\[
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \delta(\text{predicted}_i = \text{target}_i) \qquad (3)
\]
We carry out experiments using the training dataset and measure the accuracies on the dev dataset for all languages and all training dataset sizes. The system with the highest score is picked as our representative system in the test phase for that particular language and dataset size. Table 2 shows the average accuracy over all languages for each of the systems. Our holistic approach performs as well as the baseline system, even slightly better, under all three dataset sizes. This matches the observation made in (Fam and Lepage, 2017a) on the previous year's dataset.
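The evaluation protocol can be sketched in a few lines. The function names and the example scores below are made up for illustration; accuracy is computed as the share of exactly matching strings, and the representative system is the one with the highest dev accuracy.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Share of predicted forms that are string-identical to the target forms."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def pick_system(dev_accuracy: Dict[str, float]) -> str:
    """Return the name of the system with the highest dev accuracy."""
    return max(dev_accuracy, key=dev_accuracy.get)


# Invented scores for one language and one dataset size.
print(accuracy(["walked", "goed", "sang"], ["walked", "went", "sang"]))
print(pick_system({"baseline": 0.50, "holistic": 0.60, "seq2seq": 0.40}))
```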

Results
The results show that the neural approach using the seq2seq model outperforms both the baseline system and the holistic approach in the medium and high data situations. The gap is around 15 accuracy points. However, the lack of training data exposes the drawback of the neural approach, as it performs poorly in the low data situation. Furthermore, the use of data augmentation improves the performance in most cases. On the low dataset, accuracy becomes around three times higher, although it still cannot match the performance of either the baseline or the holistic approach.
The baseline system and the holistic approach outperform the neural approach particularly for languages like Albanian, Czech, Haida, Neapolitan, Norwegian-Bokmaal, and Uzbek. Our seq2seq model seems to struggle even in the high data situation for some of these languages. On the other hand, our seq2seq model gets better accuracy than the baseline system or the holistic approach even in the low data situation for some languages: Azeri, Basque, Breton, Cornish, Greenlandic, Hindi, Karelian, Khaling, Maltese, Middle-Low-German, Middle-High-German, Murrinhpatha, Norman, North-Frisian, Persian, Swahili, Turkish, Turkmen, Welsh, and Zulu.
The same trend can be seen in the results for related languages, like the Romance (Catalan, Galician, Portuguese, and Spanish), Semitic (Arabic and Hebrew), and Baltic (Latvian and Lithuanian) languages. The baseline system leads on the low dataset size before being outperformed by our seq2seq model on the bigger dataset sizes. For other language families like the Indo-Aryan (Bengali, Hindi, Urdu), Finnic (Estonian and Finnish), and Turkic (Turkish and Turkmen) languages, our seq2seq model steadily leads for all dataset sizes. Please refer to Table 3 for detailed results per language.
3 N is the total number of questions. δ(A = B) equals 1 if the two strings A and B are the same, and 0 otherwise.

Discussion
The results for the baseline system and our holistic approach show that it is not necessary to break words down into morphemes. The derivation between lemma and target form can also be acquired through analogy. However, selecting the candidates for constructing the analogical equation is crucial. Thus, we need to improve our selection method or use better heuristic features. To handle the problem of unseen MSD patterns, the use of formal concept analysis (Ganter and Wille, 1999) is worth considering.
Table 3: Accuracy scores on the development set (dev) in each language for the baseline system (B), the holistic approach (H), and our seq2seq model without data augmentation (S) and with data augmentation (S+Aug).
The improvement obtained with data augmentation seems promising. One may consider increasing the amount of artificially created additional training data. However, there is a trade-off between performance and training time. Another thing to consider is how much additional training data should be created: data augmentation no longer seems to improve the performance in the high data situation. In addition, the current method to extract the affix rules is very simple. Although it may capture circumfixes, it is still strongly biased towards prefixing and suffixing only. A better method is expected to also capture other phenomena, such as parallel infixing (Arabic), repetition (Greek), and reduplication (Malay, Indonesian).

Conclusion
We developed several systems for the morphological inflection task. The first one is based on a holistic approach: we generate the target forms by solving analogical equations on words. The second one is a seq2seq neural network model. A simple data augmentation is also implemented to help in the low data situation. We evaluated their performance on the development dataset and chose the best system for each language and dataset size as our representative system for the submission.
Experimental results show that the neural approach using the seq2seq model has the best performance in most cases in the medium and high data situations. However, in the low data situation, the baseline and our holistic approach perform on par with each other and ahead of the neural approach.