Phonological Features for Morphological Inflection

Modeling morphological inflection is an important task in Natural Language Processing. In contrast to earlier work that has largely used orthographic representations, we experiment with this task in a phonetic character space, representing inputs as either IPA segments or bundles of phonological distinctive features. We show that both of these inputs, somewhat counterintuitively, achieve similar accuracies on morphological inflection, slightly lower than orthographic models. We conclude that providing detailed phonological representations is largely redundant when compared to IPA segments, and that articulatory distinctions relevant for word inflection are already latently present in the distributional properties of many graphemic writing systems.


Introduction
Models of morphology are important to many tasks in Natural Language Processing, but also present new challenges of their own. Morphologically complex languages require analysis that is often only captured at the morpheme level, but is essential for syntactic or semantic representations. This requires effective morphological analysis, which often receives less attention than other subfields of Natural Language Processing. One relevant task in morphology is that of morphological inflection: automatically generating the inflected form of a lemma according to a given morphological specification. An example of this in English is walk + 3 + SG + PRES → walks. There has been recent success in adapting the encoder-decoder architecture (Kann and Schütze, 2016), which has been effective in machine translation, to this task.
In this work, we explore representing the inputs to such an encoder-decoder model for morphological inflection in two additional ways: IPA segments and bundles of phonological distinctive features. Representing the inputs to an inflection model in phonetic space can unify the character inventory between languages with separate orthographies. The shared character inventory could also enable transfer learning in some instances where it otherwise would be impossible. There are also confusing idiosyncrasies in some orthographies that are not necessarily present in an IPA representation. For example, there are many instances of orthographic gemination in English that do not occur in the phonetic realizations of such words, as in control → controlled. English also exhibits several examples of the same sound expressed by completely different orthographic realizations, as in fly → flies, or conversely of the same spelling expressing different sounds, as in arch (/tʃ/) ∼ monarch (/k/). Furthermore, a phonetic representation serves as an interface to an even richer representation of characters: phonological distinctive features.
We explore this by representing each IPA segment in a sequence as the combination of its distinctive features. This is potentially useful because (1) a model can learn representations for a fixed set of distinctive features, rather than for each unique IPA segment, and (2) the differences between similar phonemes should be more readily apparent in the distinctive feature representations than the IPA representations. When tasked with generating the past tense of the English verb "stop", transcribed as /stɑp/, a model may need to choose between /t/ and /d/ as past tense suffixes, having seen such examples as "kick": /kɪk/ → /kɪkt/, or "rig": /ɹɪg/ → /ɹɪgd/. Rather than the model needing to learn good representations for both /p/ and /k/ as unrelated segments that precede a /t/ in the past tense, a phonological distinctive feature representation would explicitly capture that they share the feature [−voice]. This encourages a model to more quickly find the parameters that correctly generate this voicing assimilation, and produce the form /stɑpt/. That is, the model that learns from phonological features should quickly be able to generalize that this English past tense is realized as /t/ before voiceless segments. Similarly, in the example of "rob": /ɹɑb/ → /ɹɑbd/, the generated /d/ can be conditioned on [+voice] rather than the individual segment /b/.
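The voicing generalization described above can be illustrated with a minimal sketch. The `VOICE` table is a hypothetical, hand-written fragment rather than output from any library, and the rule deliberately ignores other allomorphs of the English past tense; it only shows how a single [±voice] value, rather than segment identity, can condition the suffix.

```python
# Hypothetical voicing values for a few stem-final segments
# (+1 = [+voice], -1 = [-voice]); not taken from any library.
VOICE = {"p": -1, "k": -1, "b": 1, "g": 1}


def past_tense_suffix(stem: str) -> str:
    """Return /t/ after a voiceless final segment and /d/ after a voiced one."""
    final_segment = stem[-1]
    return "t" if VOICE[final_segment] == -1 else "d"


def inflect_past(stem: str) -> str:
    """Attach the suffix chosen by the voicing of the final segment."""
    return stem + past_tense_suffix(stem)


if __name__ == "__main__":
    for stem in ["stɑp", "kɪk", "ɹɪg", "ɹɑb"]:
        print(stem, "->", inflect_past(stem))
```

A rule keyed on segment identity would need one entry per final consonant, whereas the feature-based rule covers /p/ and /k/ with a single condition.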
An alternative hypothesis is that the proposed distinctive feature representation may, however, not have such a profound effect on the inflection model. This is because distributional representations of IPA segments or phonemic graphemes have been shown to capture good approximations of the distinctive feature space (Silfverberg et al., 2018). In order to test these two hypotheses, we experiment on a subset of data provided by task 1 of the CoNLL-SIGMORPHON 2017 Shared Task on Universal Morphological Reinflection (Cotterell et al., 2017), which introduced 42 more languages than the year before (Cotterell et al., 2016) for a total of 52 languages. We use an existing tool to perform G2P on the data, and, as a second step, to produce distinctive feature vectors from the resulting IPA segments. We evaluate the resulting models on their ability to generate IPA segments.
Related Work
Phonetic distributional vectors have been explored for their effectiveness in several NLP applications, especially for informing scenarios that utilize borrowing or transfer learning (Tsvetkov et al., 2016). Phonological distinctive features have also been successfully used to inform NER. However, to our knowledge, there is no work on learning distributional properties of phonological features that compares them directly to vectors of IPA segments.

Encoder-Decoder Architecture
Our model is implemented as an RNN encoder-decoder with attention, built to imitate the model introduced by Kann and Schütze (2016), variations of which found much success in the 2017 CoNLL-SIGMORPHON shared task. The system, pictured in Figure 1, works by learning an encoder RNN over a sequence of embeddings for the input characters or morphological tags. In practice, the encoder is bidirectional. The decoder RNN is initialized with a sequence boundary token, and each state of the decoder is predicted based on the state of the previous timestep, the previous output embedding, and all of the encoder states $e_i \in \mathrm{Encoder}$. We then use an attention mechanism to 'attend' over the encoder states, assigning a score to each $e_i$ given the previous decoder state $d_{j-1}$. The scoring function (Luong et al., 2015) is calculated from the concatenation $[e_i; d_{j-1}]$ and a parameter matrix $W$ that is learned during training, where $[x; y]$ indicates the concatenation of $x$ and $y$. These scores are then normalized by applying a softmax over all encoder states in Encoder to compute each weight $\alpha_{i,j-1}$. Finally, the attention vector is computed as the weighted mean of all encoder states according to their normalized scores: $A(d_{j-1}, E) = \sum_{i=0}^{n} \alpha_{i,j-1} e_i$, which is concatenated to the previously decoded embedding before being passed through the decoder. We implement this model in PyTorch (Paszke et al., 2017), using Gated Recurrent Units (GRU) for the encoder and decoder, and optimize with stochastic gradient descent.
Figure 1: The encoder-decoder with an attention mechanism used for morphological inflection
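The score, softmax, and weighted-sum steps of the attention mechanism can be sketched in plain Python as follows. This is an illustration, not the paper's PyTorch code: the exact form of the scoring function is not recoverable from the text, so the sketch assumes the score is a learned weight row `w` applied to the concatenation of an encoder state and the previous decoder state. The names `attention`, `softmax`, and `w` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(encoder_states, prev_decoder_state, w):
    """Return (weights, attention vector) for one decoding step.

    Assumption: score_i = w . [e_i ; d_{j-1}], with w a single learned row.
    """
    scores = [
        sum(wk * xk for wk, xk in zip(w, e_i + prev_decoder_state))
        for e_i in encoder_states
    ]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    # Weighted sum of encoder states using the normalized scores.
    context = [
        sum(a * e_i[k] for a, e_i in zip(alphas, encoder_states))
        for k in range(dim)
    ]
    return alphas, context


if __name__ == "__main__":
    enc = [[1.0, 0.0], [0.0, 1.0]]
    dec = [0.5, -0.5]
    alphas, ctx = attention(enc, dec, w=[1.0, 0.0, 0.0, 0.0])
    print(alphas, ctx)
```

In the model described above, the resulting attention vector would then be concatenated to the previously decoded embedding before the decoder step.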

Embedding Inputs
The inputs to this model are sequences of character and tag embeddings. To this end, each Unicode character codepoint or tag is a one-hot vector $c$, and an embedding matrix $E \in \mathbb{R}^{|\Sigma| \times n}$ is learned to store the parameters that map the $|\Sigma|$-dimensional one-hot vectors to $n$-dimensional dense vectors, where $\Sigma$ is the character and tag vocabulary. Similarly, we use a matrix $I \in \mathbb{R}^{|\Sigma_{IPA}| \times n}$ for embedding IPA segments, where $\Sigma_{IPA}$ is the IPA segment and tag vocabulary. To produce the IPA sequence, we use the Python library Epitran, which performs rule-based G2P on language-specific mappings (Mortensen et al., 2018).
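As a rough illustration of the one-hot view of embedding, the sketch below builds a vocabulary over characters and tags and checks that multiplying a one-hot vector by an embedding matrix selects a single row. The tag spellings (`<3>`, `<SG>`, `<PRES>`), the toy matrix values, and all function names are assumptions made for the example, not details from the paper.

```python
def build_vocab(symbols):
    """Assign each distinct character or tag an integer index."""
    vocab = {}
    for symbol in symbols:
        vocab.setdefault(symbol, len(vocab))
    return vocab


def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1 if k == index else 0 for k in range(size)]


def vec_mat(c, E):
    """Multiply a row vector c (length |Sigma|) by a |Sigma| x n matrix E."""
    return [sum(c[i] * E[i][k] for i in range(len(c))) for k in range(len(E[0]))]


# Input sequence for walk + 3 + SG + PRES: characters followed by tags.
symbols = list("walk") + ["<3>", "<SG>", "<PRES>"]
vocab = build_vocab(symbols)

# Toy embedding matrix with n = 2; in a real model these values are learned.
E = [[float(i), float(i) * 0.5] for i in range(len(vocab))]

indices = [vocab[s] for s in symbols]
embedded = [vec_mat(one_hot(i, len(vocab)), E) for i in indices]

if __name__ == "__main__":
    for symbol, vector in zip(symbols, embedded):
        print(symbol, vector)
```

Because each one-hot product picks out one row, implementations normally use a direct row lookup instead of an explicit multiplication.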
We then use the Python library PanPhon, which maps IPA segments to features as in Figure 2, to obtain vectors of phonological distinctive features. The features are represented numerically, whereby each index of the vector corresponds to a specific feature such as [±coronal] and stores a value from the set {1, 0, -1}. These values correspond to 'exhibits feature', 'unspecified for given class of sounds', and 'does not exhibit feature', respectively. In practice we map -1 to 0 to obtain strictly binary feature vectors. Now, each IPA segment can be represented as a vector $v$ which has a 1 for each feature that it exhibits, and a 0 otherwise. The embedding matrix $F \in \mathbb{R}^{|p| \times n}$, where $p$ is the set of all features, tags, and symbols, is no longer just a lookup for IPA segments. Tags are still one-hot vectors, and symbols are one-hot vectors for any character that has no phonological features (e.g. a space or an apostrophe). The vector for an IPA segment, in contrast, is multi-hot: it has a 1 for each feature that the segment exhibits.
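The layout of these input vectors can be sketched as below. The three feature names, the single tag, and the two symbols are a hypothetical miniature of the full index set $p$, and the ternary values are written by hand rather than obtained from PanPhon.

```python
# Hypothetical miniature of the index set p: features, then tags, then symbols.
FEATURES = ["syl", "voi", "cor"]
TAGS = ["<PST>"]
SYMBOLS = [" ", "'"]
INDEX = {name: i for i, name in enumerate(FEATURES + TAGS + SYMBOLS)}


def segment_vector(ternary_features):
    """Map ternary feature values {1, 0, -1} to a binary multi-hot vector.

    Both 0 (unspecified) and -1 (does not exhibit) become 0.
    """
    v = [0] * len(INDEX)
    for name, value in ternary_features.items():
        if value == 1:
            v[INDEX[name]] = 1
    return v


def one_hot(symbol_or_tag):
    """Tags and featureless symbols keep an ordinary one-hot vector."""
    v = [0] * len(INDEX)
    v[INDEX[symbol_or_tag]] = 1
    return v


if __name__ == "__main__":
    # Hand-written values for a voiced, non-syllabic segment.
    print(segment_vector({"syl": -1, "voi": 1, "cor": 0}))
    print(one_hot("<PST>"))
```

One consequence of mapping -1 to 0 is that 'unspecified' and 'does not exhibit' become indistinguishable in the input.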
The operation $vF$ is equivalent to summing each row $F_i$ for which $v_i = 1$. In this way, an IPA segment is the sum of all of its distinctive feature embeddings. In practice, we can take the matrix-matrix product of the entire sequence of feature vectors and $F$ to calculate the matrix that represents a sequence of embeddings. The overall workflow involves passing from orthographic input sequences, through Epitran, and then PanPhon, and finally to phonological distinctive feature embeddings.
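The equivalence between the product $vF$ and a sum of selected rows can be checked on a toy example. The matrix values and function names here are illustrative assumptions; a real model would learn $F$ and would typically use a tensor library for the product.

```python
def vec_mat(v, F):
    """Compute the row-vector by matrix product vF."""
    return [sum(v[i] * F[i][k] for i in range(len(v))) for k in range(len(F[0]))]


def sum_selected_rows(v, F):
    """Sum the rows F_i for which v_i == 1."""
    out = [0.0] * len(F[0])
    for i, v_i in enumerate(v):
        if v_i == 1:
            for k in range(len(out)):
                out[k] += F[i][k]
    return out


def embed_sequence(vectors, F):
    """Matrix-matrix product: one embedding row per input vector."""
    return [vec_mat(v, F) for v in vectors]


# Toy embedding matrix with 3 features and n = 2.
F = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v = [1, 0, 1]  # a segment exhibiting the first and third feature

if __name__ == "__main__":
    print(vec_mat(v, F))            # sum of rows 0 and 2
    print(sum_selected_rows(v, F))  # same result
    print(embed_sequence([v, [0, 1, 0]], F))
```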

Experiments and Results
We evaluate these models on 8 languages that are at the intersection of CoNLL-SIGMORPHON data, Epitran, and PanPhon supported languages, selected to exhibit typological diversity. Following the shared task data, the languages are split into 2 training settings: Medium (∼1,000 training examples) and High (∼10,000 training examples). The languages and their accuracies are given in Table 1. In the high data setting using orthographic inputs, our implementation performed comparably to the best shared task systems for each language. The slight degradation in performance can be attributed to the fact that we did not use ensemble voting, as the top performing systems in the shared task did (Cotterell et al., 2017), and that this is a comparison to the maximum score of 25 systems per language, which increases the likelihood that the optimal initialization will have been found. In the medium setting, the difference in accuracy is much more apparent. This is due to the fact that all of the top performing systems in the shared task also used either some type of data augmentation method (Zhou and Neubig, 2017; Silfverberg et al., 2017; Sudhakar and Kumar, 2017; Bergmanis et al., 2017), a hard alignment method (Makarov et al., 2017), or both (Nicolai et al., 2017). These results illustrate the common observation that neural systems require a large amount of data to be very accurate, which can be partially addressed by artificially expanding the training data, or by enforcing some copy bias in the system.
Table 2: Ensemble Oracles for each language. If the correct word form is predicted by any of the 3 models, then it is classified as correct. This is compared against 3 text models.
For both phonetic representation experiments, the decoded outputs are in the inventory of IPA segments, the gold standard of which comes from the deterministic mappings implemented in Epitran. This means that they differ only in terms of the input representation in the encoder. Models trained on both IPA and feature inputs perform comparably to the text model in both the medium and the high setting. There are two main points of interest in the results. (1) The lower performance on average of the IPA and feature models when compared to the text model is almost exclusively due to differences in accuracy for German and English. We attribute this in part to the fact that the orthography of English is often dissimilar to its pronunciation, and that these orthographies reflect etymological information which is useful in determining a word's inflectional behavior (Scragg, 1974). An example of the discrepancy between spelling and pronunciation is that the English vowel space has about 13 phonetic vowels (Ladefoged and Johnson, 2014), whereas the orthographic alphabet has only 5 vowel letters. Furthermore, the unstressed vowel, schwa (/ə/), can essentially replace any vowel in an unstressed context. We observe that the majority of inaccuracies in the English predictions are related to vowels, and most commonly to a schwa. This indicates that converting the character space to IPA can introduce some new complications. Regarding German, there is no obvious explanation for the lower accuracy, and we believe that a more detailed analysis of the G2P performance is needed in order to explore this. Experiments on orthographically and morphophonemically similar languages may also be revealing.
An Ensemble Oracle of all three models is given in Table 2 in order to check if the systems vary in what they learn to predict. The results show that this ensemble outperforms each individual system for any given language. However, when compared to an Ensemble Oracle of three text models, the results are rather similar. The increase in accuracy may simply be due to varying parameters from different random initializations, yielding an effect that is similar to the boosted scores that can be observed in many of the shared task results.
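The oracle criterion, under which an item counts as correct if any of the models produces the gold form, can be sketched as below. The function name and the toy predictions are hypothetical and are not taken from the paper's data.

```python
def oracle_accuracy(gold, predictions_per_model):
    """Fraction of items for which at least one model matches the gold form.

    gold: list of gold word forms.
    predictions_per_model: one list of predictions per model, aligned with gold.
    """
    correct = 0
    for i, gold_form in enumerate(gold):
        if any(preds[i] == gold_form for preds in predictions_per_model):
            correct += 1
    return correct / len(gold)


if __name__ == "__main__":
    gold = ["a", "b", "c", "d"]
    text_model = ["a", "x", "x", "x"]
    ipa_model = ["x", "b", "x", "x"]
    feature_model = ["a", "x", "c", "x"]
    print(oracle_accuracy(gold, [text_model, ipa_model, feature_model]))
```

By construction the oracle score is at least as high as the best individual model, which is why the comparison against an oracle of three text models serves as the relevant baseline.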
More interesting is the fact that (2) both the IPA and feature representation seem to yield extremely similar accuracies, with a paired permutation test p-value of 0.43 over all languages. Even when the training data is rather sparse, as in the medium setting, the accuracies remain extremely similar. This suggests that the distributional properties of IPA segments capture the information expressed by distinctive features. Any benefit that representing a segment in terms of its features might have is already available in the IPA embeddings. To further compare these representations, we experiment with models that combine the IPA and feature representations. We attempt to simply add a 'feature' to the distinctive feature vectors for each IPA segment. That is, the feature vector for /ə/ would have a 1 for all of its distinctive features, and an additional 1 for that specific segment. We also experiment with concatenation of the embedding found from the feature vector combination and the IPA embedding. The input to the model is a vector of double the embedding size to account for concatenation. The results, given in Table 3, show that neither experiment seems to have much effect, and the accuracies reflect the initial results.
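A paired permutation test of the kind mentioned above can be sketched as follows. The text does not specify the exact procedure, so this version makes assumptions: it is two-sided, it uses the absolute sum of paired differences as the test statistic, and it enumerates every sign flip exactly, which is feasible only for a small number of pairs. The accuracy values in the usage example are invented.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every assignment of signs is enumerated and the p-value is the share of
    assignments whose statistic is at least as extreme as the observed one.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


if __name__ == "__main__":
    # Invented per-language accuracies for two input representations.
    ipa = [90.1, 85.3, 77.0, 95.2]
    features = [89.8, 85.9, 76.5, 95.4]
    print(paired_permutation_test(ipa, features))
```

With many pairs, the exhaustive loop would usually be replaced by random sampling of sign assignments.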

Conclusion and Future Work
We have experimented with morphological inflection on 8 different languages and compared results between an input space of IPA segments, and one represented as bundles of phonological distinctive features. The results show that both types of inputs behave similarly. This indicates that the distributional properties of IPA segments align with those found by phonological distinctive features, at least to the extent that articulatory information is relevant to inflection. Furthermore, when compared to a baseline of a purely orthographic space, it is evident that for many languages the results are still mostly redundant, and if there is a large discrepancy in accuracy it is in favor of the orthographic inputs.
There is still work to be done to explore whether there are scenarios where bundles of distinctive features provide an advantage. For example, in the case of transfer learning where the phonology of a language is known, it becomes possible to approximate vector representations for unseen segments from their features. Similarly, distinctive features may be better at representing segments that rarely appear in a training set for a given language.