IT–IST at the SIGMORPHON 2019 Shared Task: Sparse Two-headed Models for Inflection

This paper presents the Instituto de Telecomunicações–Instituto Superior Técnico submission to Task 1 of the SIGMORPHON 2019 Shared Task. Our models combine sparse sequence-to-sequence models with a two-headed attention mechanism that learns separate attention distributions for the lemma and inflectional tags. Among submissions to Task 1, our models rank second and third. Despite the low data setting of the task (only 100 in-language training examples), they learn plausible inflection patterns and often concentrate all probability mass into a small set of hypotheses, making beam search exact.


Introduction
Morphological inflection is the task of producing an inflected form, given a lemma and a set of inflectional tags. A widespread approach to the task is the attention-based sequence-to-sequence model (seq2seq; Bahdanau et al., 2015; Kann and Schütze, 2016); such models perform well but are difficult to interpret. To mitigate this shortcoming, we employ an alternative architecture which combines sparse seq2seq modeling (Peters et al., 2019) with two-headed attention that attends separately to the lemma and inflectional tags (Ács, 2018). The attention and output distributions are computed with the sparsemax function and models are trained to minimize sparsemax loss (Martins and Astudillo, 2016). Sparsemax, unlike softmax, can assign exactly zero attention weight to irrelevant source tokens and exactly zero probability to implausible hypotheses. We apply our models to Task 1 at the SIGMORPHON 2019 Shared Task (McCarthy et al., 2019), which extends morphological inflection to a cross-lingual setting. We present two sparse seq2seq architectures:
• DOUBLEATTN (it-ist-01-1) is a reimplementation of the two-headed attention model (Ács, 2018) which substitutes sparsemax and its loss for softmax and cross entropy loss. It uses separate encoders and attention heads for the lemma and inflections, and concatenates the outputs of the attention heads.
• GATEDATTN (it-ist-02-1) replaces the attention concatenation with a sparse gate which interpolates the lemma and inflection attention. The intuition is that the lemma and inflectional tags are not likely to be equally important at all time steps. For example, in a suffixing language, the first several generated characters are likely to be identical to the lemma; inflectional tags are not relevant. The sparse gate allows the model to learn to shift focus between the two attentions while ignoring the other at a given time step.
GATEDATTN and DOUBLEATTN rank second and third, respectively, among submissions to Task 1. In addition, their behavior is highly interpretable: they mostly learn to attend to a single lemma hidden state at a time, progressing monotonically from left to right, while their inflection attention learns patterns which reflect underlying morphological structure. The sparse output layer often allows the model to concentrate all probability mass into a single hypothesis, providing a certificate that decoding is exact. Our analysis also finds that sparsity is highly predictive of performance on the shared task metrics, which suggests that the models "know what they know".

Models
Our architecture is mostly the same as a standard RNN-based seq2seq model with attention. In this section, we outline the changes needed to extend this model to use sparsemax and two-headed attention in a multilingual setting.

Sparsemax
Our models' sparsity comes from the sparsemax function (Martins and Astudillo, 2016), which computes the Euclidean projection of a vector $z \in \mathbb{R}^n$ onto the n-dimensional probability simplex $\triangle^n := \{p \in \mathbb{R}^n : p \ge 0, \mathbf{1}^\top p = 1\}$:

$$\mathrm{sparsemax}(z) := \operatorname*{argmin}_{p \in \triangle^n} \|p - z\|^2$$

Like softmax, sparsemax converts an arbitrary real-valued vector into a probability distribution. The critical difference is that sparsemax can assign exactly zero probability, whereas softmax outputs are strictly positive. Sparsemax is differentiable almost everywhere and can be computed quickly, allowing its use as a drop-in replacement for softmax. It has previously been used in seq2seq for computing both attention weights (Malaviya et al., 2018) and output probabilities (Peters et al., 2019). Sparse attention is attractive in morphological inflection because it resembles hard attention, which has been successful on the task (Aharoni and Goldberg, 2017; Wu et al., 2018).
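As a minimal illustration of the projection, the sketch below computes sparsemax in plain Python with the sort-based threshold procedure of Martins and Astudillo (2016) and contrasts it with softmax. It is not the code used for the submission; the function names and the toy scores are assumptions made for this example.

```python
import math


def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    total, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(z, reverse=True), start=1):
        total += z_k
        # The k-th largest score is in the support while 1 + k * z_(k) > sum of top-k scores.
        if 1.0 + k * z_k <= total:
            break
        tau = (total - 1.0) / k  # threshold subtracted from every score
    return [max(z_i - tau, 0.0) for z_i in z]


def softmax(z):
    """Standard softmax, shown only for comparison."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    s = sum(exps)
    return [e / s for e in exps]


scores = [1.0, 0.8, -1.0]  # toy scores, chosen for illustration
sparse_p = sparsemax(scores)
dense_p = softmax(scores)
```

With these toy scores, the lowest-scoring entry receives exactly zero probability under sparsemax, while every softmax entry stays strictly positive.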

Encoder-Decoder Model
Multilingual embeddings Each encoder and decoder uses an embedding layer to convert one-hot token representations into dense embeddings. To account for the bilingual nature of Task 1, each of our embedding layers contains two look-up tables: one for the sequence of input tokens, and the other for the language of the sequence. At each time step, the current token's embedding is concatenated to a language embedding. (The language embedding is the same at all time steps within an example; there is no code-switching in this task.) Each encoder and decoder uses a separate embedding layer; no weights are tied. Characters and inflectional tags use embeddings of size $D_c$, while language embeddings are of size $D_\ell$. Thus the total embedding size is $D = D_c + D_\ell$.
Encoders The lemma and inflection encoders are both bidirectional LSTMs (Graves and Schmidhuber, 2005). An encoder's forward and backward hidden states are concatenated, forming a sequence of source hidden states. We set the size of these hidden states to $D$ in all experiments.
Decoder The decoder is a unidirectional LSTM (Hochreiter and Schmidhuber, 1997) with input feeding (Luong et al., 2015). At time step $t$, it computes a hidden state $s_t \in \mathbb{R}^D$. Conditioned on $s_t$ and the hidden state sequences from the lemma and inflection encoders, a two-headed attention mechanism then computes an attentional hidden state $\tilde{s}_t \in \mathbb{R}^D$. The decoder LSTM is initialized only with the lemma encoder's state.
Attention head At time $t$, an attention head computes a context vector $c_t \in \mathbb{R}^D$ conditioned on the decoder state $s_t$ and an encoder state sequence $H = [h_1, \dots, h_J]$. A head consists of two modular components:
1. Alignment Compute a vector $a \in \mathbb{R}^J$ of alignment scores between $s_t$ and $H$. We use the general attention scorer (Luong et al., 2015), which computes $a_j := s_t^\top W_a h_j$.
2. Context Compute the context vector $c_t$ as a weighted sum of $H$: $c_t := \sum_{j=1}^{J} \pi_j h_j$, where $\pi = \mathrm{sparsemax}(a)$ is a sparse vector of attention weights in the simplex.
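The two components of an attention head can be sketched in a few lines of plain Python. This is an illustrative toy, not the submission's PyTorch code: vectors are lists of floats, `attention_head` and its argument names are hypothetical, and the small `sparsemax` helper is included only so that the block runs on its own.

```python
def sparsemax(z):
    """Sort-based sparsemax (Martins and Astudillo, 2016)."""
    total, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(z, reverse=True), start=1):
        total += z_k
        if 1.0 + k * z_k <= total:
            break
        tau = (total - 1.0) / k
    return [max(z_i - tau, 0.0) for z_i in z]


def attention_head(s_t, H, W_a):
    """Return (context vector, attention weights) for one decoder step.

    s_t: decoder state, length D
    H:   list of J encoder states, each of length D
    W_a: D x D matrix given as a list of rows
    """
    D = len(s_t)
    # Alignment: a_j = s_t^T W_a h_j, computed as (s_t^T W_a) . h_j
    q = [sum(s_t[i] * W_a[i][k] for i in range(D)) for k in range(len(W_a[0]))]
    a = [sum(q_k * h_k for q_k, h_k in zip(q, h_j)) for h_j in H]
    # Context: sparse convex combination of the encoder states
    pi = sparsemax(a)
    c_t = [sum(pi_j * h_j[k] for pi_j, h_j in zip(pi, H)) for k in range(len(H[0]))]
    return c_t, pi


# Toy inputs: identity scorer and three encoder states.
c_t, pi = attention_head(
    s_t=[1.0, 0.0],
    H=[[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    W_a=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy case the first encoder state scores far above the others, so it receives all of the attention weight and the context vector is simply that state.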

In Luong et al.'s single-headed attention, the attentional hidden state $\tilde{s}_t$ is computed by a concatenation layer from the context vector and pre-attention hidden state. However, our two-headed attention mechanism produces two context vectors, so $\tilde{s}_t$ must be computed differently. We use two different formulations, which we describe next.
DOUBLEATTN uses the same strategy for combining multiple context vectors as Ács (2018): the lemma and inflection context vectors $u_t$ and $v_t$ and the target hidden state $s_t$ are inputs to a concatenation layer:

$$\tilde{s}_t = \tanh(W_s [u_t; v_t; s_t])$$

GATEDATTN, on the other hand, computes separate candidate attentional hidden states for the two context vectors:

$$\tilde{s}_{u_t} = \tanh(W_u [u_t; s_t]), \qquad \tilde{s}_{v_t} = \tanh(W_v [v_t; s_t])$$

where $W_u, W_v \in \mathbb{R}^{D \times 2D}$. We define gate weights $W_g \in \mathbb{R}^{2 \times 3D}$ and gate bias $b_g \in \mathbb{R}^2$ and use a sparse gate to compute weights $p_t \in \triangle^2$ for the two candidate states:

$$p_t = \mathrm{sparsemax}(W_g [u_t; v_t; s_t] + b_g)$$

We then stack the two candidate states $\tilde{s}_{u_t}$ and $\tilde{s}_{v_t}$ into a matrix $\tilde{S}_t \in \mathbb{R}^{2 \times D}$ and use the gate weights to compute $\tilde{s}_t$ as a weighted sum of them:

$$\tilde{s}_t = \tilde{S}_t^\top p_t$$

Just as a two-dimensional softmax is equivalent to a sigmoid, this two-dimensional sparsemax is a hard sigmoid, as was pointed out by Martins and Astudillo (2016). It provides extra interpretability in the form of a three-way answer about what is relevant at a time step: the lemma, the inflections, or both.
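The hard-sigmoid view of the gate can be made concrete with a short sketch. It assumes the two pre-activation gate scores are already available as floats (called `g_u` and `g_v` here, names invented for this example) and uses the closed form of two-dimensional sparsemax; the function names are hypothetical and the code is illustrative rather than the submission's implementation.

```python
def sparse_gate(g_u, g_v):
    """Two-way sparsemax over the scores (g_u, g_v), written as a hard sigmoid."""
    p_u = min(max((g_u - g_v + 1.0) / 2.0, 0.0), 1.0)
    return p_u, 1.0 - p_u


def gated_state(p, s_u, s_v):
    """Weighted sum of the lemma and inflection candidate states."""
    p_u, p_v = p
    return [p_u * a + p_v * b for a, b in zip(s_u, s_v)]


def gate_reading(p):
    """Three-way interpretation of the gate at one time step."""
    p_u, p_v = p
    if p_v == 0.0:
        return "lemma"
    if p_u == 0.0:
        return "inflections"
    return "both"


# Toy candidate states and gate scores, chosen for illustration.
s_u, s_v = [1.0, 0.0], [0.0, 1.0]
lemma_only = sparse_gate(2.0, 0.0)   # score gap of at least 1: all weight on the lemma
tags_only = sparse_gate(0.0, 3.0)    # all weight on the inflections
mixed = sparse_gate(0.2, 0.0)        # score gap below 1: both heads contribute
```

Whenever the two scores differ by at least 1, one candidate state is ignored entirely, which is what yields the three-way reading of the gate.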
Sparse outputs After the attentional hidden state is computed, an output layer computes scores for each output type $z = W_z \tilde{s} + b_z$. These are then converted into a sparse probability distribution $p = \mathrm{sparsemax}(z)$. The model is trained to minimize the sparsemax loss (Martins and Astudillo, 2016), defined as

$$L_{\mathrm{sparsemax}}(y, z) := \tfrac{1}{2}\left(\|e_y - z\|^2 - \|p - z\|^2\right)$$

where $y$ is the index of the gold target and $e_y$ is a one-hot vector. The sparsemax loss is differentiable, convex, and has a margin, and its gradient is sparse. Although softmax-based models use the cross entropy loss, this is not possible for our models because the cross entropy loss is infinite when the model assigns zero probability to the gold target.
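To make the loss and its margin concrete, the sketch below evaluates the sparsemax loss in the form $\tfrac{1}{2}(\|e_y - z\|^2 - \|p - z\|^2)$ with $p = \mathrm{sparsemax}(z)$, together with its gradient with respect to the scores, $p - e_y$ (Martins and Astudillo, 2016). It is a plain-Python illustration under those definitions, with hypothetical function names, and not the training code used for the submission.

```python
def sparsemax(z):
    """Sort-based sparsemax (Martins and Astudillo, 2016)."""
    total, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(z, reverse=True), start=1):
        total += z_k
        if 1.0 + k * z_k <= total:
            break
        tau = (total - 1.0) / k
    return [max(z_i - tau, 0.0) for z_i in z]


def sparsemax_loss(z, y):
    """Return (loss, gradient w.r.t. z) for scores z and gold index y."""
    p = sparsemax(z)
    e_y = [1.0 if i == y else 0.0 for i in range(len(z))]

    def sq_dist(a, b):
        return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))

    loss = 0.5 * (sq_dist(e_y, z) - sq_dist(p, z))
    grad = [p_i - e_i for p_i, e_i in zip(p, e_y)]
    return loss, grad


# Undecided scores: positive loss, gradient pushes towards the gold class.
loss_tie, grad_tie = sparsemax_loss([0.0, 0.0], y=0)
# Gold score beats the rest by a wide margin: loss and gradient are exactly zero.
loss_sure, grad_sure = sparsemax_loss([3.0, 0.0, 0.0], y=0)
```

The second call shows the margin property: once sparsemax puts all probability on the gold target, the loss is exactly zero and the gradient vanishes, so such examples stop contributing updates.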

Results
Our test results are shown in Table 1. Our two models ranked second and third among official submissions to Task 1.

Experimental set-up
Each model was trained with early stopping for a maximum of 30 epochs with a batch size of 64. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of $10^{-3}$, which was halved when validation accuracy failed to improve for three consecutive epochs. We tuned the dropout and the number of inflection encoder layers on a separate grid for each language pair. Our hyperparameter ranges are shown in Table 2. At test time, we decoded with a beam size of 5. We oversampled the low resource training data 100 times and did not use synthetic data or filter the corpora. We implemented our models in PyTorch (Paszke et al., 2017) with a codebase derived from OpenNMT-py (Klein et al., 2017).
Hyperparameters for Uzbek The Uzbek training set contains only 1060 examples, far fewer than the other high resource corpora, and initial results with Uzbek language pairs were poor. We improved performance by oversampling the Uzbek data 10 times (yielding roughly the same high-low balance as the other pairs) and reducing the initial learning rate to $10^{-4}$.
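The learning rate schedule described in the set-up can be sketched as a small helper. The class name and interface are hypothetical, and one detail is an assumption not stated in the text: the counter of non-improving epochs is reset after each halving.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` consecutive epochs without improvement."""

    def __init__(self, lr=1e-3, patience=3, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy and return the learning rate to use next."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # assumption: restart the count after halving
        return self.lr


# Toy validation accuracies (made up): one improvement, then three stagnant epochs.
schedule = HalveOnPlateau(lr=1e-3)
history = [schedule.step(acc) for acc in [0.50, 0.40, 0.40, 0.40]]
```

For the Uzbek pairs, the same sketch would be instantiated with `lr=1e-4` as the starting value.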

Analysis
Next we interpret our models' behavior on a selection of language pairs from Task 1. Table 3 shows the sparsity of our attention mechanisms, averaged across language families. The attention is extremely sparse, especially over the lemma: models attend to fewer than 1.1 lemma characters per target character on average. Sparsity is more varied between language families in the inflection attention. This may be explained by typological differences between languages, which we next analyze in detail.
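A sparsity figure of this kind, the number of source positions attended per target character, can be computed by counting nonzero attention weights. The sketch below shows one way to do so for a single example; the function name and the toy attention matrix are invented for illustration, and the exact averaging behind Table 3 may differ.

```python
def mean_support_size(attention):
    """Average number of source positions with nonzero weight per target step.

    attention: list of rows, one per target step, each a sparse distribution
    over source positions.
    """
    counts = [sum(1 for w in row if w > 0.0) for row in attention]
    return sum(counts) / len(counts)


# Toy attention matrix (made up): three target steps over three source positions.
toy_attention = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
avg_attended = mean_support_size(toy_attention)
```

With softmax attention this count would always equal the source length, so the measure is only informative for sparse attention.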

Sparse Attention
Turkic languages are characterized by concatenative inflections (Bickel and Nichols, 2013b) which represent individual features (monoexponence; Bickel and Nichols, 2013a). Monoexponence should allow the inflection attention to concentrate on a single tag at a time, and Table 3 confirms that Turkic inflection attention is among the sparsest for both DOUBLEATTN and GATEDATTN models. The Azeri attention plot in Figure 1 illustrates that the inflection attention usually focuses on a single morpheme at a time, with some discrepancies at morpheme boundaries, where other tags may be relevant because of voicing assimilation rules. Furthermore, the sparse gate generally allows the model to focus on only one attention head at a time: in the Azeri example, there is only one position at which both attention heads are used. This position is the final consonant of the lemma, which appears to change because of a phonological environment created by the suffix. The shared task results suggest that sparse inflection attention is a good inductive bias for agglutinative languages: one of our models has the best test accuracy among task submissions on 11 of 20 pairs where the low resource language is Turkic and 11 of 12 pairs in the typologically similar Uralic languages.
Germanic languages present different challenges, which may explain our models' less sparse inflection attention. Often several inflections are fused into a single affix; a familiar example is the German suffix "-st", which marks a verb as present tense, second person, and singular, but has no separable parts that represent these features individually. The North Frisian plot in Figure 1 demonstrates the less sparse nature of Germanic inflection attention. Producing "wulst" from the lemma "wel" requires both a suffix and a change to the lemma, and multiple inflectional tags are attended to at several time steps. The fusional nature of the morphology means there is not a clear alignment between the inflected sequence and the tags. This is reflected in the fact that at many time steps, DOUBLEATTN and GATEDATTN disagree about which tags to attend to. Unlike in the Turkic example, GATEDATTN's gate usually gives weight to both attention heads. This makes sense because the inflection requires a change to the lemma, not just a suffix that follows it.

Sparse Output Layer
Sparse output probabilities provide tools for analysis that are not available to softmax-based models: when no hypotheses are pruned, they provide a certificate that beam search is exact; when only one hypothesis is possible, this gives an indication of the model's certainty; and when probability is distributed among a small set of hypotheses, it is easy to reason about what phenomena continue to confuse the model.
Certainty When the probability distribution is completely concentrated at each time step, the model will be able to generate only one hypothesis, regardless of the beam width. When this happens for a particular input, the model can be said to be certain for that input. This also trivially guarantees that beam search is exact because no hypotheses have been pruned. As Figure 2 shows, certainty is a strong indication of performance. This suggests future work using certainty as a validation metric, as an alternative to accuracy and loss.
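The notion of certainty can be illustrated with a toy search that enumerates every output sequence with nonzero probability. The sketch assumes a hypothetical `next_dist(prefix)` callable that returns the model's sparse next-token distribution as a dictionary; the two lookup-table "models" and their probabilities are made up for the example and do not come from our experiments.

```python
def enumerate_hypotheses(next_dist, eos="</s>", max_len=20):
    """Return [(sequence, probability)] for all finished sequences with nonzero probability."""
    finished = []
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        if prefix and prefix[-1] == eos:
            finished.append(("".join(prefix[:-1]), prob))
            continue
        if len(prefix) >= max_len:
            continue  # safety cut-off for the toy search
        for token, p in next_dist(prefix).items():
            if p > 0.0:  # zero-probability continuations are never explored
                stack.append((prefix + (token,), prob * p))
    return finished


def is_certain(hypotheses):
    """A model is certain for an input when exactly one hypothesis is possible."""
    return len(hypotheses) == 1


# Toy sparse models given as lookup tables from prefix string to next-token distribution.
AMBIGUOUS = {"": {"a": 1.0}, "a": {"b": 0.75, "c": 0.25}, "ab": {"</s>": 1.0}, "ac": {"</s>": 1.0}}
CERTAIN = {"": {"a": 1.0}, "a": {"</s>": 1.0}}

ambiguous_hyps = enumerate_hypotheses(lambda prefix: AMBIGUOUS["".join(prefix)])
certain_hyps = enumerate_hypotheses(lambda prefix: CERTAIN["".join(prefix)])
```

Such an enumeration only terminates with a small result because zero-probability continuations are never expanded; with dense softmax outputs every continuation would have to be followed.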
Interpretable ambiguity Our Turkish-Azeri GATEDATTN model demonstrates that there is also useful information to be gleaned from the cases where the model produces multiple hypotheses.
• Consonant alternations In 13 of the 21 such examples, the hypotheses differ in their treatment of stop consonants, which have very similar phonological alternations in the two languages that are represented in orthography. The ambiguity is a sign that the model has not mastered Azeri phonological rules. Nine of the examples concern lemma-final "k" and "q", which have slightly different rules in Azeri than Turkish.
• Vowel harmony In two other examples, the model produced multiple guesses for the vowels in the future tense marker. One of these examples is shown in Figure 3. In both instances, Azeri vowel harmony rules would generate "ə" in the suffix, but the model instead produced "e", which is correct with Turkish vowel harmony. This shows the influence of the high resource language.
• Other cases The last six non-certain examples consist of a loanword with an unusual character sequence, two instances where one hypothesis has the wrong possessive suffix, and two where a hypothesis inserts or drops a character. The top prediction was nonetheless correct in all six.
Figure 3: The Turkish-Azeri GATEDATTN model's full beam search for the Azeri lemma "içmek" and the tags V 3 SG FUT. All other sequences have zero probability. The correct form is "içəcək", while the model prefers "içecek", which would be correct with Turkish vowel harmony rules.
This sort of analysis is not possible with traditional dense models because probability can never become concentrated in a small set of hypotheses and it is impossible to separate legitimate ambiguities from the long tail of implausible hypotheses.
Figure 3 suggests that our models do a good job of concentrating probability in a small number of hypotheses. This raises the question of whether, by underspecifying the inflectional tags, the set of possible hypotheses in the beam can be made to resemble a lemma's complete paradigm. To investigate, we trained monolingual models with the English, German, and Turkish data from the high resource setting of Task 1 of the CoNLL-SIGMORPHON 2018 Shared Task. We used mostly the same hyperparameters as for this year's submission, with three differences: there are no language embeddings, the inflection tags are not used, and the models have single-headed attention over the lemma sequences. We increased the beam width to 10 in order to accommodate the models' greater uncertainty.
With English, this often works well: for the regular verb "jitter", the model's only possible hypotheses are "jittered", "jittering", "jitters", and "jitter", which is the complete paradigm. Irregular verbs often have a handful of other hypotheses, and sometimes the beam gives some probability to misspellings. Something similar can be seen in German, although the beam rarely contains all surface forms. For German nouns, the beam often shows uncertainty about plural formation: the hypotheses for "Nadelbaum" include "Nadelbaume", "Nadelbäume", and "Nadelbäumer", all of which are plausible German plurals. Turkish has very large paradigms, so in general it is not possible to fit all forms into a beam of any reasonable size. However, the hypotheses in the beam do typically correspond to correct forms.

Conclusion
We presented a new style of seq2seq model which brings together two-headed attention (Ács, 2018) and sparse modeling for morphological inflection (Peters et al., 2019). Our models learn sparse attention distributions in both attention heads. Their sparse probability distribution over hypotheses often allows beam search to become exact, while the remaining ambiguities often have a clear linguistic interpretation. The two versions of our model rank second and third among submissions to Task 1.