One-Size-Fits-All Multilingual Models

This paper presents DeepSPIN’s submissions to Tasks 0 and 1 of the SIGMORPHON 2020 Shared Task. For both tasks, we present multilingual models, trained jointly on data from all languages. We perform no language-specific hyperparameter tuning – each of our submissions uses the same model for all languages. Our basic architecture is the sparse sequence-to-sequence model with entmax attention and loss, which allows our models to learn sparse, local alignments while remaining trainable with gradient-based techniques. For Task 1, we achieve strong performance with both RNN- and transformer-based sparse models. For Task 0, we extend our RNN-based model to a multi-encoder set-up in which separate modules encode the lemma and inflection sequences. Despite our models’ lack of language-specific tuning, they tie for first in Task 0 and place third in Task 1.


Introduction
Character transduction tasks such as grapheme-to-phoneme conversion (g2p) and morphological inflection are important in many practical real-world applications. However, it is often difficult to train models for these tasks with deep learning techniques, due to the scarcity of labeled data for most of the world's languages. In these circumstances, it is common to use a non-neural method with a stronger inductive bias (Novak et al., 2016) or to generate synthetic data that hopefully ameliorates the data scarcity problem. We find both of these choices unsatisfying. First, older non-neural techniques have a higher floor but also a lower ceiling: previous SIGMORPHON shared tasks have shown that neural methods outpace them in the presence of even moderate quantities of data (Cotterell et al., 2017). Second, although data augmentation has proven helpful for morphological inflection (Anastasopoulos and Neubig, 2019), any data augmentation procedure makes implicit assumptions about language structure: techniques that work for Western languages may fail when confronted with reduplication, vowel harmony, or non-concatenative morphology. The kinds of languages for which labeled data are scarce are precisely the languages for which NLP practitioners' assumptions are most suspect.

Therefore, our submissions to this shared task make use of a third alternative: multilingual training. Similarly to hallucinated data, multilingual training improves results in low-resource settings by acting as a regularizer. However, the models it yields are more versatile, as they are capable of good performance on several languages at the same time. We show that our technique is competitive with state-of-the-art monolingually trained models regardless of training data size, for both g2p and morphological inflection. This is despite our approach having a significant disadvantage from a tuning perspective: while conventional monolingual models can tune their hyperparameters separately for each language, we use exactly the same model for each language within a submission.
Our contributions are as follows:
• We reimplement gated sparse two-headed attention (Peters and Martins, 2019) and apply it to a massively multilingual setting. We submit versions of this model using 1.5-entmax and sparsemax (Martins and Astudillo, 2016) as softmax alternatives. We tie for first place in Task 0 (Vylomova et al., 2020); among the winners, ours are the only multilingual models.
• We show that sparse seq2seq techniques, previously used for morphological inflection and machine translation, are also effective for multilingual g2p. We make four submissions to Task 1 (Gorman et al., 2020), which differ in their choice of softmax replacement (1.5-entmax or sparsemax) and their architecture (RNN or transformer). Our strongest models finish third in word error rate (WER) and second in phoneme error rate (PER). Our submissions record the top result on at least one metric for 7 of the 15 languages, including 4 of the 5 surprise languages.

Models
The common theme of the models we submit is their use of sparse functions for attention weights and output distributions, in place of the better-known softmax (Bridle, 1990). Sparse functions have the following motivations:
• Sparse attention has previously shown success on morphological inflection. It allows the decoder to attend to a small number of source positions at each time step, unlike the dense softmax. While hard attention has also performed well for character transduction (Aharoni and Goldberg, 2017; Makarov et al., 2017; Wu et al., 2018; Wu and Cotterell, 2019), it usually requires an elaborate and slow training procedure. Sparse attention, on the other hand, requires no training techniques beyond those used for standard seq2seq models.
• Sparse output distributions allow probability mass to be concentrated in a small number of hypotheses. In practice, this happens frequently for morphological inflection, sometimes making beam search exact.

Entmax and its loss
Our tool for achieving sparsity is the entmax activation function, which is parameterized by a scalar α ≥ 1 and maps a vector z ∈ ℝⁿ onto the n-dimensional probability simplex △ⁿ := {p ∈ ℝⁿ : p ≥ 0, 1⊤p = 1}:

$$\alpha\text{-entmax}(\mathbf{z}) := \operatorname*{argmax}_{\mathbf{p} \in \triangle^n} \; \mathbf{p}^\top \mathbf{z} + \mathsf{H}^{\mathsf{T}}_{\alpha}(\mathbf{p}),$$

where $\mathsf{H}^{\mathsf{T}}_{\alpha}$ is the Tsallis α-entropy (Tsallis, 1988). For purposes of the shared task, the key point is that α controls the sparsity of the distribution: α = 1 recovers softmax, while any value greater than 1 can result in a sparse probability distribution. Sparsemax (Martins and Astudillo, 2016) is equivalent to entmax with α = 2.
An important note about models with sparse output layers is that they cannot be trained with the cross-entropy loss, which becomes infinite when the model assigns zero probability to the gold label. Fortunately, for each value of α there is a corresponding loss function, given by

$$L_\alpha(\mathbf{z}, y) = (\mathbf{p} - \mathbf{e}_y)^\top \mathbf{z} + \mathsf{H}^{\mathsf{T}}_{\alpha}(\mathbf{p}),$$

where p := α-entmax(z) and e_y is the one-hot indicator vector of the gold label y. This is an instance of a Fenchel-Young loss (Blondel et al., 2020).
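Although our submissions rely on existing entmax implementations, the α = 2 case (sparsemax) can be computed in closed form with a simple sort-and-threshold procedure. The following NumPy sketch of sparsemax and its loss is purely illustrative; the function names are our own.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (entmax with alpha=2): Euclidean projection of z onto the
    probability simplex; the result may contain exact zeros."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]               # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum       # the support is a prefix of the sorted scores
    k_z = k[support][-1]                      # support size
    tau = (cumsum[support][-1] - 1) / k_z     # threshold
    return np.maximum(z - tau, 0.0)

def tsallis_entropy(p, alpha=2.0):
    """Tsallis alpha-entropy of a probability vector (for alpha=2: 0.5 * (1 - sum p_j^2))."""
    return (1.0 - np.sum(p ** alpha)) / (alpha * (alpha - 1.0))

def sparsemax_loss(z, gold):
    """Fenchel-Young loss for sparsemax: stays finite even when p[gold] == 0."""
    p = sparsemax(z)
    e_y = np.zeros_like(p)
    e_y[gold] = 1.0
    return float((p - e_y) @ z + tsallis_entropy(p))

z = np.array([2.0, 1.0, -1.0, -2.0])
print(sparsemax(z))          # [1. 0. 0. 0.] -- exact zeros, unlike softmax
print(sparsemax_loss(z, 1))  # 1.0 -- finite even though p[1] == 0
```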

Task 0 Architecture
For morphological inflection, we use an RNN-based two-encoder model with gated attention (Peters and Martins, 2019). In this model, two separate bidirectional LSTMs (Graves and Schmidhuber, 2005) encode the lemma character sequence and the set of inflectional tags. A unidirectional LSTM (Hochreiter and Schmidhuber, 1997) decoder then generates the target sequence. The decoder is similar to a conventional RNN decoder with input feeding, except that separate attention mechanisms compute context vectors independently for each encoder. A gate function then interpolates the two context vectors. Like Peters and Martins (2019), we use a sparse gate, which allows the model to completely ignore one encoder or the other at each time step. Each individual attention head uses bilinear attention (Luong et al., 2015).
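To make the control flow concrete, the following PyTorch sketch shows gated two-headed bilinear attention over separately encoded lemma and tag states. It is a simplification, not our submitted code: the module and variable names are our own, and it uses softmax where our models use sparse entmax attention and a sparse gate.

```python
import torch
import torch.nn as nn

class GatedTwoHeadedAttention(nn.Module):
    """Attend separately over lemma and tag encoder states, then interpolate
    the two context vectors with a learned gate."""
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.proj_lemma = nn.Linear(dec_dim, enc_dim, bias=False)  # bilinear score weights
        self.proj_tags = nn.Linear(dec_dim, enc_dim, bias=False)
        self.gate = nn.Linear(dec_dim + 2 * enc_dim, 2)

    def attend(self, proj, query, keys):
        # query: (batch, dec_dim); keys: (batch, src_len, enc_dim)
        scores = torch.bmm(keys, proj(query).unsqueeze(2)).squeeze(2)  # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)  # our models use 1.5-entmax/sparsemax here
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)       # context: (batch, enc_dim)

    def forward(self, query, lemma_states, tag_states):
        c_lemma = self.attend(self.proj_lemma, query, lemma_states)
        c_tags = self.attend(self.proj_tags, query, tag_states)
        g = torch.softmax(self.gate(torch.cat([query, c_lemma, c_tags], dim=-1)), dim=-1)
        # A sparse gate (e.g. sparsemax) can assign exactly zero weight to one encoder.
        return g[:, :1] * c_lemma + g[:, 1:] * c_tags

attn = GatedTwoHeadedAttention(dec_dim=64, enc_dim=32)
context = attn(torch.randn(8, 64), torch.randn(8, 12, 32), torch.randn(8, 5, 32))
print(context.shape)  # torch.Size([8, 32])
```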

Task 1 Architecture
We experiment with both RNN-based (Bahdanau et al., 2015) and transformer-based (Vaswani et al., 2017) models for g2p. As in Task 0, our RNNs use input feeding and bilinear attention.

Handling Multilinguality
Multilingual NLP tasks are intrinsically more difficult than their monolingual counterparts, as the correct way to process a sample depends on which language the sample comes from. A simple approach to multilingual NLP is to append a token to each input sequence identifying the language of the sample; this has proven effective for both g2p (Peters et al., 2017) and morphological inflection, and is similar to techniques for multilingual neural machine translation (Johnson et al., 2017). However, this technique has drawbacks: it forces the true characters and the language token to "compete" for attention, and it requires the learned language embedding to have the same size as the character embeddings.
Therefore, we use the alternative technique of concatenating a language embedding to the encoder and decoder input at each time step. Within an example, the language embedding is the same across all time steps. We do not tie language embeddings between the encoder (or encoders) and decoder, allowing each model to learn different language representations for different purposes.
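A minimal sketch of this input scheme is shown below (the module name and dimensions are illustrative, not taken from our configuration): the language embedding is looked up once per example and concatenated to every character embedding before the encoder, and analogously on the decoder side, with a separate language table in each.

```python
import torch
import torch.nn as nn

class LangConcatEmbedding(nn.Module):
    """Concatenate a per-example language embedding to every character embedding."""
    def __init__(self, num_chars, num_langs, char_dim=64, lang_dim=16):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.lang_emb = nn.Embedding(num_langs, lang_dim)  # encoder and decoder each learn their own table

    def forward(self, char_ids, lang_id):
        # char_ids: (batch, seq_len); lang_id: (batch,)
        chars = self.char_emb(char_ids)                     # (batch, seq_len, char_dim)
        langs = self.lang_emb(lang_id).unsqueeze(1)         # (batch, 1, lang_dim)
        langs = langs.expand(-1, char_ids.size(1), -1)      # same vector at every time step
        return torch.cat([chars, langs], dim=-1)            # (batch, seq_len, char_dim + lang_dim)

emb = LangConcatEmbedding(num_chars=100, num_langs=15)
out = emb(torch.randint(0, 100, (4, 10)), torch.tensor([0, 3, 3, 7]))
print(out.shape)  # torch.Size([4, 10, 80])
```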

Preprocessing
Task 0 We used character-level tokenization for lemma and inflected forms. Each inflectional tag was treated as a separate token.
Task 1 Prior to training, we decomposed compound characters in the grapheme sequences in all languages. For most languages, this simply amounts to splitting diacritics and their base characters into separate tokens. For Korean, however, it makes a major difference due to the unique structure of the Hangul alphabet. Individual letters in Hangul, called jamo, are composed into blocks representing syllables. Modern Hangul contains 40 jamo, but the number of possible syllables licensed by Korean phonotactics is much larger. Consequently, a naïve tokenization of the Korean training data gives a vocabulary size of 834 types, of which more than 30% occur only once. We suspect that the lack of jamo tokenization is the reason for the baselines' poor performance on Korean.
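This decomposition can be performed with standard Unicode NFD normalization; a small illustration (not necessarily identical to our preprocessing script):

```python
import unicodedata

def decompose(graphemes: str) -> list:
    """Split precomposed characters into base + combining parts. NFD
    normalization separates diacritics from base letters and decomposes
    Hangul syllable blocks into their constituent jamo."""
    return list(unicodedata.normalize("NFD", graphemes))

print(decompose("é"))     # ['e', '\u0301']: base letter plus combining acute accent
print(decompose("한글"))  # six jamo tokens in place of two syllable-block tokens
```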

Experimental Set-up
We ran experiments with three sparse seq2seq architectures: RNNs for inflection, RNNs for g2p, and transformers for g2p. For entmax, we used two α values: 1.5 and 2 (i.e. sparsemax). We used the same α value in both the attention mechanism and the loss function. Combining the architectures and entmax functions gives six model configurations. For each, we trained three model runs with the same hyperparameters (due to time constraints, the TRANSFORMER-SPARSEMAX ensemble used only two models). At test time, we ensembled the models by averaging their probabilities.
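As a sketch, probability averaging amounts to the following per-step computation (illustrative code with our own names; entmax and sparsemax output layers already produce probability distributions, so their outputs can be averaged directly):

```python
import torch

def ensemble_step(models, decoder_state):
    """Average the per-model output distributions for one decoding step."""
    probs = [m(decoder_state) for m in models]        # each: (batch, vocab_size)
    return torch.stack(probs, dim=0).mean(dim=0)      # ensemble distribution

# Toy usage with stand-in "models" that return fixed distributions.
fake_models = [lambda _: torch.tensor([[0.9, 0.1, 0.0]]),
               lambda _: torch.tensor([[0.6, 0.0, 0.4]])]
print(ensemble_step(fake_models, None))  # tensor([[0.7500, 0.0500, 0.2000]])
```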

Training
We implemented our models with JoeyNMT (Kreutzer et al., 2019). Our hyperparameters are shown in Table 1. Each model was trained with early stopping for a maximum of 100 epochs. We used greedy decoding at validation time, saving the model if it had the best character error rate so far. We used the Adam optimizer (Kingma and Ba, 2015). For RNNs, we set the initial learning rate to 0.001, halving it whenever the model failed to improve for two consecutive validations. Validation was performed every 10,000 steps for Task 0 and every 500 steps for Task 1. Transformers were trained with a linear learning rate warm-up for 4,000 steps, after which the learning rate was decayed by an inverse square root schedule.
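For reference, the transformer schedule can be written as a simple function of the step count; a minimal sketch (the peak learning rate here is a placeholder, not the value from Table 1):

```python
def transformer_lr(step: int, peak_lr: float = 2e-4, warmup_steps: int = 4000) -> float:
    """Linear warm-up for `warmup_steps` steps, then inverse square root decay.
    `peak_lr` is a placeholder, not a value taken from Table 1."""
    step = max(step, 1)
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps           # linear warm-up
    return peak_lr * (warmup_steps / step) ** 0.5      # inverse sqrt decay

# The RNN schedule corresponds to a standard halve-on-plateau rule, roughly:
#   torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
```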

Results
At test time, we decoded with a beam size of 5. Task 0 results are shown in Table 2 and Task 1 results are in Table 3. For Task 0, our sparsemax model outperforms a very strong baseline, with entmax not far behind. For Task 1, all of our models outperform all three baselines. In both tasks, the baselines were trained monolingually, so they were able to use language-specific hyperparameter tuning that is unavailable for multilingual models.

Analysis
Next we consider a few questions that multilingual models raise.

How much data does inflection need?
All other things being equal, we expect the performance of a model to improve as the amount of training data increases. This is generally the case: Figure 1 shows that accuracy is usually above 90% for languages with more than 10,000 training samples. However, performance varies much more widely at smaller training sizes. Per-family development set results are shown in Table 4. While families like Niger-Congo record very strong results with modest resources, Germanic and Uralic struggle despite their large training sets. It is likely that certain morphological patterns are easier to learn than others, but we hesitate to make strong statements: results often differ sharply between closely related languages, such as Danish (68.20% on dev) and Swedish (99.20%). More research is needed to identify other factors besides morphological typology that influence results.

Crosslingual Character Embeddings
Learning good word representations has been a prominent subject in NLP for several years (Mikolov et al., 2013; Peters et al., 2018). Although many models operate at the character level, relatively little attention has been paid to the character embeddings themselves. Characters lack semantic meaning, so character embeddings learned for "semantic" tasks are unlikely to exhibit any particular structure. However, Figure 2 shows that multilingual g2p may be useful for learning phonologically grounded character representations: graphemes from different scripts cluster together if they represent similar phonemes. We suspect that multilingual training with phonological supervision is a necessary ingredient for this to work: characters from different scripts are never mixed within a single sample, so the grapheme contexts in which they occur are completely disjoint. This idea differs from work on phoneme embeddings (Silfverberg et al., 2018; Sofroniev and Çöltekin, 2018) in that the focus is explicitly on the graphemes. Grapheme embeddings learned for phonological tasks may prove useful for transliteration, or for processing informally romanized text (Irvine et al., 2012) jointly with data from the official orthography.

Related Work
Multi-encoder models Several previous works have considered ways to integrate information from multiple sources in a neural seq2seq model. Although initially proposed as a way to leverage multi-parallel data in machine translation (Zoph and Knight, 2016), the multi-encoder approach has also been used for handling multimodal data, and Ács (2018) applied it to morphological inflection: our architecture is essentially a sparsified version of this model. Past works have also considered the effect of different strategies for merging the attention from the various encoders (Libovický and Helcl, 2017; Libovický et al., 2018). This is worth exploring for morphological inflection, as Peters and Martins (2019) showed that the behavior of the attention gating mechanism varies between language families. The optimal strategy is probably different for different languages.
Phonemes and multilinguality Multilingual methods have previously been used for low resource g2p in conjunction with both non-neural (Deri and Knight, 2016) and neural (Peters et al., 2017;Route et al., 2019) architectures. Our model is essentially identical to Peters et al. (2017)'s, but with a different mechanism for identifying the language, inspired by a technique for learning language embeddings from multilingual language modeling (Östling and Tiedemann, 2017). A natural connection is to work that makes use of typological information in multilingual NLP (Tsvetkov et al., 2016). However, care needs to be taken when applying this to g2p: Bjerva and Augenstein (2018) showed that language representations learned from multilingual g2p generally do not encode typological features because orthographic similarity does not correlate with typological similarity.

Conclusion
We showed that massively multilingual models are competitive with the individually tuned state of the art for morphological inflection and g2p. We presented the first results applying entmax-based sparse attention and losses to g2p, showing that they perform well with both RNN and transformer models. We release our code to facilitate further research.