Towards Zero-shot Language Modeling

Can we construct a neural language model which is inductively biased towards learning human language? Motivated by this question, we aim at constructing an informative prior for held-out languages on the task of character-level, open-vocabulary language modelling. We obtain this prior as the posterior over network weights conditioned on the data from a sample of training languages, which is approximated through Laplace’s method. Based on a large and diverse sample of languages, the use of our prior outperforms baseline models with an uninformative prior in both zero-shot and few-shot settings, showing that the prior is imbued with universal linguistic knowledge. Moreover, we harness broad language-specific information available for most languages of the world, i.e., features from typological databases, as distant supervision for held-out languages. We explore several language modelling conditioning techniques, including concatenation and meta-networks for parameter generation. They appear beneficial in the few-shot setting, but ineffective in the zero-shot setting. Since the paucity of even plain digital text affects the majority of the world’s languages, we hope that these insights will broaden the scope of applications for language technology.


Introduction
With the success of recurrent neural networks and other black-box models on core NLP tasks, such as language modeling, researchers have turned their attention to the study of the inductive bias such neural models exhibit (Linzen et al., 2016; Marvin and Linzen, 2018; Ravfogel et al., 2018). A number of natural questions have been asked. For example, do recurrent neural language models learn syntax (Marvin and Linzen, 2018)? Do they map onto grammaticality judgments (Warstadt et al., 2019)? However, as Ravfogel et al. (2019) note, "[m]ost of the work so far has focused on English." Moreover, these studies have almost always focused on training scenarios where a large number of in-language sentences are available.
In this work, we aim to find a prior distribution over network parameters that generalizes well to new human languages. The recent vein of research on the inductive biases of neural nets implicitly assumes a uniform (unnormalizable) prior over the space of neural network parameters (Ravfogel et al., 2019, inter alia). In contrast, we take a Bayesian-updating approach: first, we approximate the posterior distribution over the network parameters with the Laplace method (Azevedo-Filho and Shachter, 1994), given the data from a sample of seen training languages. Afterwards, this distribution serves as a prior for maximum-a-posteriori (MAP) estimation of network parameters for the held-out, unseen languages.
The search for a universal prior for linguistic knowledge is motivated by the notion of Universal Grammar (UG), originally proposed by Chomsky (1959). The presence of innate biological properties of the brain that constrain possible human languages was posited to explain why children learn languages so quickly despite the poverty of the stimulus (Chomsky, 1978; Legate and Yang, 2002). In turn, UG has been connected with the typological universals of Greenberg (1963) by Graffi (1980) and Gilligan (1989): this way, the patterns observed in cross-lingual variation could be explained by an innate set of parameters wired into a language-specific configuration during the early phases of language acquisition.

Our study explores the task of character-level language modeling. Specifically, we choose an open-vocabulary setup, where no token is treated as unknown, to allow for a fair comparison among the performances of different models across different languages (Gerz et al., 2018a,b; Cotterell et al., 2018; Mielke et al., 2019). We run experiments under several regimes of data scarcity for the held-out languages (zero-shot, few-shot, and joint multilingual learning) over a sample of 77 typologically diverse languages.
As an orthogonal contribution, we also note that realistically we are not completely in the dark about held-out languages, as coarse-grained grammatical features are documented for most of the world's languages and available in typological databases such as URIEL (Littell et al., 2017). Hence, we also explore a regime where we condition the universal prior on typological side information. In particular, we consider concatenating typological features to hidden states (Östling and Tiedemann, 2017) and generating the network parameters with hyper-networks that receive typological features as inputs (Platanios et al., 2018).
Empirically, our study offers two findings. First, neural recurrent models with a universal prior significantly outperform baselines with uninformative priors in both zero-shot and few-shot training settings. Second, conditioning on typological features further reduces bits per character in the few-shot setting, but we report negative results for the zero-shot setting, possibly due to some inherent limitations of typological databases (Ponti et al., 2019).
The study of low-resource language modeling also has a practical impact. According to Simons (2017), 45.71% of the world's languages do not have written texts available. The situation is even more dire for their digital footprint. As of March 2015, just 40 of the 188 languages documented on the Internet accounted for 99.99% of the web pages.¹ And as of April 2019, Wikipedia is translated into only 304 of the 7,097 existing languages. What is more, Kornai (2013) prognosticates that the digital divide will act as a catalyst for the extinction of many of the world's languages. The transfer of language technology may help reverse this course and give space to unrepresented communities.

LSTM Language Models
In this work, we address the task of character-level language modeling. Whereas word lexicalization is mostly arbitrary across languages, phonemes allow for transferring universal constraints on phonotactics² and language-specific sequences that may be shared across languages, such as borrowings and cognates (Brown et al., 2008). Since languages are mostly recorded in text rather than phonemic symbols (IPA), however, we focus on characters as a loose approximation of phonemes.

¹ https://w3techs.com/technologies/overview/content_language/all
² E.g., with few exceptions (Evans and Levinson, 2009, sec. 2.2.2), the basic syllabic structure is consonant-vowel.
Let Σ_ℓ be the set of characters for language ℓ. Moreover, consider a collection of languages T ∪ E partitioned into two disjoint sets of observed (training) languages T and held-out (evaluation) languages E. Then, let Σ = ∪_{ℓ∈(T∪E)} Σ_ℓ be the union of the character sets of all languages. A universal, character-level language model is a probability distribution over Σ*.³ Let x ∈ Σ* be a sequence of characters. We write:

p(x | w) = ∏_{t=1}^{n} p(x_t | x_{<t}, w)    (1)

where t is a time step, x_0 is a distinguished beginning-of-sentence symbol, w are the parameters, and every sequence x ends with a distinguished end-of-sentence symbol x_n. We implement character-level language models with Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). These encode the entire history x_{<t} as a fixed-length vector h_t by manipulating a memory cell c_t through a set of gates. Then we define

p(x_t | x_{<t}, w) = softmax(W h_t + b)    (2)

LSTMs have an advantage over other recurrent architectures, as memory gating mitigates the problem of vanishing gradients and captures long-distance dependencies (Pascanu et al., 2013).
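To make the factorization concrete, the following toy sketch (not the paper's trained model) scores a string as a product of next-character probabilities over a tiny alphabet; a fixed random vector stands in for the LSTM's hidden state h_t, and the softmax readout `W h + b` is an assumption about the output layer.

```python
import numpy as np

alphabet = ["a", "b", "</s>"]
rng = np.random.default_rng(0)
W = rng.normal(size=(len(alphabet), 4))  # output projection (illustrative shape)
b = np.zeros(len(alphabet))

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def step_distribution(h):
    # p(x_t | x_{<t}, w) as a softmax over the alphabet
    return softmax(W @ h + b)

def sequence_log_prob(chars, hidden_states):
    # chain rule: log p(x | w) is the sum of per-step log-probabilities
    total = 0.0
    for ch, h in zip(chars, hidden_states):
        p = step_distribution(h)
        total += np.log(p[alphabet.index(ch)])
    return total

h_states = [rng.normal(size=4) for _ in range(3)]
lp = sequence_log_prob(["a", "b", "</s>"], h_states)  # a (negative) log-probability
```

In the actual model the hidden states would of course be produced by the LSTM recurrence rather than drawn at random.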

Neural Language Modeling with a Universal Prior
The fundamental hypothesis of this work is that there exists a prior p(w) over the weights of a neural language model that places high probability on networks that describe human-like languages. Such a prior would provide an inductive bias that facilitates learning unseen languages. In practice, we construct it as the posterior distribution over the weights of a language model of seen languages. Let D_ℓ be the examples in language ℓ, and let the examples in all training languages be D = ∪_{ℓ∈T} D_ℓ. Taking a Bayesian approach, the posterior over weights is given by Bayes' rule:

p(w | D) = p(D | w) p(w) / p(D) ∝ p(w) ∏_{ℓ∈T} ∏_{x∈D_ℓ} p(x | w)    (3)

We take the prior of eq. (3) to be a Gaussian with zero mean and covariance matrix σ²I, i.e.

p(w) = N(w; 0, σ²I)    (4)
However, computation of the posterior p(w | D) is woefully intractable: recall that, in our setting, each p(x | w) is an LSTM language model, like the one defined in eq. (2). Hence, we opt for a simple approximation of the posterior, the classic Laplace method (MacKay, 1992). This method has recently been applied to other transfer learning or continual learning scenarios in the neural network literature (Kirkpatrick et al., 2017; Kochurov et al., 2018; Ritter et al., 2018). In §3.1, we first introduce the Laplace method, which approximates the posterior with a Gaussian centered at the maximum-likelihood estimate.⁴ Its covariance matrix can be computed with backpropagation, as detailed in §3.2. Finally, we describe how to use this distribution as a prior to perform maximum-a-posteriori inference over new data in §3.3.

Laplace Method
First, we (locally) maximize the logarithm of the RHS of eq. (3):

L(w) = log p(w) + Σ_{ℓ∈T} Σ_{x∈D_ℓ} log p(x | w)    (5)

We note that L(w) is equivalent to the log-posterior up to an additive constant, i.e.

L(w) = log p(w | D) + log p(D)    (6)
where the constant log p(D) is the log-normalizer. Let w* be a local maximizer of L.⁵ We now approximate the log-posterior with a second-order Taylor expansion around w*:

log p(w | D) ≈ L(w*) + ½ (w − w*)ᵀ H (w − w*) − log p(D)    (7)

where H = ∇²L(w*) is the Hessian matrix. Note that we have omitted the first-order term, since the gradient ∇L(w*) = 0 at the local maximizer w*. This quadratic approximation to the log-posterior is Gaussian, which can be seen by exponentiating the RHS of eq. (7):

p(w | D) ≈ N(w; w*, −H⁻¹)    (8)

where exp(L(w*)) cancels from both numerator and denominator. Since w* is a local maximizer, H is a negative semi-definite matrix.⁶ The full derivation is given in App. C. In principle, computing the Hessian is possible by running backpropagation twice: this yields a matrix with d² entries. However, in practice, this is not feasible. First, running backpropagation twice is tedious. Second, we cannot easily store a matrix with d² entries, since d is the number of parameters in the language model, which is exceedingly large.
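The recipe above can be checked numerically on a conjugate toy problem with a single parameter (all quantities below are illustrative, not from the paper): maximize L(w), estimate the curvature at the mode, and read off the variance of the Laplace Gaussian.

```python
import numpy as np

# Toy Laplace approximation: x ~ N(w, 1) with prior w ~ N(0, sigma_prior^2).
sigma_prior = 2.0
data = np.array([0.9, 1.1, 1.3, 0.7])

def L(w):
    # log prior + log likelihood, up to additive constants
    log_prior = -0.5 * w**2 / sigma_prior**2
    log_lik = -0.5 * np.sum((data - w) ** 2)
    return log_prior + log_lik

# For this conjugate toy the maximizer has a closed form, which lets us
# sanity-check the numerical curvature against the exact answer.
w_star_closed = data.sum() / (len(data) + 1.0 / sigma_prior**2)

# Curvature at the mode via central finite differences: H = L''(w*)
eps = 1e-4
H = (L(w_star_closed + eps) - 2 * L(w_star_closed) + L(w_star_closed - eps)) / eps**2

# The Laplace posterior is N(w*, -1/H); H < 0 at a maximum.
posterior_var = -1.0 / H
```

For a neural language model the same curvature cannot be computed this way, which is what motivates the Fisher-based approximation of the next subsection.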

Approximating the Hessian
To cut the computation down to one pass, we exploit a property from theoretical statistics: Namely, that the Hessian of the log-likelihood bears a close resemblance to a quantity known as the Fisher information matrix. This connection allows us to develop a more efficient algorithm that approximates the Hessian with one pass of backpropagation.
We derive this approximation to the Hessian of L(w) here. First, we note that, due to the linearity of ∇², we have

∇²L(w) = ∇² log p(w) + Σ_{ℓ∈T} Σ_{x∈D_ℓ} ∇² log p(x | w)    (9)

Note that the aggregation over languages ℓ ∈ T is a discrete summation, so we may exchange addends and derivatives as required for the proof.
We now discuss each term of eq. (9) individually. First, to approximate the likelihood term, we draw on the relation between the Hessian and the Fisher information matrix. A basic fact from information theory (Cover and Thomas, 2006) gives us that the Fisher information matrix may be written in two equivalent ways:

I(w) = E[∇ log p(x | w) ∇ log p(x | w)ᵀ] = −E[∇² log p(x | w)]    (10)

This equality suggests a natural approximation of the expected Fisher information matrix, namely the observed Fisher information matrix

J_D(w) = (1/|D|) Σ_{x∈D} ∇ log p(x | w) ∇ log p(x | w)ᵀ    (11)

which is tight in the limit as |D| → ∞ due to the law of large numbers. Indeed, when we have a large number of training exemplars, the average of the outer products of the gradients will be a good approximation to the (negated) average Hessian of the log-likelihood. However, even this approximation still has d² entries, which is far too many to be practical. Thus, we further use a diagonal approximation. We denote the diagonal of the observed Fisher information matrix as the vector f ∈ R^d, which we define as

f = (1/|D|) Σ_{x∈D} (∇ log p(x | w*))²    (12)

where the (·)² is applied point-wise. Computation of the Hessian of the prior term in eq. (9) is more straightforward and does not require approximation. Indeed, in general, this is the negative inverse of the covariance matrix, which in our case means

∇² log p(w) = −(1/σ²) I    (13)

Summing the (approximate) Hessian of the log-likelihood in eq. (12) and the Hessian of the prior in eq. (13) yields our approximation to the Hessian of the log-posterior:

H ≈ −|D| diag(f) − (1/σ²) I    (14)

The full derivation of the approximated Hessian is available in App. D.
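The diagonal observed Fisher can be sketched on a toy model where the answer is known in closed form (assumptions: x ~ N(w, 1) per coordinate, so the per-example gradient is x − w and the true Fisher is 1 on every coordinate; sizes are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = np.array([0.5, -1.0, 2.0])            # a stand-in MAP estimate
data = w_star + rng.normal(size=(50000, 3))    # 50k "examples"

# Diagonal observed Fisher: average of point-wise squared per-example
# gradients of the log-likelihood, evaluated at w*.
grads = data - w_star
f = np.mean(grads ** 2, axis=0)                # should approach 1 everywhere

# Approximate Hessian of the log-posterior: likelihood term plus the
# (exact) prior term -1/sigma^2 on the diagonal.
sigma2 = 4.0
H_diag = -len(data) * f - 1.0 / sigma2         # negative, as at a maximum
```

For an LSTM, `grads` would instead come from one backpropagation pass per example, which is the single-pass saving this section describes.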

MAP Inference
Finally, we harness the posterior p(w | D) ≈ N (w , −H −1 ) as the prior over model parameters for training a language model on new, held-out languages via MAP estimation. This is only an approximation to full Bayesian inference, because it does not characterize the entire distribution of the posterior, just the mode (Gelman et al., 2013).
In the zero-shot setting, this boils down to using the mean of the prior, w*, as network parameters during evaluation. In the few-shot setting, instead, we assume that some data for the target language ℓ ∈ E is available. Therefore, we maximize the log-likelihood given the target-language data plus a regularizer that embodies the prior, scaled by a factor λ:

w_ℓ = argmax_w  Σ_{x∈D_ℓ} log p(x | w) + (λ/2) (w − w*)ᵀ H (w − w*)    (15)

We denote the prior N(w*, −H⁻¹) that features in eq. (15) as UNIV, as it incorporates universal linguistic knowledge. As a baseline for this objective, we perform MAP inference with an uninformative prior N(0, I), which we label NINF. In the zero-shot setting, this means that the parameters are sampled from the uninformative prior. In the few-shot setting, we maximize

w_ℓ = argmax_w  Σ_{x∈D_ℓ} log p(x | w) − (λ/2) ‖w‖²    (16)

Note that, owing to this formulation, the uninformed NINF model does not have access to the posterior of the weights given the data from the training languages. Moreover, as an additional baseline, we consider a common approach for transfer learning in neural networks (Ruder, 2017), namely 'fine-tuning': after finding the maximum-likelihood value w* on the training data, this is simply used to initialize the weights before further optimizing them on the held-out data. We label this method FITU.
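The few-shot objective with the UNIV prior can be sketched with a diagonal curvature and a toy quadratic stand-in for the negative log-likelihood (every name and number below is illustrative, not the paper's code): coordinates with high curvature f_i stay pinned near w*, while low-curvature coordinates move freely toward the held-out data.

```python
import numpy as np

def map_objective(w, w_star, f, lam, nll):
    # NLL plus the Laplace-prior penalty (lam/2) * sum_i f_i (w_i - w*_i)^2
    penalty = 0.5 * lam * np.sum(f * (w - w_star) ** 2)
    return nll(w) + penalty

# Toy held-out language: quadratic NLL centered at w_new.
w_star = np.array([0.0, 0.0])          # prior mean (from training languages)
w_new = np.array([1.0, 1.0])           # held-out-language optimum
f = np.array([10.0, 0.1])              # first coord "informative", second not
nll = lambda w: 0.5 * np.sum((w - w_new) ** 2)
lam = 1.0

# Closed-form minimizer of this quadratic objective (per coordinate):
w_map = (w_new + lam * f * w_star) / (1.0 + lam * f)
```

The informative coordinate lands near w* (1/11 ≈ 0.09) while the uninformative one moves almost all the way to the data optimum (1/1.1 ≈ 0.91), which is exactly the behavior the signal-to-noise analysis in App. B observes.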

Language Modeling Conditioned on Typological Features
Realistically, the prior over network weights should also be augmented with side information about the general properties of the held-out language to be learned, if such information is available. In fact, linguists have documented such information even for languages without plain digital texts available and stored it in the form of attribute-value features in publicly accessible databases (Croft, 2002;Dryer and Haspelmath, 2013).
The usage of such features to inform neural NLP models is still scarce, partly because the evidence in favor of their effectiveness is mixed (Ponti et al., 2018, 2019). In this work, we propose a way to effectively supervise the model distantly with this side information. We extend our non-conditional language models outlined in §3 (BARE) to a series of variants conditioned on language-specific properties, inspired by Östling and Tiedemann (2017) and Platanios et al. (2018). A fundamental difference from these previous works, however, is that they learn such properties in an end-to-end fashion from the data in a joint multilingual learning setting. Obviously, this is not feasible for the zero-shot setting and unreliable for the few-shot setting. Rather, we represent languages with their typological feature vector, which we assume to be readily available for both training and held-out languages.
Let t_ℓ ∈ [0, 1]^f be a vector of f typological features for language ℓ ∈ T ∪ E. We reinterpret the conditional language models within the Bayesian framework by estimating their posterior probability p(w | D, t_ℓ). We now consider two possible methods to estimate p(w | t_ℓ). For both of them, we first encode the features through a non-linear transformation f(t_ℓ). The first method, labeled OEST, follows Östling and Tiedemann (2017). Assuming the standard LSTM architecture, where o_t is the output gate and c_t is the memory cell, we modify the equation for the hidden state h_t as follows:

h_t = (o_t ⊙ tanh(c_t)) ⊕ f(t_ℓ)

where ⊙ stands for the Hadamard product and ⊕ for concatenation. In other words, we concatenate the typological features to all the hidden states. Moreover, we experiment with a second variant where the parameters of the LSTM are generated by a hyper-network (i.e., a simple linear layer with weight W ∈ R^{|w|×r}) that transforms f(t_ℓ) into w. This approach, labeled PLAT, is inspired by Platanios et al. (2018), with the difference that they generate parameters for an encoder-decoder architecture for neural machine translation.
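At the level of tensor shapes, the two conditioning schemes can be sketched as follows (all dimensions and names are illustrative stand-ins, not the paper's configuration): OEST appends an encoded typology vector to each hidden state, while PLAT maps it linearly to a full parameter vector.

```python
import numpy as np

rng = np.random.default_rng(2)
hidden, feat, rank, n_params = 8, 5, 4, 100

t_l = rng.random(feat)                     # typological feature vector t_l
V = rng.normal(size=(rank, feat))
encode = lambda t: np.tanh(V @ t)          # non-linear encoder f(t_l)

# OEST: h_t = (o_t * tanh(c_t)) concatenated with f(t_l)
o_t, c_t = rng.random(hidden), rng.random(hidden)
h_t = np.concatenate([o_t * np.tanh(c_t), encode(t_l)])

# PLAT: generate the model parameters w = W f(t_l), W in R^{|w| x r}
W_hyper = rng.normal(size=(n_params, rank))
w = W_hyper @ encode(t_l)
```

The shapes make the memory trade-off visible: the hyper-network weight grows with the full parameter count |w|, which is why the rank r must be kept small.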
On the other hand, we do not consider the conditional model proposed by Sutskever et al. (2014), where f(t_ℓ) would be used to initialize the values of h_0 and c_0. During evaluation, h_t and c_t are never reset at sentence boundaries, so this model would find itself at a disadvantage: it would either have to erase the sequential history cyclically or lose memory of the typological features.

Experimental Setup
Data The source for our textual data is the Bible corpus⁷ (Christodouloupoulos and Steedman, 2015).⁸ We exclude languages that are not written in the Latin script, as well as duplicate languages, resulting in a sample of 77 languages.⁹ Since not all translations cover the entire Bible, they vary in size. The text from each language is split into training, development, and evaluation sets (80-10-10 percent, respectively). Moreover, to perform MAP inference in the few-shot setting, we randomly sample 100 sentences from the training set of each held-out language.
We obtain the typological feature vectors from URIEL (Littell et al., 2017).¹⁰ We include the features related to 3 levels of linguistic structure, for a total of 245 features: i) syntax, e.g. whether the subject tends to precede the object; these originate from the World Atlas of Language Structures (Dryer and Haspelmath, 2013) and the Syntactic Structures of the World's Languages (Collins and Kayne, 2009); ii) phonology, e.g. whether a language has distinctive tones; iii) phonological inventories, e.g. whether a language possesses the retroflex approximant /ɻ/. Both ii) and iii) were originally collected in PHOIBLE (Moran et al., 2014). Missing values are inferred as a weighted average of the 10 nearest-neighbor languages in terms of family, geography, and typology.

Language Model We implement the LSTM following the best practices and hyper-parameter settings indicated by Merity et al. (2018b,a). Specifically, we optimize the neural weights with Adam (Kingma and Ba, 2014) and a non-monotonically decayed learning rate: its value is initialized as 10⁻⁴ and decreases by a factor of 10 every third of the total epochs. The maximum number of epochs amounts to 6 for training on D_T, with early stopping based on development set performance, and to 25 for few-shot learning on D_ℓ, ℓ ∈ E.
For each iteration, we sample a language proportionally to the amount of its data, p(ℓ) ∝ |D_ℓ|, in order not to exhaust examples from resource-lean languages in the early phase of training. Then, we sample without replacement from D_ℓ a mini-batch of 128 sequences with a variable maximum sequence length.¹¹ This length is sampled from a distribution m ∼ N(µ = 125, σ = 5).¹² Each epoch ends when all the data sequences have been sampled.
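The sampling scheme above can be sketched in a few lines (the corpus sizes here are made up for illustration; the real ones come from the Bible translations):

```python
import random

random.seed(0)
corpus_sizes = {"eng": 800_000, "vie": 600_000, "dop": 200_000}  # illustrative

def sample_language(sizes):
    # pick a language with probability proportional to its data size
    langs, weights = zip(*sizes.items())
    return random.choices(langs, weights=weights, k=1)[0]

def sample_max_length(mu=125, sigma=5):
    # draw the batch's maximum sequence length from N(125, 5)
    return max(1, round(random.gauss(mu, sigma)))

lang = sample_language(corpus_sizes)
max_len = sample_max_length()
```

Sampling the length rather than fixing it means batches cut the text at varying points, so the model does not always see the same sentence-position boundaries.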
We apply several techniques of dropout for regularization, including variational dropout (Gal and Ghahramani, 2016), which applies an identical mask to all the time steps, with p = 0.1 for character embeddings and intermediate hidden states and p = 0.4 for the output hidden states. Drop-Connect (Wan et al., 2013) is applied to the model parameters U of the first hidden layer with p = 0.2.
Following Merity et al. (2018b), the underlying language model architecture consists of 3 hidden layers with 1,840 hidden units each. The dimensionality of the character embeddings is 400. We tie input and output embeddings following Merity et al. (2018a). For conditional language models, the dimensionality of f(t_ℓ) is set to 115 for the OEST method based on concatenation (Östling and Tiedemann, 2017), and to 4 (due to memory limitations) for the PLAT method based on hyper-networks (Platanios et al., 2018). For the regularizer in eq. (15), we perform grid search over the hyper-parameter λ: we finally select a value of 10⁵ for UNIV and 10⁻⁵ for NINF.

Regimes of Data Paucity
We explore different regimes of data paucity for the held-out languages: • ZERO-SHOT transfer setting: we split the sample of 77 languages into 4 partitions. The languages in each subset are held out in turn, and we use their test set for evaluation. 13 For each subset, we further randomly choose 5 languages whose development set is used for validation. The training set of the rest of the languages is used to estimate a prior over network parameters via the Laplace approximation.
• FEW-SHOT transfer setting: on top of the zeroshot setting, we use the prior to perform MAP inference over a small sample (100 sentences) from the training set of each held-out language.
• JOINT multilingual setting: the data includes the full training set for all 77 languages, including held-out languages. This serves as a ceiling for the model performance in cross-lingual transfer.

Results and Analysis
The results for our experiments are grouped in Table 1 for the ZERO-SHOT regime, in Table 3 for the FEW-SHOT regime, and in Table 2 for the JOINT multilingual regime, which constitutes a ceiling for cross-lingual transfer performance. The scores represent Bits Per Character (BPC; Graves, 2013): this metric is the negative log-likelihood of the test data (in nats) divided by ln 2 and by the number of characters. We compare the results along the following dimensions:

Informativeness of Prior Our main result is that the UNIV prior consistently outperforms the NINF prior across the board, and by a large margin, in both the ZERO-SHOT and FEW-SHOT settings. The scores of the most naïve baseline, ZERO-SHOT NINF BARE, are considerably worse than those of both ZERO-SHOT UNIV models: this suggests that the transfer of information on character sequences is meaningful. The lowest BPC reductions are observed for languages like Vietnamese (15.94% error reduction) or Highland Chinantec (19.28%), where character inventories differ the most from other languages. Moreover, the ZERO-SHOT UNIV models are on a par with or better than even the FEW-SHOT NINF models. In other words, the most helpful supervision comes from a universal prior rather than from a small in-language sample of sentences. This demonstrates that the UNIV prior is truly imbued with universal linguistic knowledge that facilitates learning of previously unseen languages.
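For reference, the metric and the error-reduction percentages quoted above are computed as follows (the numbers in the example are invented, chosen only so the arithmetic is easy to check):

```python
import math

def bits_per_character(total_nll_nats, num_chars):
    # BPC: average negative log-likelihood per character, converted to bits
    return total_nll_nats / (num_chars * math.log(2))

def error_reduction(bpc_base, bpc_new):
    # relative BPC reduction (%), as in the per-language figures above
    return 100.0 * (bpc_base - bpc_new) / bpc_base

bpc = bits_per_character(total_nll_nats=6931.47, num_chars=5000)  # ~2.0 bits/char
```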
The average BPC score for the other baseline without a prior, FITU (fine-tuning), is 3.007 in the FEW-SHOT OEST configuration, to be compared with 2.731 BPC for UNIV. Note that fine-tuning is an extremely competitive baseline, as it lies at the core of most state-of-the-art NLP models (Peters et al., 2019). Hence, this result demonstrates the usefulness of Bayesian inference in transfer learning.

Conditioning on Typological Information
Another important result is that conditioning language models on typological features yields opposite effects in the ZERO-SHOT and FEW-SHOT settings. Comparing the columns of the BARE and OEST models in Table 1 reveals that the non-conditional baseline BARE is superior for 71 of 77 languages (the exceptions being Chamorro, Croatian, Italian, Swazi, Swedish, and Tuareg). On the other hand, the same columns in Table 3 and Table 2 reveal an opposite pattern: OEST outperforms the BARE baseline in 70 of 77 languages. Finally, OEST surpasses the BARE baseline in the JOINT setting for 76 of 77 languages (save Q'eqchi').
We also take into consideration an alternative conditioning method, namely PLAT. For clarity's sake, we exclude this batch of results from Table 1 and Table 3, as this method proves consistently worse than OEST. In fact, the average BPC of PLAT amounts to 5.479 in the ZERO-SHOT setting and 3.251 in the FEW-SHOT setting, to be compared with 4.691 and 2.731 for OEST, respectively.
The possible explanation behind the mixed evidence on the success of typological features points to some intrinsic flaws of typological databases. Ponti et al. (2019) have shown that their feature granularity may be too coarse to interface with data-driven probabilistic models, and that inferring missing values, necessary due to the limited coverage of features, results in additional noise. As a result, language models seem to be damaged by typological features in the absence of data, whereas they benefit from their guidance when at least a small sample of sentences is available, as in the FEW-SHOT setting.
Data Paucity Different regimes of data paucity display uneven levels of performance. The best models for each setting (ZERO-SHOT UNIV BARE, FEW-SHOT UNIV OEST, and JOINT OEST) reveal large gaps between their average scores. Hence, in-language supervision remains the best option when available: transferred language models always lag behind their supervised equivalents.

Related Work
LSTMs have been probed for their inductive bias towards syntactic dependencies (Linzen et al., 2016) and grammaticality judgments (Marvin and Linzen, 2018;Warstadt et al., 2019). Ravfogel et al. (2019) have extended the scope of this analysis to typologically different languages through synthetic variations of English. In this work, we aim to model the inductive bias explicitly by constructing a prior over the space of neural network parameters. Few-shot word-level language modeling for truly under-resourced languages such as Yongning Na has been investigated by Adams et al. (2017) with the aid of a bilingual lexicon. Vinyals et al. (2016) and Munkhdalai and Trischler (2018) proposed novel architectures (Matching Networks and LSTMs augmented with Hebbian Fast Weights, respectively) for rapid associative learning in English, and evaluated them in few-shot cloze tests. In this respect, our work is novel in pushing the problem to its most complex formulation, zero-shot inference, and in taking into account the largest sample of languages for language modeling to date.
In addition to those considered in our work, there are also alternative methods to condition language models on features. Kalchbrenner and Blunsom (2013) used encoded features as additional biases in recurrent layers. Kiros et al. (2014) put forth a log-bilinear model that allows for a 'multiplicative interaction' between hidden representations and input features (such as images). With a similar device, but a different gating method, Tsvetkov et al. (2016) trained a phoneme-level joint multilingual model of words conditioned on typological features from Moran et al. (2014).
The use of the Laplace method for neural transfer learning has been proposed by Kirkpatrick et al. (2017), inspired by synaptic consolidation in neuroscience, with the aim to avoid catastrophic forgetting. Kochurov et al. (2018) tackled the problem of continuous learning by approximating the posterior probabilities through stochastic variational inference. Ritter et al. (2018) substitute diagonal Laplace approximation with a Kronecker factored method, leading to better uncertainty estimates. Finally, the regularizer proposed by Duong et al. (2015) for cross-lingual dependency parsing can be interpreted as a prior for MAP estimation where the covariance is an identity matrix.

Conclusions
In this work, we proposed a Bayesian approach to transfer language models cross-lingually. We created a universal prior over neural network weights that is capable of generalizing well to new languages suffering from data paucity. The prior was constructed as the posterior of the weights given the data from available training languages, inferred via the Laplace method. Based on the results of character-level language modeling on a sample of 77 languages, we demonstrated the superiority of this prior imbued with universal linguistic knowledge over uninformative priors and unnormalizable priors (i.e., the widespread fine-tuning approach) in both zero-shot and few-shot settings. Moreover, we showed that adding language-specific side information drawn from typological databases to the universal prior further increases the levels of performance in the few-shot regime. While cross-lingual transfer still lags behind supervised learning when sufficient in-language data are available, our work is a step towards bridging this gap in the future.

A Character Distribution
Even within the same setting, BPC scores vary enormously across languages in both the ZERO-SHOT and FEW-SHOT settings, which requires an explanation. Similarly to Gerz et al. (2018a,b), we run a correlation analysis between language modeling performance and basic statistics of the data. In particular, we first create a vector of unigram character counts for each language, shown in Fig. 1. Then we estimate the cosine distance between the vector of each language and the average of all the others in our sample. This cosine distance is a measure of the 'exoticness' of a language's character distribution.
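The 'exoticness' measure can be sketched as follows (the three toy "languages" and the tiny alphabet are illustrative; the real analysis uses the full corpus and character set):

```python
import numpy as np

def char_counts(text, alphabet):
    # unigram character-count vector for one language
    return np.array([text.count(c) for c in alphabet], dtype=float)

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

alphabet = list("abcdeno ")
texts = {"L1": "a bed and an ocean",
         "L2": "a bad bean con dado",
         "L3": "ooo nnn eee"}          # deliberately unusual distribution
vecs = {l: char_counts(t, alphabet) for l, t in texts.items()}

def exoticness(lang):
    # cosine distance to the mean count vector of all the OTHER languages
    others = [v for l, v in vecs.items() if l != lang]
    return cosine_distance(vecs[lang], np.mean(others, axis=0))
```

By construction, the language with the skewed character distribution (L3 here) scores higher than the two similar ones, mirroring the correlation reported below for the ZERO-SHOT setting.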
Pearson's correlation between such cosine distance and the perplexity of UNIV BARE in each language reveals a strong correlation coefficient ρ = 0.53, statistically significant at p < 10⁻⁶, in the ZERO-SHOT setting. On the other hand, such correlation is absent (ρ = −0.13) and insignificant (p > 0.2) in the FEW-SHOT setting. In other words, if a few examples of character sequences are provided for a target language, language modeling performance ceases to depend on its unigram character distribution.

B Probing of Learned Posteriors
Finally, it remains to establish which sort of knowledge is embedded in the universal prior. How to probe a probability distribution over weights in the non-conditional UNIV BARE language model? First, we study the signal-to-noise ratio of each parameter w_i, computed as |µ_i| / σ_i, in each of the 4 splits. Intuitively, this metric quantifies the 'informativeness' of each parameter, which is proportional to both the absolute value of the mean and the inverse standard deviation of the estimate. The probability density function of the signal-to-noise ratio is shown in Fig. 2. From this plot, it emerges that the estimated uncertainty is generally low (small σ_i denominators yield high values). Most crucially, the signal-to-noise values concentrate on the left of the spectrum. This means that most weights will not incur any penalty for changing during few-shot learning based on eq. (15); on the other hand, there is a bulk of highly informative parameters on the right of the spectrum that are very likely to remain fixed, thus preventing catastrophic forgetting. All splits display such a pattern, although somewhat shifted.
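The probing metric itself is a one-liner over the diagonal Laplace posterior (the means and variances below are invented for illustration):

```python
import numpy as np

# Per-parameter signal-to-noise ratio |mu_i| / sigma_i, where mu is the
# posterior mean w* and sigma comes from the diagonal of -H^{-1}.
mu = np.array([0.01, -0.02, 1.5, -2.0])     # illustrative posterior means
var = np.array([0.10, 0.20, 0.001, 0.002])  # illustrative diagonal variances
snr = np.abs(mu) / np.sqrt(var)

# High-SNR parameters are heavily penalized for moving during few-shot
# adaptation; low-SNR parameters are effectively free to change.
informative = snr > 10.0  # threshold chosen only for this toy example
```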
Second, to study the effect of conditioning the universal prior on typological features, we generate random sequences of 25 characters from the learned prior in each language. The first character is chosen uniformly at random, and the subsequent ones are sampled from the distribution given by eq. (1) with a temperature of 1. The resulting texts are shown in Table 4. Although this would warrant a more thorough and systematic analysis, from a cursory view it is evident that the sequences abide by universal phonological patterns, e.g. favoring vowels as syllabic nuclei and ordering consonants according to the sonority hierarchy. Moreover, the language-specific information clearly steers predicted sequences towards the correct inventory of characters, as demonstrated by Vietnamese (VIE) and Lukpa (DOP) in Table 4.
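The generation procedure can be sketched as follows; random logits stand in for the model's actual output at each step, so only the sampling mechanics (uniform first character, temperature-scaled softmax thereafter) are faithful to the description above.

```python
import numpy as np

rng = np.random.default_rng(3)
alphabet = list("aeioubcdfg ")  # illustrative character inventory

def sample_next(logits, temperature=1.0):
    # temperature-scaled softmax sampling over the alphabet
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p)

def generate(length=25):
    chars = [alphabet[rng.integers(len(alphabet))]]  # uniform first character
    for _ in range(length - 1):
        logits = rng.normal(size=len(alphabet))      # stand-in for LSTM output
        chars.append(alphabet[sample_next(logits)])
    return "".join(chars)

text = generate()
```

With the real model, the logits at each step would come from the LSTM conditioned on the characters generated so far (and, in the conditional variants, on f(t_ℓ)).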

D Derivation of the Approximated Hessian
We assume w ∼ N(0, σ²I). Given the relationship among the expected Fisher information I(w), the observed Fisher information J(w), the observed Fisher information based on |D| samples J_D(w), and the Hessian H, namely

I(w) = −E[∇² log p(x | w)],    J_D(w) = (1/|D|) Σ_{x∈D} ∇ log p(x | w) ∇ log p(x | w)ᵀ ≈ I(w),

we can derive our approximation of (1/|D|) H:

(1/|D|) H = (1/|D|) ∇² log p(w*) + (1/|D|) Σ_{ℓ∈T} Σ_{x∈D_ℓ} ∇² log p(x | w*)
          ≈ −(1/(|D|σ²)) I − J_D(w*)
          ≈ −(1/(|D|σ²)) I − diag(f)