Multilingual Grammar Induction with Continuous Language Identification

The key to multilingual grammar induction is to couple grammar parameters of different languages together by exploiting the similarity between languages. Previous work relies on linguistic phylogenetic knowledge to specify similarity between languages. In this work, we propose a novel universal grammar induction approach that represents language identities with continuous vectors and employs a neural network to predict grammar parameters based on the representation. Without any prior linguistic phylogenetic knowledge, we automatically capture similarity between languages with the vector representations and softly tie the grammar parameters of different languages. In our experiments, we apply our approach to 15 languages across 8 language families and subfamilies in the Universal Dependency Treebank dataset, and we observe substantial performance gain on average over monolingual and multilingual baselines.


Introduction
Human languages bear striking resemblance at the syntactic level in spite of their diversity on the surface, as many studies have revealed (Greenberg, 1963; Hawkins, 2014). This fact provides the basis for multilingual grammar induction, which tries to simultaneously induce grammars of multiple languages. Intuitively, one can couple grammar parameters of different languages with similar typology and learn them simultaneously. However, the lack of measures of language similarity prevents this idea from being further exploited in practice.
Previous work in multilingual grammar induction either does not consider language similarity measures (Iwata et al., 2010) or models language similarity based on linguistic phylogeny (Berg-Kirkpatrick and Klein, 2010). The phylogenetic knowledge, however, could be misleading in measuring language similarity. For example, English and German are both Germanic languages, but English exhibits dominant Subject-Verb-Object (SVO) word order while German does not.
* The first and second authors contributed equally. The third author contributed to this work when at ShanghaiTech University. The fourth author is the corresponding author.
In this paper, we propose a novel approach to multilingual grammar induction. Our induction model represents language identities as continuous vectors (i.e., language embeddings) and employs a neural network to predict the grammar parameters of each language based on its embedding. The neural network parameters are universally shared across languages, which softly ties the grammar parameters of different languages. The language embeddings and the neural network parameters are trained with a standard grammar induction objective without any guidance from prior linguistic phylogenetic knowledge. We also introduce an auxiliary language identification task in which we predict the language identities of input sentences using the language embeddings.
We evaluate our approach on corpora of 15 languages across 8 language families and subfamilies. We observe that our approach achieves substantial performance gain on average over monolingual and multilingual baselines.

Dependency Model with Valence and Other Related Works
Dependency Model with Valence (DMV) (Klein and Manning, 2004) is the best known generative model for dependency grammar induction. The DMV generates a sentence and its dependency tree following three types of grammar rules (ATTACH, DECISION and ROOT). It first samples a token $c$ from the ROOT distribution $P_{\text{ROOT}}(c)$ and then recursively decides whether to generate a new child token and which child token to generate by sampling from the DECISION and ATTACH distributions $P_{\text{DECISION}}(\mathit{dec} \mid h, \mathit{dir}, \mathit{val})$ and $P_{\text{ATTACH}}(c \mid h, \mathit{dir})$, where $\mathit{dir}$ is a binary variable representing the direction of generation (left or right), $\mathit{val}$ is a binary variable representing whether the current head token already has a child in the direction $\mathit{dir}$, $\mathit{dec}$ is a binary variable deciding whether to continue generation in the current direction, $c$ is the child token and $h$ is the head token. Almost all previous methods of multilingual grammar induction are based on DMV. Their focus is on designing various priors to couple DMV parameters across languages: Cohen and Smith (2009) propose a logistic normal prior, while Berg-Kirkpatrick and Klein (2010) design a hierarchical Gaussian prior according to linguistic phylogeny.
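The generative process described above can be illustrated with a minimal sketch. All probabilities below are made-up toy values rather than learned parameters, the tag set is a three-tag assumption, and the ATTACH table is shared by both directions only to keep the example short. The helper names (`p_stop`, `generate`, `linearize`, `sample_sentence`) are hypothetical.

```python
import random

TAGS = ["NOUN", "VERB", "DET"]

# Toy parameters (assumptions for illustration, not learned values).
P_ROOT = {"VERB": 0.7, "NOUN": 0.2, "DET": 0.1}

P_ATTACH = {  # P_ATTACH(c | h, dir); here the same table is used for both directions
    "VERB": {"NOUN": 0.8, "VERB": 0.1, "DET": 0.1},
    "NOUN": {"DET": 0.7, "NOUN": 0.2, "VERB": 0.1},
    "DET": {"NOUN": 0.4, "VERB": 0.3, "DET": 0.3},
}


def p_stop(head, direction, has_child):
    """P_DECISION(dec = stop | h, dir, val); val is True once a child exists."""
    return 0.9 if has_child else 0.7


def sample(dist, rng):
    """Draw one key from a {outcome: probability} table."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def generate(head, rng):
    """Return the subtree rooted at `head` as (tag, left_children, right_children)."""
    children = {}
    for direction in ("left", "right"):
        kids = []
        # DECISION: keep generating children until a stop is sampled.
        while rng.random() >= p_stop(head, direction, bool(kids)):
            child = sample(P_ATTACH[head], rng)  # ATTACH
            kids.append(generate(child, rng))
        children[direction] = kids
    return (head, children["left"], children["right"])


def linearize(tree):
    """Flatten a tree into its sentence (a list of POS tags)."""
    tag, left, right = tree
    out = []
    for sub in reversed(left):  # left children were generated head-outward
        out += linearize(sub)
    out.append(tag)
    for sub in right:
        out += linearize(sub)
    return out


def sample_sentence(seed=0):
    rng = random.Random(seed)
    tree = generate(sample(P_ROOT, rng), rng)  # ROOT
    return tree, linearize(tree)


tree, sentence = sample_sentence(0)
print(sentence)
```

Grammar induction runs this process in reverse: given only the tag sequences, it searches for the ROOT, DECISION and ATTACH parameters that make the observed sentences likely.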
The use of continuous language embeddings has been explored in other tasks. For example, Ammar et al. (2016) and de Lhoneux et al. (2018) apply language embeddings in supervised multilingual dependency parsing.

Approach
We perform unlexicalized grammar induction in which a sentence is represented as a sequence of part-of-speech (POS) tags. We assume that all the languages share the same set of POS tags.

Multilingual Grammar Model
Our multilingual grammar model adopts the NDMV (Jiang et al., 2016), a monolingual model, as the basic component. NDMV predicts grammar rule probabilities using neural networks. In our model, we add a continuous vector representation of the language identity $l$ (i.e., a language embedding) as an additional input to the neural networks in NDMV. Specifically, to predict an ATTACH rule probability $P_{\text{ATTACH}}(c \mid h, \mathit{dir}, \mathit{val}, l)$, we use a multilayer neural network that takes the embeddings of the head token $h$, valence $\mathit{val}$ and language identity $l$ as input, uses a weight matrix $W_{\mathit{dir}}$ specific to the direction $\mathit{dir}$ in the first layer, and uses a weight matrix $W_c$ consisting of all the child POS tag vectors in the softmax output layer. The neural network structure is shown in the left part of Figure 1. We predict the DECISION rule probabilities in a similar way. We store the ROOT rule probabilities directly instead of predicting them, since there are only a small number of such rules. The language embeddings are part of the model parameters and are trained simultaneously with all the other parameters. We hope that after training, similar languages will have similar embeddings and therefore similar grammar rule probabilities.

Figure 1: Model Architecture. The language embedding matrix contains the embeddings of all the languages. $l$ is a one-hot vector and ⊗ means matrix multiplication. Blue bars represent the embeddings of input symbols; brown bars represent the hidden states of the neural networks; green bars represent the logits which are the inputs to the Softmax layers.
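To make the ATTACH network concrete, the following is a minimal forward-pass sketch in plain Python. The dimensions follow the experimental setup reported later in the paper (tag embeddings of size 10, valence and language embeddings of size 5, a first layer mapping 20 inputs to 10 hidden units). The ReLU activation, the random initialization scale, the tag and language lists, and all function names are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)
TAGS = ["NOUN", "VERB", "DET", "ADJ"]   # assumed toy tag set
LANGS = ["en", "de", "fi"]              # assumed toy language set
D_TAG, D_VAL, D_LANG, D_HID = 10, 5, 5, 10


def rand_vec(n):
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


head_emb = {t: rand_vec(D_TAG) for t in TAGS}
val_emb = {v: rand_vec(D_VAL) for v in (0, 1)}
lang_emb = {l: rand_vec(D_LANG) for l in LANGS}   # trained jointly in the paper
# One first-layer matrix per direction, stored as hidden rows x input columns.
W_dir = {d: rand_mat(D_HID, D_TAG + D_VAL + D_LANG) for d in ("left", "right")}
# Output matrix whose rows are the child POS tag vectors.
W_c = [rand_vec(D_HID) for _ in TAGS]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_attach(head, direction, val, lang):
    """Distribution over child tags given head, direction, valence and language."""
    x = head_emb[head] + val_emb[val] + lang_emb[lang]        # concatenation, length 20
    hidden = [max(0.0, a) for a in matvec(W_dir[direction], x)]  # ReLU (assumption)
    return dict(zip(TAGS, softmax(matvec(W_c, hidden))))


print(p_attach("VERB", "left", 0, "en"))
print(p_attach("VERB", "left", 0, "de"))
```

Because every weight except `lang_emb` is shared, two languages differ in their predicted rule probabilities only through their embeddings, which is what ties the grammars together softly.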

Auxiliary Task
To improve the learning of the language embeddings, we introduce an auxiliary language identification task: given an input sentence represented by a sequence of POS tags $\{x_1, x_2, \ldots, x_n\}$, predict its language. We use a standard Bidirectional Long Short-Term Memory (Bi-LSTM) to encode the input sentence, and then a multilayer perceptron to classify the sentence into one of the languages. The weight matrix of the output layer of the multilayer perceptron contains the embeddings of all the languages. The model structure is shown in the right part of Figure 1.
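The key design point here is that the classifier's output weights are the language embeddings themselves. The sketch below shows only that weight tying. The Bi-LSTM and hidden MLP layers are replaced by a deliberately simple stand-in (a mean of tag embeddings followed by one tanh projection), so `encode`, `W_enc` and the toy tag and language lists are hypothetical and not the encoder used in the paper.

```python
import math
import random

rng = random.Random(0)
TAGS = ["NOUN", "VERB", "DET", "ADJ"]   # assumed toy tag set
LANGS = ["en", "de", "fi"]              # assumed toy language set
D_TAG, D_LANG = 10, 5

tag_emb = {t: [rng.gauss(0.0, 0.1) for _ in range(D_TAG)] for t in TAGS}
# Hypothetical projection standing in for the Bi-LSTM and hidden MLP layers.
W_enc = [[rng.gauss(0.0, 0.1) for _ in range(D_TAG)] for _ in range(D_LANG)]
# The same embeddings that the grammar model takes as input.
lang_emb = {l: [rng.gauss(0.0, 0.1) for _ in range(D_LANG)] for l in LANGS}


def encode(sentence):
    """Map a POS tag sequence to a vector with the language-embedding dimension."""
    mean = [sum(tag_emb[t][i] for t in sentence) / len(sentence) for i in range(D_TAG)]
    return [math.tanh(sum(w * x for w, x in zip(row, mean))) for row in W_enc]


def p_language(sentence):
    """Softmax over dot products between the sentence vector and each language embedding."""
    s = encode(sentence)
    logits = [sum(a * b for a, b in zip(lang_emb[l], s)) for l in LANGS]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {l: e / total for l, e in zip(LANGS, exps)}


print(p_language(["DET", "NOUN", "VERB"]))
```

With this tying, a gradient step that improves language identification moves an embedding toward the sentence vectors of its own language, so languages with similar tag sequences tend to be pulled toward similar embeddings.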

Training
Denote the set of model parameters as Θ, the set of languages as L, the set of grammars of different languages as $G = \{G_l, l \in L\}$, and the training corpus as a set of pairs $(x^{(i)}, l^{(i)})$, where $x^{(i)}$ is the i-th training sentence and $l^{(i)}$ is its language identity. Our training objective function L(Θ) combines two conditional probabilities for each training sentence $x^{(i)}$: $P(x^{(i)} \mid G_{l^{(i)}})$, the probability of the training sentence $x^{(i)}$ being generated from the corresponding grammar $G_{l^{(i)}}$; and $P(l^{(i)} \mid x^{(i)})$, the probability of correct language identification of $x^{(i)}$:

$$L(\Theta) = \sum_{i} \left[ \log P(x^{(i)} \mid G_{l^{(i)}}) + \lambda \log P(l^{(i)} \mid x^{(i)}) \right]$$

where λ is a hyper-parameter and is set to 1 by default. We follow the approach in NDMV to optimize the first term and use Adam (Kingma and Ba, 2014) to optimize the second term.
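As a small worked example of how the two terms combine, the sketch below assumes the objective is a sum over training sentences of the log-likelihood under the grammar plus λ times the log-probability of the correct language. The function name and the probability values are hypothetical toy inputs, not results from the paper.

```python
import math


def objective(examples, lam=1.0):
    """Sum of log P(x | G_l) + lam * log P(l | x) over (p_sentence, p_language) pairs."""
    return sum(math.log(p_x) + lam * math.log(p_l) for p_x, p_l in examples)


# Toy values: (probability of the sentence under its grammar,
#              probability assigned to its correct language).
toy = [(1e-6, 0.9), (2e-5, 0.6)]

print(objective(toy))           # both terms, lam = 1 (the default in the paper)
print(objective(toy, lam=0.0))  # grammar likelihood only
```

Setting `lam=0.0` recovers a pure grammar induction objective, which corresponds to training the grammar model without the auxiliary task.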
Note that the language identification model is only used during training to improve the learning of language embeddings. During testing, we run the multilingual grammar model to predict grammar rule probabilities without the need to invoke the language identification model.

Setup
We selected 15 languages across 8 language families and subfamilies to ensure diversity. To enable comparisons with previous state-of-the-art approaches (Jiang et al., 2017; Li et al., 2019), we conducted our experiments on UD Treebank 1.4. For each language, we show its language family and the training corpus size in Table 1. We trained our method on the training sentences with length ≤ 15 and tested it on the testing sentences with length ≤ 40 after removing all punctuation. Since we are doing unsupervised learning, gold dependency trees were not used during training. We use the directed dependency accuracy (DDA) as the evaluation metric, i.e., the percentage of words in the testing dataset that are assigned the correct head, which is the same as the unlabeled attachment score normally used in supervised parsing. We report the average DDA of 5 runs for each experiment. All the parameters of the neural networks, including the language embeddings, were randomly initialized and trained with learning rate 0.001 and minibatch size 1000 for 50 epochs. The dimension of the head token embedding and the child token embedding is set to 10. The shape of the weight matrix $W_{\mathit{dir}}$ is 20 × 10. The dimension of the valence embedding and language identity embedding is set to 5. For the auxiliary language identification task, we use a Bi-LSTM with hidden vector dimension of 10.¹
¹ Our code is available at https://github.com/WinnieHAN/mndmv.git.
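The DDA metric defined above is simple to compute. The sketch below assumes each sentence is given as a list of head indices, one per word, with 0 denoting the root; this encoding and the function name are illustrative choices rather than the paper's evaluation script.

```python
def dda(gold_heads, pred_heads):
    """Directed dependency accuracy in percent.

    Both arguments are lists of sentences; each sentence is a list holding the
    head index of every word (0 = root, otherwise a 1-based word position).
    """
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total


gold = [[2, 0, 2], [0, 1]]
pred = [[2, 0, 1], [0, 1]]
print(dda(gold, pred))  # 4 of 5 heads are correct -> 80.0
```

Note that the score is pooled over all words in the test set, so longer sentences contribute more than shorter ones.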

Results
We first compare our method with two baseline methods, DMV and NDMV, which are similar to our method. We evaluate the baselines in both monolingual and multilingual settings. For the monolingual setting we trained the baseline models on each language independently. For the multilingual setting we trained them on the combined training data of all the 15 languages and tested on one of the languages. Table 2 shows the experimental results. It can be seen that our multilingual grammar model (G) performs better on average than all the baselines. The improvement becomes more significant when our model is jointly trained with the auxiliary language identification task (G+I). Note that our approach performs worse than the monolingual baseline on some languages, and we speculate that this is partly caused by data imbalance. In particular, Hindi, the worst-performing language, has only 4997 training sentences, far fewer than the average of 7532. It would be interesting to make training more balanced by assigning weights to training samples of different languages, which we leave for future work.
To measure the statistical significance of the advantage of our method, we performed the nonparametric Friedman test on the null hypothesis that there is no difference between the G+I model and the NDMV model in the multilingual setting. Based on the per-language results above, the p-value of $7.8911 \times 10^{-4}$ leads to rejection of the null hypothesis at the 0.05 significance level, which indicates that our performance gain is statistically significant.
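For readers unfamiliar with the test, the following sketch computes the Friedman statistic for the special case of two models compared over several languages, assuming no tied scores. With two models the statistic has one degree of freedom, so the chi-square tail probability can be written with `math.erfc`. The scores below are invented toy numbers and are not taken from Table 2; the function name is hypothetical.

```python
import math


def friedman_two_models(scores_a, scores_b):
    """Friedman test for k = 2 models over n blocks (languages); assumes no ties.

    Returns the test statistic Q and its p-value under chi-square with 1 degree
    of freedom.
    """
    n, k = len(scores_a), 2
    # Rank the two models within each block: the higher score gets rank 1.
    rank_sum_a = sum(1 if a > b else 2 for a, b in zip(scores_a, scores_b))
    rank_sum_b = 3 * n - rank_sum_a
    mean_ranks = (rank_sum_a / n, rank_sum_b / n)
    q = 12 * n / (k * (k + 1)) * sum((r - (k + 1) / 2) ** 2 for r in mean_ranks)
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(q / 2))
    return q, p


# Toy DDA scores for five languages (illustrative only).
model_a = [50.0, 42.0, 61.0, 38.0, 55.0]
model_b = [47.0, 40.0, 58.0, 36.0, 51.0]
q, p = friedman_two_models(model_a, model_b)
print(q, p)
```

Because the test uses only within-language ranks, it depends on how consistently one model beats the other across languages and not on the size of each margin.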
In Table 3 we compare our method with recent state-of-the-art approaches on the UD Treebank dataset: Convex-MST (Grave and Elhadad, 2015), LC-DMV (Noji et al., 2016) and D-J (Jiang et al., 2017). For the three approaches we use the results reported by Jiang et al. (2017). Our G+I model performs better than Convex-MST and LC-DMV on average, even though additional priors and delicate biases are integrated into the two methods (e.g., the universal linguistic prior for Convex-MST and the limited center-embedding for LC-DMV). Our method also slightly outperforms D-J on average, even though D-J combines Convex-MST and LC-DMV and therefore utilizes even more linguistic prior knowledge.

Visualization of Language Embeddings
One of our main expectations is that our approach can automatically learn language embeddings that capture similarities in typology between different languages. In order to verify our expectation, we collected the learned language embeddings and visualized them on a 2D plane using the t-SNE algorithm (Van der Maaten and Hinton, 2008). Figure 2 shows the visualization result. It can be seen that in most cases languages in the same language family are close to each other. For example, Finnish is close to Estonian (Finnic languages) and Slovenian is close to Bulgarian (Slavonic languages). It is also interesting to note that some languages, such as English and Norwegian in the Germanic family and Latin in the Romance family, are closer to languages outside of their families than to their family siblings. We attribute this phenomenon to the difference in typology among some in-family languages: the flexible word order in German and Dutch is not shared by English and Norwegian; Latin, on the other hand, seems to share more common typological features with its classical counterpart, ancient Greek, than with its modern phylogenetic relatives in the Romance family. Such differences cannot be inferred from linguistic phylogeny, but our language embeddings have undoubtedly captured them.
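A t-SNE plot is a 2D projection, so a simple complementary check is to look up each language's nearest neighbor directly in the original embedding space. The sketch below does this with Euclidean distance. It is not the procedure used in the paper, and the two-dimensional vectors and the names `lang_a` to `lang_d` are invented placeholders rather than learned embeddings.

```python
import math


def nearest_neighbors(embeddings):
    """Map each language to the other language with the closest embedding."""
    result = {}
    for name, vec in embeddings.items():
        candidates = [
            (math.dist(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        ]
        result[name] = min(candidates)[1]
    return result


# Placeholder embeddings (illustrative only).
toy = {
    "lang_a": [0.1, 0.9],
    "lang_b": [0.2, 0.8],
    "lang_c": [0.9, 0.1],
    "lang_d": [0.8, 0.3],
}
print(nearest_neighbors(toy))
```

Comparing such neighbor lists with family labels gives a numeric counterpart to the visual impression from the plot.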

In-family vs. Cross-family
In order to further examine the effectiveness of our model in coupling grammar parameters between languages regardless of their language families, we design an additional experiment in a bilingual setup. Specifically, for each pair of languages, we tested our approach and the baseline of training DMV on the combined training set. Since in this setting DMV is blind to the language identity, we expect that it would perform poorly if the two languages come from different families and hence are very likely to have large differences. On the other hand, our model would not be as sensitive to the differences between the two languages. In Figure 3, we report the difference between the DDA of our model and that of DMV. It shows that the advantage of our model over DMV is larger for cross-family language pairs than for in-family language pairs, which verifies our expectation.

Conclusion
In this paper, we incorporate continuous language identity representations into multilingual grammar induction, which softly ties grammar parameters from different languages, resulting in substantial performance gain over various baseline methods. Analysis of the language embeddings suggests that our approach may capture information about language similarity beyond linguistic phylogenetic knowledge.
While in this work we follow previous work and perform unlexicalized parsing, the proposed model can be extended for lexicalized parsing by replacing POS tag embeddings with cross-lingual word embeddings, which we leave for future work.