Working Hard or Hardly Working: Challenges of Integrating Typology into Neural Dependency Parsers

This paper explores the task of leveraging typology in the context of cross-lingual dependency parsing. While this linguistic information has shown great promise in pre-neural parsing, results for neural architectures have been mixed. The aim of our investigation is to better understand this state of the art. Our main findings are as follows: 1) The benefit of typological information is derived from coarsely grouping languages into syntactically-homogeneous clusters rather than from learning to leverage variations along individual typological dimensions in a compositional manner; 2) Typology consistent with the actual corpus statistics yields better transfer performance; 3) Typological similarity is only a rough proxy of cross-lingual transferability with respect to parsing.


Introduction
Over the last decade, dependency parsers for resource-rich languages have steadily continued to improve. In parallel, significant research efforts have been dedicated towards advancing cross-lingual parsing. This direction seeks to capitalize on existing annotations in resource-rich languages by transferring them to the rest of the world's over 7,000 languages (Bender, 2011). The NLP community has devoted substantial resources towards this goal, such as the creation of universal annotation schemas and the expansion of existing treebanks to diverse language families. Nevertheless, cross-lingual transfer gains remain modest when put in perspective: the performance of transfer models can often be exceeded using only a handful of annotated sentences in the target language (Section 5). The considerable divergence of language structures proves challenging for current models.
One promising direction for handling these divergences is linguistic typology. Linguistic typology classifies languages according to their structural and functional features. By explicitly highlighting specific similarities and differences in languages' syntactic structures, typology holds great potential for facilitating cross-lingual transfer (O'Horan et al., 2016). Indeed, non-neural parsing approaches have already demonstrated empirical benefits of typology-aware models (Naseem et al., 2012; Täckström et al., 2013; Zhang and Barzilay, 2015). While adding discrete typological attributes is straightforward for traditional feature-based approaches, finding an effective implementation for modern neural parsers remains an open question. Not surprisingly, the reported results have been mixed. For instance, Ammar et al. (2016) found no benefit to using typology for parsing with a neural-based model, while Wang and Eisner (2018) and Scholivet et al. (2019) did in several cases.
There are many possible hypotheses that attempt to explain this state of the art. Might neural models already implicitly learn typological information on their own? Is the hand-specified typology information sufficiently accurate, or provided at the right granularity, to always be useful? How do cross-lingual parsers use, or ignore, typology when making predictions? Without answers to these questions, it is difficult to develop a principled way of robustly incorporating linguistic knowledge as an inductive bias for cross-lingual transfer.
In this paper, we explore these questions in the context of two predominantly-used typology-based neural architectures for delexicalized dependency parsing. The first method implements a variant of selective sharing (Naseem et al., 2012); the second adds typological information as an additional feature of the input sentence. Both models are built on top of the popular Biaffine Parser (Dozat and Manning, 2017). We study model performance across multiple forms of typological representation and resolution.
Our key findings are as follows:
• Typology as Quantization: Cross-lingual parsers use typology to coarsely group languages into syntactically-homogeneous clusters, yet fail to significantly capture finer distinctions or typological feature compositions. Our results indicate that they primarily take advantage of the simple geometry of the typological space (e.g. language distances), rather than specific variations in individual typological dimensions (e.g. SV vs. VS).
• Typology Quality: Typology that is consistent with the actual corpus statistics results in better transfer performance, most likely because it better reflects the typological variations within that sample. Typology granularity also matters. Finer-grained, high-dimensional representations prove harder to use robustly.
• Typology vs. Parser Transferability: Typological similarity only partially explains cross-lingual transferability with respect to parsing. The geometry of the typological space does not fully mirror that of the "parsing" space, and therefore requires task-specific refinement.

Typology Representations
Linguistic Typology, $T_L$: The standard representation of typology is a set of annotations by linguists for a variety of language-level properties. These properties can be found in online databases such as The World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013). We consider the same subset of features related to word order as used by Naseem et al. (2012), represented as a k-hot vector $T \in \{0, 1\}^{\sum_f |V_f|}$, where $V_f$ is the set of values feature $f$ may take.
Liu Directionalities, $T_D$: Liu (2010) proposed using a real-valued vector $T \in [0, 1]^r$ of the average directionalities of each of a corpus' $r$ dependency relations as a typological descriptor. These serve as a more fine-grained alternative to linguistic typology. Compared to WALS, there are rarely missing values, and the degree of dominance of each dependency ordering is directly encoded, potentially allowing for better modeling of local variance within a language. It is important to note, however, that true directionalities require a parsed corpus to be derived; thus, they are not a realistic option for cross-lingual parsing in practice. Nevertheless, we include them for completeness.
Surface Statistics, $T_S$: It is possible to derive a proxy measure of typology from part-of-speech tag sequences alone. Wang and Eisner (2017) found surface statistics to be highly predictive of language typology, while Wang and Eisner (2018) replaced typological features entirely with surface statistics in their augmented dependency parser. Surface statistics have the advantage of being readily available and are not restricted to narrow linguistic definitions, but are less informed by the true underlying structure. We compute the set of hand-engineered features used in Wang and Eisner (2018), yielding a real-valued vector $T \in [0, 1]^{2380}$.
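To make the directionality descriptor concrete, the following is a minimal sketch of how a vector in the style of $T_D$ could be computed from a parsed corpus. The input format, the function name `liu_directionalities`, the convention that a value of 1 means the head always precedes the dependent, and the 0.5 default for unseen relations are all illustrative assumptions, not details taken from Liu (2010) or from our implementation.

```python
from collections import defaultdict


def liu_directionalities(treebank, relations):
    """Per-relation fraction of arcs whose dependent follows its head.

    Assumed input format (hypothetical, for illustration): `treebank` is a
    list of sentences, each a list of (dependent_pos, head_pos, relation)
    triples with 1-based positions and head_pos == 0 for the root.
    """
    head_first = defaultdict(int)  # arcs where the head precedes the dependent
    total = defaultdict(int)
    for sentence in treebank:
        for dep_pos, head_pos, rel in sentence:
            if head_pos == 0:  # the root arc has no direction
                continue
            total[rel] += 1
            if dep_pos > head_pos:
                head_first[rel] += 1
    # Unseen relations get 0.5 (no evidence either way); an illustrative choice.
    return [head_first[r] / total[r] if total[r] else 0.5 for r in relations]


toy_treebank = [
    [(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "dobj")],  # subject-verb-object
    [(1, 0, "root"), (2, 1, "nsubj")],                  # verb-subject
]
print(liu_directionalities(toy_treebank, ["nsubj", "dobj", "amod"]))
```

In this toy corpus the subject appears once before and once after its verb, so its entry sits halfway between the two orderings, which is exactly the degree-of-dominance information that a categorical WALS value cannot express.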

Parsing Architecture
We use the graph-based Deep Biaffine Attention neural parser of Dozat and Manning (2017) as our baseline model. Given a delexicalized sentence $s$ consisting of $n$ part-of-speech tags, the Biaffine Parser embeds each tag $p_i$, and encodes the sequence with a bi-directional LSTM to produce tag-level contextual representations $h_i$. Each $h_i$ is then mapped into head- and child-specific representations for arc and relation prediction, $h_i^{\text{arc-dep}}$, $h_i^{\text{arc-head}}$, $h_i^{\text{rel-dep}}$, and $h_i^{\text{rel-head}}$, using four separate multi-layer perceptrons.
For decoding, arc scores are computed as:
$$s^{\text{arc}}_{ij} = (h_i^{\text{arc-head}})^\top \left(U^{\text{arc}} h_j^{\text{arc-dep}} + b^{\text{arc}}\right)$$
while the score for dependency label $r$ for edge $(i, j)$ is computed in a similar fashion:
$$s^{\text{rel}}_{(i,j),r} = (h_i^{\text{rel-head}})^\top U_r^{\text{rel}} h_j^{\text{rel-dep}} + (u_r^{\text{rel-head}})^\top h_i^{\text{rel-head}} + (u_r^{\text{rel-dep}})^\top h_j^{\text{rel-dep}} + b_r$$
Both $s^{\text{arc}}_{ij}$ and $s^{\text{rel}}_{(i,j),r}$ are trained greedily using cross-entropy loss with the correct head or label. At test time the final tree is composed using the Chu-Liu-Edmonds (CLE) maximum spanning tree algorithm (Chu and Liu, 1965; Edmonds, 1967).
Table 1: … (Wang and Eisner, 2018). $B^*$ and $+T_S^*$ are the baseline and surface statistics model results, respectively, of Wang and Eisner (2018). [4] Fine-tune is the result of adapting our baseline model using only 10 sentences from the target language. All of our reported numbers are the average of three runs with different random seeds. Results with differences that are statistically insignificant compared to the baseline are marked with † (arc-level paired permutation test with p ≥ 0.05).
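The arc-scoring and training steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the standard biaffine form $s_{ij} = h_i^{\text{head}} \cdot (U h_j^{\text{dep}} + b)$ in the style of Dozat and Manning (2017). The function names and the toy dimensions are hypothetical, and a real parser would also handle the root token and batch the computation as matrix products.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def biaffine_arc_scores(head_reps, dep_reps, U, b):
    """Return S where S[i][j] scores token i as the head of token j."""
    scores = []
    for h_head in head_reps:
        row = []
        for h_dep in dep_reps:
            # Project the dependent representation: U h_dep + b.
            projected = [dot(u_row, h_dep) + b_k for u_row, b_k in zip(U, b)]
            row.append(dot(h_head, projected))
        scores.append(row)
    return scores


def head_cross_entropy(scores, dependent, gold_head):
    """Softmax cross-entropy over candidate heads for one dependent."""
    column = [row[dependent] for row in scores]
    shift = max(column)  # subtract the max for numerical stability
    log_norm = shift + math.log(sum(math.exp(s - shift) for s in column))
    return log_norm - column[gold_head]


# Toy example: two tokens with 2-dimensional head and dependent representations.
heads = [[1.0, 0.0], [0.0, 1.0]]
deps = [[1.0, 2.0], [3.0, 4.0]]
S = biaffine_arc_scores(heads, deps, U=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 1.0])
loss = head_cross_entropy(S, dependent=0, gold_head=1)
```

Each dependent is trained independently against its gold head, which is the sense in which training is greedy; tree constraints are only enforced at test time by the spanning tree decoder.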

Typology Augmented Parsing
Selective Sharing: Naseem et al. (2012) introduced the idea of selective sharing in a generative parser, where the features provided to a parser were controlled by its typology. The idea was extended to discriminative models by Täckström et al. (2013). For neural parsers, which do not rely on manually-defined feature templates, there is no explicit way of using selective sharing. Here we choose to directly incorporate selective sharing as a bias term for arc-scoring:
$$s^{\text{arc-aug}}_{ij} = s^{\text{arc}}_{ij} + v^\top f_{ij}$$
where $v$ is a learned weight vector and $f_{ij}$ is a feature vector engineered using Täckström et al.'s head-modifier feature templates (Appendix B).
Input Features: We follow Wang and Eisner (2018) and encode the typology for language $l$ with an MLP, denoted $\Phi$, and concatenate it with each input:
$$h = \text{BiLSTM}\left(\{p_1 \oplus \Phi(T^{(l)}), \ldots, p_n \oplus \Phi(T^{(l)})\}\right)$$
This approach assumes the parser is able to learn to use information in $T^{(l)} \in \{T_L^{(l)}, T_D^{(l)}, T_S^{(l)}\}$ to induce some distinctive change in encoding $h$.
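The two augmentation strategies above can be contrasted with a small sketch. It assumes a single tanh layer as a stand-in for the typology MLP and treats the feature vector $f_{ij}$ as already computed. All function names, layer sizes, and example values are hypothetical and chosen only for illustration.

```python
import math


def encode_typology(typology, weights, bias):
    """One tanh layer standing in for the typology MLP (sizes are arbitrary)."""
    return [
        math.tanh(sum(w * t for w, t in zip(row, typology)) + b)
        for row, b in zip(weights, bias)
    ]


def augment_inputs(tag_embeddings, typology_embedding):
    """Concatenate the same typology embedding onto every tag embedding."""
    return [list(p) + list(typology_embedding) for p in tag_embeddings]


def selective_sharing_score(arc_score, v, f_ij):
    """Add the learned bias v . f_ij to a precomputed arc score."""
    return arc_score + sum(v_k * f_k for v_k, f_k in zip(v, f_ij))


# Input features: every position sees the same language-level vector.
phi = encode_typology([1.0, 0.0], weights=[[0.0, 0.0]], bias=[0.0])
encoder_inputs = augment_inputs([[1.0, 2.0], [3.0, 4.0]], phi)

# Selective sharing: typology only shifts the score of an individual arc.
augmented = selective_sharing_score(1.5, v=[0.5, -1.0], f_ij=[1.0, 0.0])
```

The design difference is where typology enters: the input-feature variant lets the encoder condition every hidden state on the language, while the selective-sharing variant leaves the encoder untouched and only adjusts arc scores through hand-designed templates.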

Experiments
Data: We conduct our analysis on the Universal Dependencies v1.2 dataset (Nivre et al., 2015) and follow the same train-test partitioning of languages as Wang and Eisner (2018). We train on 20 treebanks and evaluate cross-lingual performance on the other 15; test languages are shown in Table 1. [5] We perform hyper-parameter tuning via 5-fold cross-validation on the training languages.
Results: Table 1 presents our cross-lingual transfer results. Our baseline model improves over the benchmark in Wang and Eisner (2018) by more than 6%. As expected, using typology yields mixed results. Selective sharing provides little to no benefit over the baseline. Incorporating the typology vector as an input feature is more effective, with the Liu Directionalities ($T_D$) driving the most measurable improvements, achieving statistically significant gains on 13/15 languages. The Linguistic Typology ($T_L$) gives statistically significant gains on 10/15 languages. Nevertheless, the results are still modest. Fine-tuning on only 10 sentences yields a 2.3× larger average UAS increase, a noteworthy point of reference.
[4] Wang and Eisner (2018)'s final $T_S^*$ also contains additional neural features, which we omitted because we found that including them underperforms using only hand-engineered features.
[5] Two treebanks are excluded from evaluation, following the setting of Wang and Eisner (2018).
Figure 1: t-SNE projection of WALS vectors with clustering. Persian (fa) is an example of a poorly performing language that is also far from its cluster center.
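Significance throughout is assessed with an arc-level paired permutation test. The following is a minimal sketch of such a test, assuming each system is summarized by a 0/1 list marking whether it attached each arc correctly. The function name, the number of trials, the two-sided statistic, and the add-one smoothing of the p-value are illustrative choices rather than a description of our exact procedure.

```python
import random


def paired_permutation_test(correct_a, correct_b, trials=10000, seed=0):
    """Two-sided paired permutation test on per-arc correctness (0 or 1)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable
        # per arc, so each paired difference keeps or flips its sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one smoothing avoids p = 0


# Identical systems can never be distinguished.
p_same = paired_permutation_test([1, 0, 1, 1], [1, 0, 1, 1])
# One system is correct on 20 arcs where the other is wrong on all of them.
p_diff = paired_permutation_test([1] * 20, [0] * 20)
```

A result would be marked with † when the returned p-value is at least 0.05, i.e. when the difference from the baseline cannot be separated from chance relabeling.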

Analysis
Typology as Quantization: Adding simple, discrete language identifiers to the input has been shown to be useful in multi-task multi-lingual settings (Ammar et al., 2016;Johnson et al., 2017).
We hypothesize that the model utilizes typological information for a similar purpose by clustering languages by their parsing behavior. Testing this to the extreme, we encode languages using one-hot representations of their cluster membership. The clusters are computed by applying K-Means to WALS feature vectors (see Figure 1 for an illustration). In this sparse form, compositional aspects of cross-lingual sharing are erased. Performance using this impoverished representation, however, only suffers slightly compared to the original, dropping by just 0.56% UAS overall and achieving statistically significant parity or better with $T_L$ on 7/15 languages. A gap does still partially remain; future work may investigate this further. This phenomenon is also reflected in the performance when the original WALS features are used. Test languages that belong to compact clusters have higher performance on average than those that are isolates (e.g., Persian, Basque). Indeed, from Table 1 and Fig. 1 we observe that the worst performing languages are isolated from their cluster centers. Even though their typology vectors can be viewed as compositions of training languages, the model appears to have limited generalization ability. This suggests that the model does not effectively use individual typological features.
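The quantization experiment above amounts to replacing each typology vector with a one-hot indicator of its cluster. A minimal sketch follows, using a plain Lloyd-style K-Means with a deterministic initialization from the first k points. The initialization, the iteration count, and the toy vectors are assumptions made for illustration; they are not the settings used in our experiments.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: move each point to its nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def one_hot(cluster, k):
    """Cluster membership as the only typological signal given to the parser."""
    return [1 if i == cluster else 0 for i in range(k)]


# Four toy "languages" forming two well-separated groups.
vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
clusters, centers = kmeans(vectors, k=2)
quantized = [one_hot(c, 2) for c in clusters]
```

After quantization, two languages in the same cluster become indistinguishable to the parser, so any remaining benefit cannot come from composing individual typological features.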
This can likely be attributed to the training routine, which poses two inherent difficulties: 1) the parser has few examples (entire languages) to generalize from, which makes learning hard, and 2) a naïve encoder can already implicitly capture important typological features within its hidden state, using only the surface forms of the input. This renders the additional typology features redundant. Table 2 presents the results of probing the final max-pooled output of the BiLSTM encoder for typological features on a sentence level. We find they are nearly linearly separable: logistic regression achieves greater than 90% accuracy on average.

WALS ID    82A  83A  85A  86A  87A  88A
Logreg      87   85   97   92   94   92
Majority    61   56   87   75   51   82

Table 2: Performance of typology prediction using hidden states of the parser's encoder, compared to a majority baseline which predicts the most frequent category.

Wang and Eisner (2018) attempt to address the learning problem by using the synthetic Galactic Dependencies (GD) dataset (Wang and Eisner, 2016) as a form of data augmentation. GD constructs "new" treebanks with novel typological qualities by systematically permuting the behaviors of real languages. Following their work, we add 8,820 GD treebanks synthesized from the 20 UD training languages, giving 8,840 training treebanks in total. Table 3 presents the results of training on this setting. While GD helps the weaker $T_S^*$ substantially, the same gains are not realized for models built on top of our stronger baseline; in fact, the gap narrows even further, as the baseline itself improves by 0.92% UAS overall.

Table 3: Average UAS results when training with Galactic Dependencies. The Linguistic Typology ($T_L^\ddagger$) here is computed directly from the corpora using the rules in Appendix E. All of our reported numbers are the average of three runs.

Typology Quality: The notion of typology is predicated on the idea that some language features are consistent across different language samples, yet in practice this is not always the case. For instance, Arabic is listed in WALS as SV (82A, Subject Verb), yet follows a large number of Verb Subject patterns in UD v1.2. Fig. 2 further demonstrates that for some languages these divergences are significant (see Appendix F for concrete examples). Given this finding, we are interested in measuring the impact this noise has on typology utilization. Empirically, $T_D$, which is consistent with the corpus, performs best. Furthermore, updating our typology features for $T_L$ to match the dominant ordering of the corpus yields a slight improvement of 0.21% UAS overall, with statistically significant gains on 7/15 languages. In addition to the quality of the representation, we can also analyze the impact of its resolution. In theory, a richer, high-dimensional representation of typology may capture subtle variations. In practice, however, we observe the opposite effect, where the Linguistic Typology ($T_L$) and the Liu Directionalities ($T_D$) outperform the surface statistics ($T_S$), with $|T_L| \approx |T_D| \ll |T_S|$. This is likely due to the limited number of languages used for training (though training on GD exhibits the same trend). This suggests that future work may consider using targeted dimensionality reduction mechanisms, optimized for performance.
Typology vs. Parser Transferability: The implicit assumption of all the typology-based methods is that the typological similarity of two languages is a good indicator of their parsing transferability. As a measure of parser transferability, for each language we select the oracle source language which results in the best transfer performance. We then compute precision@k for the nearest k neighbors in the typological space, i.e. whether the best source appears in the k nearest neighbors. As shown in Table 4, we observe that while there is some correlation between the two, they are far from perfectly aligned. $T_D$ has the best alignment, which is consistent with its corresponding best parsing performance. Overall, this divergence motivates the development of approaches that better match the two distributions.

        P@1  P@3  P@5  P@10
T_L      13   33   60    80
T_D      27   67   67    93
T_S      13   27   27    73

Table 4: Precision@k for identifying the best parsing transfer language, for the k typological neighbors.
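The precision@k measure reported in Table 4 can be sketched as follows. The example assumes Euclidean distance between typology vectors and uses made-up language codes and vectors; the distance function, the names, and the data are illustrative assumptions only.

```python
import math


def precision_at_k(typology, sources, best_source, k):
    """Fraction of targets whose oracle source is among the k nearest sources.

    typology:    dict mapping language -> typology vector
    sources:     candidate source (training) languages
    best_source: dict mapping target language -> oracle best source
    """
    hits = 0
    for target, oracle in best_source.items():
        candidates = [s for s in sources if s != target]
        # Rank candidate sources by typological distance to the target.
        candidates.sort(key=lambda s: math.dist(typology[target], typology[s]))
        if oracle in candidates[:k]:
            hits += 1
    return hits / len(best_source)


# Hypothetical languages: "b" is the oracle source for both targets,
# but it is typologically close only to "a".
toy_typology = {"a": [0, 0], "b": [0, 1], "c": [3, 3], "d": [3, 4]}
toy_sources = ["a", "b", "d"]
toy_oracle = {"a": "b", "c": "b"}
p1 = precision_at_k(toy_typology, toy_sources, toy_oracle, k=1)
p3 = precision_at_k(toy_typology, toy_sources, toy_oracle, k=3)
```

In the toy data the target "c" is closest to "d" in typological space even though "b" transfers best, which mirrors the mismatch between typological and parsing neighborhoods discussed above.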

Related Work
Other recent progress in cross-lingual parsing has focused on lexical alignment (Guo et al., 2015, 2016; Schuster et al., 2019). Data augmentation (Wang and Eisner, 2017) is another promising direction, but at the cost of greater training demands. Neither direction directly addresses structure. Ahmad et al. (2019) showed that structural sensitivity is important for modern parsers; insensitive parsers suffer. Data transfer, such as annotation projection (Tiedemann, 2014) and source treebank reordering (Rasooli and Collins, 2019), is an alternative way to alleviate typological divergences. These approaches are typically limited by the availability of parallel data and by imperfect alignments. Our work aims to understand cross-lingual parsing in the context of model transfer, with typology serving as language descriptors, with the goal of eventually addressing the issue of structure.

Conclusion
Realizing the potential for typology may require rethinking current approaches. We can further drive performance by refining typology-based similarities into a metric more representative of actual transfer quality. Ultimately, we would like to design models that can directly leverage typological compositionality for distant languages.