Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes UALing’s approach to the CoNLL 2017 UD Shared Task using corpus selection techniques to reduce training data size. The methodology is simple: we use similarity measures to select a corpus from available training data (even from multiple corpora for surprise languages) and use the resulting corpus to complete the parsing task. Training and parsing are done with the baseline UDPipe system (Straka et al., 2016). While our approach reduces the size of training data significantly, it retains performance within 0.5% of the baseline system. Due to the reduction in training data size, our system runs faster than the naïve, complete-corpus method. Specifically, our system runs in less than 10 minutes, ranking it among the fastest entries for this task. Our system is available at https://github.com/CoNLL-UD-2017/UALING.


Introduction
Universal Dependencies (UD) (Nivre et al., 2016) includes corpora from different languages annotated with identical types of labels. This allows for the examination of different theoretical questions (Schuster and Manning, 2016) and practical applications such as the CoNLL 2017 UD Shared Task (Zeman et al., 2017). 1 The specific practical task presented here involves using these corpora in a supervised learning approach in order to achieve the task's goal: training with the multilingual UD data in order to find dependency relationships not just for these known languages, but also for unknown or little-known languages.
1 http://universaldependencies.org/conll17

Theoretical Concepts
Supervised learning occurs when humans encode their judgment into a set of data, which is in turn used to train statistical models with the ultimate goal of using these models to make accurate predictions for previously unseen datasets, which are often too large (and costly) or otherwise unavailable for humans to judge manually. Building these models of human judgment is necessary in cases where explicit rules are too complex to encode, ambiguous, or not known; rather than explicitly and programmatically encoding rules, supervised learning models "learn" or at least "contain" the rules through models generated from human-judged data. Ideally, the models are used to apply those same rules to the unseen datasets.
The rules contained in human-judged training data are, necessarily, constrained to the domain from which the data derives. Unknown domains, such as unknown languages, are difficult to handle, because the rules from another known domain do not necessarily apply and the rules for the target domain are thus not readily known. Creating new corpora for specific domains (which occurs often for biomedical-domain data, for instance) drastically improves accuracy for the given corpus domain (Clegg and Shepherd, 2005). The UD corpora extend training data across numerous different language-domains. However, adapting currently known data to new domains (including new languages) is a difficult problem, particularly when human judgment is not available to aid in the adaptation.
Similarly, in many cases training data contains rules not relevant or even contradictory to those in unseen data. This occurs both interdomain, where the training data contains data from a domain which is not relevant to the unseen data, and intradomain, where training data from a domain is not relevant to other data in the same domain. 4 This data may not be necessary to building supervised learning models because it does not contain relevant rules; in some circumstances, training data may even introduce rules to the model which contradict rules in the unseen data. 5 Still other training data contains rules of marginal significance to the model, where such rules apply only to an extremely small segment of unseen data.
Without using supervised learning methods which actively adapt the rules of these data types to incoming unseen data, it is possible to (1) improve algorithmic and model performance by removing contradictory-rule training data, (2) improve algorithmic performance without model performance loss by removing irrelevant-rule training data, and (3) improve algorithmic performance without significant performance loss by removing marginal-significance rule data.

Resulting Methodology
For this paper, we introduce several methods of automated corpus refinement in order to improve and at best optimize supervised learning by accomplishing the goals enumerated above. Specifically, we propose and evaluate the use of similarity measures to refine the training data set; these similarity metrics "select" training data of the same domain and data which is closest in linguistic rules to the target, unseen data from available corpora.
4 Both interdomain and intradomain data are considered and dealt with by the method proposed here, though the task focuses on interdomain problems when considering different language domains. Pure corpus compression, as also discussed herein, tends to focus on the intradomain problem.
5 To illustrate, when considering supervised learning approaches to part-of-speech annotation, in the domain of formal scientific literature the word "as" might more often be used as a conjunction (as a synonym of "because") while in journalism it might be more often used as an adverb. Models trained on these different domains would likely produce different outcomes when labeling parts of speech due to these differences.
Using only this similar training data ought to remove data containing contradictory or irrelevant rules. Furthermore, similarity metrics provide an opportunity to scale the included data by including only the most similar data above a threshold, which also has the potential to remove marginal-significance training data. This method allows us to drastically reduce corpus size while retaining only the most similar (and, ideally, best) training data. The overall effect on supervised learning performance depends on how well the employed similarity metric matches underlying rule similarity.
To accomplish this, we create a corpus processing pipeline in Java which calculates similarity and selects data. In this implementation, the development data set for monolingual parsing is considered a feature vector (§3), and similarity (§2) is calculated between this vector and each sentence in the single monolingual training data of the language. We try to find a fixed selection threshold (where sentences above the threshold are kept in the new training dataset) for all languages for monolingual parsing that provides the greatest performance, though performance-per-compression metrics are also valuable in some contexts. The sample data set for surprise languages also constitutes a feature vector (§4), and the program calculates the similarity between this vector and each sentence in the training data of all other languages. We evaluate various selection thresholds to adapt to the under-resourced situation of each surprise language. We use UDPipe 1.1 (Straka et al., 2016) as the baseline system and the UD version 2.0 datasets. Using the proposed method, we are able to reduce the size of training data to 76.25% of the original on average while retaining UD parsing results comparable to the baseline system. For certain languages, the resulting compressed training datasets actually increase accuracy.

Similarity
In our methodology, we employ cosine similarity as our similarity metric. The cosine similarity measure is applied to two latent vectors in two datasets. Let $\cos(d_1, d_2)$ be the cosine similarity, which is calculated as follows:

\[
\cos(d_1, d_2) = \frac{V_{d_1} \cdot V_{d_2}}{\lVert V_{d_1} \rVert \, \lVert V_{d_2} \rVert}
\]

where the two feature vectors $V_{d_1}$ and $V_{d_2}$ are from training and development datasets. Among the 64 languages with training data, 56 provide development data as well. Therefore, we focus on 56 languages for the proposed corpus selection approaches. The entire development data set makes one vector, and then the similarity is calculated between this vector and every sentence in the training data. Various feature vectors are described in §3 and §4 for the monolingual and cross-lingual corpus selection approaches.
For monolingual parsing, we use training and development corpora of the single language set for similarity measurement, extracting the most pertinent training data from the single corpus in order to compress and/or refine it. For cross-lingual parsing when we deal with surprise languages, we use training corpora from all languages, comparing the target language data to all known UD language corpora. This extracts the most similar data from other languages, with the hope that it is also similar in language grammar and structure and, hence, similar in annotation.

[Figure 1: example sentence "Les commotions cérébrales sont devenu si courante dans ce sport qu' on les considére presque comme la routine ."]
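
The similarity-based selection described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Java pipeline: the function names `cosine_similarity` and `select_sentences` are hypothetical, feature vectors are assumed to be sparse counts of feature strings, and an empty vector is assumed to have similarity 0.

```python
import math
from collections import Counter


def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse count vectors (feature -> count)."""
    dot = sum(count * v2[feat] for feat, count in v1.items() if feat in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty vector counts as dissimilar
    return dot / (norm1 * norm2)


def select_sentences(train_features, dev_features, threshold):
    """Return indices of training sentences kept for the refined corpus.

    train_features / dev_features: one list of feature strings per sentence.
    The whole development set is pooled into a single vector, and each
    training sentence is compared against it; sentences strictly above
    the threshold are kept.
    """
    dev_vector = Counter()
    for feats in dev_features:
        dev_vector.update(feats)
    kept = []
    for i, feats in enumerate(train_features):
        if cosine_similarity(Counter(feats), dev_vector) > threshold:
            kept.append(i)
    return kept
```

For instance, with a development set dominated by `DET NOUN VERB`, a training sentence containing that tri-gram is kept at threshold 0.1, while a sentence sharing no features with the development data is dropped.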

Monolingual Corpus Selection Approaches
We use the following features for monolingual parsing:
1. tri-gram POS sequences
2. dependency relationships between two POS labels
Tri-gram POS sequences represent the tri-gram universal POS labels (Petrov et al., 2012). Dependency relationships represent the part-of-speech labels of a dependent and its head (dependee), together with their dependency relation.
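
As a rough sketch of how these two feature types could be extracted from one sentence, consider the Python below. The helper names `pos_trigrams` and `pos_dep_pos` are hypothetical, tokens are assumed to be `(upos, head, deprel)` triples with 1-based head indices as in CoNLL-U, and skipping the root attachment is an assumption rather than something the paper states.

```python
def pos_trigrams(upos_tags):
    """Return tri-gram POS sequences such as 'DET NOUN ADJ'."""
    return [" ".join(upos_tags[i:i + 3]) for i in range(len(upos_tags) - 2)]


def pos_dep_pos(tokens):
    """Return 'DEPENDENT_POS relation HEAD_POS' features.

    tokens: list of (upos, head, deprel); head is a 1-based token index,
    and 0 marks the root.
    """
    feats = []
    for upos, head, deprel in tokens:
        if head == 0:
            continue  # assumption: the root has no head POS, so it is skipped
        head_upos = tokens[head - 1][0]
        feats.append(f"{upos} {deprel} {head_upos}")
    return feats


# First five tokens of the Figure 1 sentence. Only commotions -> devenu (nsubj)
# is taken from the paper; the other heads and relations are illustrative.
tokens = [
    ("DET", 2, "det"),     # Les
    ("NOUN", 5, "nsubj"),  # commotions
    ("ADJ", 2, "amod"),    # cérébrales
    ("AUX", 5, "aux"),     # sont
    ("VERB", 0, "root"),   # devenu
]
trigram_features = pos_trigrams([t[0] for t in tokens])
dependency_features = pos_dep_pos(tokens)
```

Here `trigram_features` begins with `DET NOUN ADJ` and `NOUN ADJ AUX`, and `dependency_features` contains `NOUN nsubj VERB`.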

Feature extraction
Tri-gram POS sequences are extracted from the Universal POS labels of the sentence, such as DET NOUN ADJ, NOUN ADJ AUX, etc. (see Figure 1). Uni-gram and bi-gram POS sequences are excluded because we found them not to be distinctive between the languages with Universal POS labels that we examined. We also extract dependency relationships between POS labels for the similarity measure, such as NOUN nsubj VERB for commotions ... devenu, where commotions/NOUN is dependent on devenu/VERB with the nsubj dependency relation. 6 Figure 2 shows results for different thresholds for tri-gram POS sequences and POS-dep-POS. Using similarity measures to select a subset of the original training data, the proposed method slightly outperforms the results obtained with the original training data set at the similarity threshold θ = 0.1. Specifically, it improves the parsing result by 0.01% and 0.15% while using only 94% and 77% of the original training datasets for German and Dutch, respectively. Table 1 shows our entire results of the corpus selection method for monolingual parsing on the dev datasets using labeled attachment score (LAS) per treebank. We train on the full training datasets, and on datasets trimmed using similarity of tri-gram POS sequences (pos) and POS-dep-POS (dep). We also train monolingual parsing models using the intersection of the two similarity measures (intersection). All results are tested on the dev datasets, excluding the 8 languages which do not provide dev datasets. 7 Table 1 also shows results from the method of length-based corpus selection (length).
6 While current feature selection is based on Universal POS labels, using language-specific POS labels for feature selection is one possible way to extend our approach for the monolingual corpus selection.
7 We also exclude results of ru syntagrus from the table because of internal formatting errors that our corpus selection method produced.
Since the sentence lengths decay after the peak (of the distribution of the numbers of sentences), our length-based approach to reducing the data set is to count the number of sentences before the peak and keep up to that many sentences after the peak. 8 In Table 1, we also indicate the ratio of training data retained; the compression amount is shown on a scale relative to the entire (full) training data set. For example, while grc uses 54.76% of the full training data set for length, its result decreases only by 2.54%. With intersection, grc uses only 83.54% of the data, yet its result improves by 0.41%. We improve parsing results for 33 languages on dev data using the proposed corpus selection method by measuring similarity.
8 This method resulted in some languages having up to 80% of the sentences removed because the peak sentence length was a small number of words. In order to make sure that only the outliers in length are removed, we change the algorithm so that the peak value is between ten and twenty words long. This fixes the problem where languages with a low peak length have a large number of sentences removed.
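
The peak-based length selection leaves some details open, so the following Python sketch is only one possible reading. It assumes that the peak is the most frequent sentence length within the ten-to-twenty-word window, that all sentences up to the peak are kept, and that the longer sentences are admitted shortest first until their number equals the count of sentences shorter than the peak. The function name `length_based_selection` and its parameters are hypothetical.

```python
from collections import Counter


def length_based_selection(sentence_lengths, min_peak=10, max_peak=20):
    """Return sorted indices of sentences kept by peak-based length trimming."""
    if not sentence_lengths:
        return []
    counts = Counter(sentence_lengths)
    # Restrict the peak to [min_peak, max_peak]; fall back to the global
    # mode if no sentence length falls inside that window (assumption).
    candidates = {n: c for n, c in counts.items() if min_peak <= n <= max_peak}
    if not candidates:
        candidates = dict(counts)
    # Most frequent length; ties are broken in favour of the shorter length.
    peak = max(candidates, key=lambda n: (candidates[n], -n))
    n_before = sum(1 for n in sentence_lengths if n < peak)
    keep = {i for i, n in enumerate(sentence_lengths) if n <= peak}
    longer = sorted(
        (i for i, n in enumerate(sentence_lengths) if n > peak),
        key=lambda i: sentence_lengths[i],
    )
    keep.update(longer[:n_before])  # keep up to n_before sentences after the peak
    return sorted(keep)
```

Under this reading, the longest outliers are the first sentences to be dropped, which matches the stated intent of removing only outliers in length.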

Discussion
Besides the features that we presented, we also investigate a length-based approach to select the training data. Instead of using the peak of the distribution of the numbers of sentences as in Table 1, we calculate the simple average number of words per sentence in the dev data set. Then, we obtain the training data set using thresholds of average ± scale for the number of words, where scale = max(|avg − max|, |avg − min|), and max and min are the maximum and the minimum numbers of words of the sentences in the dev data set. Figure 3 shows results and the number of sentences in the trimmed training datasets using length thresholds. We vary scale multipliers from 1 to 5. Filtering based on sentence length can potentially remove unnecessary size from the training data and possibly remove some inaccuracy, assuming that longer sentences increase entropy and become inherently less predictable. However, as Figure 3 indicates, this simple length-based approach cannot keep up with the baseline results. While corpus compression levels are comparable to those of the similarity-based approaches, parsing results drop significantly.

Table 1: Monolingual corpus selection results on dev datasets. The numerical entries are LAS, and the bar indicates corpus compression amount. Length-based is trimmed based on the length of sentences in training data. Tri-gram POS sequences and POS-relation-POS are trimmed based on similarities between full training and dev data. Intersection applies the two feature extractions together (POS sequences and POS-relation-POS trimming). Threshold is fixed for all languages (0.1). We also indicate ratios of trimming of training datasets alongside parsing results.

The empirical reasons that naïve length-based approaches do not work well may be worth further consideration, but as a general matter the length metric is overly simplistic and may omit significant amounts of pertinent training data; similarity metrics, by contrast, attempt to retain the most pertinent data.
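
The average ± scale window from this section can be written out as a short Python sketch. The names `length_window` and `filter_by_length` are hypothetical, and the way the scale multiplier enters (multiplying the half-width of the window) is an assumption, since the text does not spell it out.

```python
def length_window(dev_lengths, multiplier=1.0):
    """Return (low, high) length bounds derived from dev sentence lengths."""
    avg = sum(dev_lengths) / len(dev_lengths)
    # scale = max(|avg - max|, |avg - min|) over the dev sentence lengths
    scale = max(abs(avg - max(dev_lengths)), abs(avg - min(dev_lengths)))
    # assumption: the multiplier stretches the window symmetrically
    return avg - multiplier * scale, avg + multiplier * scale


def filter_by_length(train_lengths, dev_lengths, multiplier=1.0):
    """Return indices of training sentences whose length lies in the window."""
    low, high = length_window(dev_lengths, multiplier)
    return [i for i, n in enumerate(train_lengths) if low <= n <= high]
```

For dev lengths of 10, 20 and 30 words, the average is 20 and the scale is 10, so a multiplier of 1 keeps training sentences of 10 to 30 words.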

Cross-lingual Similarities for Surprise Language Parsing
We use the same similarity measures to identify training data for the surprise languages. Since surprise languages are provided without training data, we select training data from the training datasets of all languages by calculating similarities with the sample datasets of the surprise languages. Figure 4 shows threshold estimation for surprise languages. We fix the similarity threshold at 0.3 for tri-gram POS sequences because larger thresholds provide too little training data, and smaller thresholds do not compress adequately. Thus, we tune based on resulting corpus size. For POS-dep-POS, we vary the similarity threshold between 0.7 and 0.9. Thresholds of 0.3 for tri-gram POS sequences and 0.7 for POS-dep-POS both result in a size of about 25% of the training data set for monolingual corpus selection.
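
Tuning a threshold by resulting corpus size amounts to checking what fraction of the pooled training sentences survives at each candidate threshold. A minimal Python sketch follows, assuming the per-sentence similarity scores have already been computed; the names `kept_ratio_by_threshold` and `closest_threshold` are hypothetical.

```python
def kept_ratio_by_threshold(scores, thresholds):
    """Map each threshold to the fraction of sentences scoring above it."""
    total = len(scores)
    ratios = {}
    for t in thresholds:
        kept = sum(1 for s in scores if s > t)
        ratios[t] = kept / total if total else 0.0
    return ratios


def closest_threshold(ratios, target_ratio):
    """Pick the threshold whose kept fraction is nearest the target size."""
    return min(ratios, key=lambda t: abs(ratios[t] - target_ratio))


# Toy scores for illustration only; these are not values from the paper.
ratios = kept_ratio_by_threshold([0.1, 0.2, 0.35, 0.8], [0.1, 0.3, 0.7])
best = closest_threshold(ratios, 0.25)
```

With these toy scores, the thresholds 0.1, 0.3 and 0.7 keep 75%, 50% and 25% of the sentences, so a target of about 25% selects 0.7.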

Results
For the submitted official results through TIRA (Potthast et al., 2014), we use the intersection model for all languages. Since we focus on corpus selection, we do not perform additional preprocessing and we use the provided training datasets as they are. We fix the threshold at 0.1 for both tri-gram POS sequences and POS-dep-POS because it gives the best results on the dev datasets for monolingual training. We fix 0.3 for tri-gram POS sequences and we use the thresholds described in Table 2 for POS-dep-POS to select training datasets from all languages for surprise language parsing. We provide the basic parsing model for PUD treebanks, for example, we use cs parsing model for

Table 2: POS-dep-POS thresholds for surprise languages.
bxr   hsb   kmr   sme
0.8   0.9   0.8   0.7
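
The intersection model can be read as keeping only the sentences that pass both similarity thresholds. The Python sketch below illustrates that reading; the function name `intersection_selection` and the constant `SURPRISE_DEP_THRESHOLDS` are hypothetical, and the per-language values are those listed for bxr, hsb, kmr and sme in this section.

```python
# POS-dep-POS thresholds per surprise language, as listed for Table 2.
SURPRISE_DEP_THRESHOLDS = {"bxr": 0.8, "hsb": 0.9, "kmr": 0.8, "sme": 0.7}


def intersection_selection(pos_scores, dep_scores,
                           pos_threshold=0.1, dep_threshold=0.1):
    """Return indices of sentences above both thresholds.

    pos_scores / dep_scores: per-sentence similarities for tri-gram POS
    sequences and POS-dep-POS features, aligned by sentence index.
    The defaults follow the fixed monolingual threshold of 0.1.
    """
    return [
        i
        for i, (p, d) in enumerate(zip(pos_scores, dep_scores))
        if p > pos_threshold and d > dep_threshold
    ]
```

For a surprise language such as bxr, the call would use `pos_threshold=0.3` and `dep_threshold=SURPRISE_DEP_THRESHOLDS["bxr"]`.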

Discussion and Conclusion
In this paper, we introduced the idea of refining the training datasets for UD parsing and cross-lingual parsing by selecting training data from the same language and from other languages, respectively. While our approach reduced the size of training data significantly, we retained performance within 0.5% of the baseline system. Additionally, corpus refinement methods can be of great importance in trimming the size of training data for computationally intensive algorithms or for the runtime performance of large-scale system deployments. The total runtime on the entire set of treebanks is only around 10 minutes with a default setting, which is fast enough; additional optimization may improve this. The size of the parsing models is smaller than the baseline because we use only a subset of the entire training datasets. Even though we do not use any external data, our final results are competitive with the baseline system even with smaller datasets. The current results presented here show promise, and there exists potential for further refinement by, for instance, using different similarity metrics. Exploring different similarity metrics may further enhance performance for other NLP tasks as well as UD parsing.

Table 4: LAS results per treebank. We highlight the score where we can improve the results compared to the baseline system.