Estimating POS Annotation Consistency of Different Treebanks in a Language

We introduce a new symmetric measure (called θ pos ) that utilises the non-symmetric KL cpos 3 measure (Rosa and Žabokrtský, 2015) to allow us to compare the annotation consistency between different treebanks of a given language, annotated under the same guidelines. We can set a threshold for this new measure so that a pair of treebanks can be considered harmonious in their annotation if θ pos does not surpass the threshold. For the calculation of the threshold, we estimate the effects of (i) the size variation, and (ii) the genre variation in the considered pair of treebanks. The estimations are based on data from treebanks of distinct language families, making the threshold less dependent on the properties of individual languages. We demonstrate the utility of the proposed measure by listing the treebanks in Universal Dependencies version 2.5 (UDv2.5) (Zeman et al., 2019) data that are annotated consistently with other treebanks of the same language. However, the measure could be used to assess inter-treebank annotation consistency under other (non-UD) annotation guidelines as well.


Introduction
There exist a multitude of treebanks for different languages (Zeman et al., 2014). As noted by Kakkonen (2006), a variety of formats and annotation schemes exist even for treebanks of the same language. As an example, two well-known POS tagging schemes for English are the POS tagging scheme of the Penn Treebank 1 (Marcus et al., 1994) and the Universal POS tagset (Petrov et al., 2012).
The Universal Dependencies (UD) Project (Nivre et al., 2016b; Nivre et al., 2020) was introduced in 2014 as a means of unifying the novel features of different annotation formats into a universal annotation scheme that is consistent across languages. It has since become a standard reference for comparing parser performance (Che et al., 2018; Martínez Alonso et al., 2017), for studying language-specific features (Alzetta et al., 2018), and for dependency parsing shared tasks on UD data.
UDv2.5 (Zeman et al., 2019) contains 157 treebanks in 90 languages, with multiple treebanks for some languages. Regardless of the differences in genre or in the teams involved in building the treebanks, all treebanks of one language should be consistent with respect to the annotation guidelines, both within and across treebanks. However, this is often not the case, primarily because of the different sources of origin of the annotated data. The problem of determining the degree to which different treebanks differ from each other has been studied in some detail over multiple years, but is not yet entirely solved.
The rest of the article is organised as follows. The literature relevant to the problem is discussed in Section 2, followed by a short introduction to the KL cpos 3 measure and a definition of the proposed measure in Section 3. Section 4 lists the constraints for choosing the datasets for the experiments described in Sections 5 and 6. The results of the experiments are summarised in Section 7. A discussion of the measure concludes the article in Section 8. The treebanks in UDv2.5 are marked for consistency or inconsistency of their POS annotation based on the proposed measure in Appendix A; Appendix B demonstrates the calculation of the measure for a concrete pair of treebanks.

Related Work
One of the most commonly used approaches to finding inconsistencies in annotation is to train a high-quality tagger or parser on the given training data and evaluate the cases where the prediction of the trained model differs from the annotation of the test data. This approach can be extended by bootstrapping several trained models and comparing their majority consensus against the available annotation. Martínez Alonso and Zeman (2016) assessed the similarity of the Spanish treebanks in UDv1.3 (Nivre et al., 2016a) using dependency parsing: a high-accuracy parser was trained on one of the treebanks and then tested on another, and if the drop in parsing accuracy was larger than expected, the treebanks were marked as not similar enough. The same technique was employed to evaluate the different Russian treebanks in UDv2.2 against each other. It is worth noting that the performance of the tagger or parser used may be a bottleneck, with the size and genre composition of the evaluated treebanks as additional variables, among others. Furthermore, the acceptable variability in score depends on the architecture of the trained model, and is not comparable across languages, or even when a different architecture is employed on the same data.

Dickinson and Meurers (2003a; 2003b) focus on finding an n-gram of tokens that occurs in the same context in the corpus (referred to as a variation nucleus) such that its different occurrences are annotated differently. Originally designed for continuous annotation, 2 the method was later adapted to look for inconsistencies in discontinuous annotation as well (Dickinson and Meurers, 2005). Other work compares the POS annotation consistency of several Korean treebanks using the relative frequencies of the individual POS tags, while also briefly discussing the causes of the variation in their distribution.
While such analysis is slightly helpful in terms of drawing a comparison, it does not consider the interaction of different POS tags with each other. To illustrate such interactions, an n-gram-based approach might be utilised.

KL cpos 3 and Measure Definition
In a delexicalised cross-language parser transfer scenario, Rosa and Žabokrtský (2015) show that the KL-divergence score of POS trigrams, referred to as KL cpos 3 , can be effectively used for the selection of the source language:

KL_cpos³(tgt, src) = Σ_{cpos³ ∈ tgt} f_tgt(cpos³) · log( f_tgt(cpos³) / f_src(cpos³) )

where cpos 3 is a coarse-grained 3 POS tag trigram, f_tgt and f_src are the relative frequencies of the trigram in the target and source treebank, with count src (cpos 3 ) = 1 for each unseen trigram and a special value for cpos i−1 or cpos i+1 when cpos i lies at the sentence beginning or end.
Considering that a treebank of the same language (despite the differences in the genres 4 covered) should be a better fit for POS transfer than a treebank of another language, we employ a symmetric variant of KL cpos 3 , called θ pos , to assess the annotation consistency among the different treebanks of a language. θ pos is a non-negative divergence measure. However, the measure scores cannot be compared directly across different languages. For a language-independent usage, an empirical upper bound needs to be placed on the θ pos scores. As long as the θ pos scores are lower than this empirical bound, the considered pair of treebanks can be considered harmonious in terms of their POS annotation. We denote this empirical upper bound by Θ pos . The measures θ pos and Θ pos are linked together in the following definition:

Definition 1. Given two treebanks A and B, we say the treebanks are consistent in their POS annotation if the symmetric measure of their mutual divergence (given by θ pos ) is less than or equal to a threshold (given by Θ pos ). Formally:

θ_pos(A, B) = max( KL_cpos³(A, B), KL_cpos³(B, A) ) ≤ Θ_pos

where KL cpos 3 (P, Q) indicates the KL cpos 3 score of Q as an estimator for P.
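The computation of KL cpos 3 and its symmetrisation into θ pos can be sketched as follows. This is a minimal illustration, not the authors' implementation; in particular, taking the max over the two divergence directions and leaving the source total unadjusted after smoothing are assumptions on our part.

```python
from collections import Counter
from math import log

BOUNDARY = "<S>"  # special value for positions before/after the sentence

def pos_trigrams(sentences):
    """Count coarse POS trigrams, padding each sentence with a boundary tag."""
    counts = Counter()
    for tags in sentences:
        padded = [BOUNDARY] + list(tags) + [BOUNDARY]
        counts.update(tuple(padded[i:i + 3]) for i in range(len(padded) - 2))
    return counts

def kl_cpos3(tgt, src):
    """KL divergence of the src trigram distribution as an estimator for tgt.

    Trigrams unseen in src receive count 1, following Rosa and Zabokrtsky
    (2015); the src total is left unadjusted for simplicity (an assumption).
    """
    n_tgt, n_src = sum(tgt.values()), sum(src.values())
    score = 0.0
    for tri, count in tgt.items():
        f_tgt = count / n_tgt
        f_src = src.get(tri, 1) / n_src
        score += f_tgt * log(f_tgt / f_src)
    return score

def theta_pos(a, b):
    """Symmetric variant: here taken as the max of both directions (assumed)."""
    return max(kl_cpos3(a, b), kl_cpos3(b, a))
```

For two identical trigram distributions, kl_cpos3 is zero, and theta_pos(a, b) always equals theta_pos(b, a).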
Even though Θ pos is an empirical bound on the θ pos measure, it is essentially a property of the measure itself: the bound would need to be estimated anew for a different set of annotation guidelines. In the remainder of the article, we estimate the empirical upper bound in a language-independent manner by examining the influence of data size and of the POS distribution of individual genres on θ pos in different UDv2.5 treebanks (Zeman et al., 2019).

Assumptions while Working with UD Data
The UD website 5 provides a star ranking of individual treebanks within each language. The ranking is calculated heuristically 6 , depending on multiple factors including the size of the treebank and the number of genres present in the data. The score also incorporates the output from the official UD validator 7 and from the search for known error types 8 in UDAPI (Popel et al., 2017). The treebank's compliance with the UD guidelines thus plays an important role in the score. While it is possible for a treebank to have a high score without being internally consistent, we assume that a treebank that adheres better to the guidelines also contains fewer inconsistencies. Therefore, we trust treebanks rated 3.5 stars or more (out of 5 stars).
Sometimes a whole treebank may not be sufficiently internally consistent because different genres have different distributions of POS n-grams. We may then require that the data belonging to one particular genre is annotated consistently.

Dataset Size and θ pos
The value of θ pos may depend on data size, as some POS trigrams may not be present in small datasets. We use k-fold cross-validation to check the effect of the presence or absence of POS trigrams in the data as a function of data size.

Experimental Setup
KL cpos 3 (tgt, src) is defined on distributions of trigrams found in tgt and src. The calculated scores (and consequently the θ pos scores) are therefore affected by the presence or absence of POS trigrams. To discount variability of θ pos due to genre distribution, we use data from a single genre (news). We take two UDv2.5 treebanks that have a large number of news sentences and a high star ranking, and that belong to different language families: Czech-PDT (Indo-European, rated 4.5 stars) and Estonian-EDT (Uralic, rated 4 stars). For easier manipulation, we downsample the news data from both treebanks as shown in Table 1.

Treebank       Genre   Sentences   Downsampled to
Czech-PDT      News    53,075      50,000
Estonian-EDT   News    13,557      12,000

Table 1: Sentence counts in the news genre in Czech-PDT and Estonian-EDT.
To check the effect of data size on θ pos , we run k-fold cross-validation on the downsampled data with different values of k. For each value of k, the downsampled data is split into k folds; we randomly select one fold as the test set and compute θ pos between each of the remaining k − 1 folds and the test set. This yields k − 1 values of θ pos ; their average is the θ pos value we report for the given k in Table 2.
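The cross-validation procedure above can be sketched as follows. The theta_pos here is a compact re-implementation of the measure (the max-symmetrisation is an assumption), and fold assignment by striding over a shuffled list is a simplification.

```python
import random
from collections import Counter
from math import log

def trigram_counts(sents):
    """POS trigram counts with sentence-boundary padding."""
    c = Counter()
    for tags in sents:
        p = ["<S>"] + list(tags) + ["<S>"]
        c.update(tuple(p[i:i + 3]) for i in range(len(p) - 2))
    return c

def kl_cpos3(tgt, src):
    nt, ns = sum(tgt.values()), sum(src.values())
    return sum((c / nt) * log((c / nt) / (src.get(t, 1) / ns))
               for t, c in tgt.items())

def theta_pos(a, b):
    return max(kl_cpos3(a, b), kl_cpos3(b, a))  # assumed symmetrisation

def kfold_theta(sentences, k, seed=0):
    """Shuffle, split into k folds, take one fold as the test set, and
    average theta_pos between the test set and each remaining fold."""
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    folds = [sents[i::k] for i in range(k)]
    test = trigram_counts(folds[0])
    scores = [theta_pos(trigram_counts(fold), test) for fold in folds[1:]]
    return sum(scores) / len(scores)
```

On perfectly homogeneous data, every fold has the same trigram distribution and the averaged score is zero; real treebank data yields the positive scores reported in Table 2.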
In addition to the values of θ pos , we are also interested in their relationship with the number of unique trigrams common to the pair of distributions. We define coverage for a fold as the number of unique trigrams common to both the training and test sets of the fold, expressed as a ratio of the number of unique trigrams in the larger training set. There is a strong negative correlation between the coverage of POS trigrams and the θ pos scores (Pearson correlation coefficient r = −0.9075 and −0.9252 in Tables 2a and 2b, respectively); however, the coverage itself depends on the size of the datasets being compared. Figures 1a and 1b show the variability in (i) the number of distinct POS trigrams, and (ii) the total number of POS trigrams, as the data size changes.
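The coverage statistic can be computed directly from the two trigram inventories; a minimal sketch (trigram_counts is an illustrative helper mirroring the boundary padding described in Section 3):

```python
from collections import Counter

def trigram_counts(sents):
    """POS trigram counts with sentence-boundary padding."""
    c = Counter()
    for tags in sents:
        p = ["<S>"] + list(tags) + ["<S>"]
        c.update(tuple(p[i:i + 3]) for i in range(len(p) - 2))
    return c

def coverage(train, test):
    """Unique trigrams shared by the training and test sets, as a ratio of
    the unique trigrams in the (larger) training set."""
    return len(set(train) & set(test)) / len(set(train))
```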

Experimental Scores and Inference
As evident from the figures, the growth pattern of counts is similar in both languages. The POS trigrams in a small part of the dataset obviously cannot be considered representative of those present in the entire dataset. Based on the observed coverage curve, we set 400 sentences 9 as the minimum size of a dataset whose consistency with another dataset is assessed.
However, the difference in average sentence length is a factor that needs to be taken into account as well. If two treebanks differ considerably in their average sentence length, then size expressed in the number of sentences does not reflect the number of tokens (and, consequently, the number of POS trigrams). For example, consider the Arabic treebanks in Table 3. If we take an equal number of sentences from Arabic-PUD and either of the other two treebanks, the total number of words will differ by a factor of almost 2.

Table 3: Average sentence lengths in Arabic treebanks. A syntactic word (node in the dependency tree) typically corresponds to a surface token, but some tokens are split into multiple syntactic words.
Accommodating the dataset-size comparison, we can now formally state the conditions under which two datasets can be compared. Given two datasets A and B, the pair can be checked for annotation consistency if the following heuristic constraints are satisfied:

1. Each dataset has at least 400 sentences, i.e. sent(A) ≥ 400 and sent(B) ≥ 400; and
2. The dataset with the smaller average sentence length has at least as many syntactic words as 400 sentences of the other dataset, i.e. assuming len(A) ≤ len(B), words(A) ≥ 400 · len(B), where len(·) denotes the average sentence length and words(·) the number of syntactic words.

From Table 2, when the test split is composed of 500 sentences (k = 100 for Czech; k = 24 for Estonian), the θ pos measure is ≈ 0.3. Since larger values of k in either dataset do not satisfy heuristic constraint 1, we estimate the empirical upper bound of θ pos based on k = 100 (Czech) and k = 24 (Estonian), respectively.
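The two heuristic constraints can be checked from simple corpus statistics; a sketch (the function and parameter names are illustrative, not from the original):

```python
def comparable(sent_a, words_a, sent_b, words_b, min_sents=400):
    """Return True if datasets A and B satisfy both heuristic constraints.

    sent_x: number of sentences; words_x: number of syntactic words.
    """
    # Constraint 1: each dataset has at least 400 sentences.
    if sent_a < min_sents or sent_b < min_sents:
        return False
    # Constraint 2: the dataset with the smaller average sentence length
    # must contain at least as many syntactic words as 400 sentences of
    # the other dataset.
    len_a, len_b = words_a / sent_a, words_b / sent_b
    shorter_words = words_a if len_a <= len_b else words_b
    return shorter_words >= min_sents * max(len_a, len_b)
```

For instance, when average sentence lengths differ by a factor of two (as with the Arabic treebanks in Table 3), 450 short sentences would not qualify against the longer-sentence treebank even though constraint 1 is met.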
When estimating Θ pos , we do not want to be too restrictive: the observed θ pos ≈ 0.3 is based on the internal consistency of a good treebank, which will be very hard to match for consistency between two different treebanks. We therefore round the maximum observed θ pos score up from ≈ 0.3 to 0.5. Formally, if the datasets A and B contain data from the same genre x, and the sizes of the datasets are comparable (as per the heuristic constraints defined above), the upper limit on the θ pos score is given by Equation 4:

θ_pos(A_x, B_x) ≤ Θ_pos(A_x, B_x) = 0.5    (4)

In the previous experiments we assumed that the two compared datasets consist of the same language and genre. The distribution of POS trigrams is likely to differ when the two datasets come from different genres. We now proceed to investigate cross-genre variability inside a treebank that we believe is reasonably internally consistent. We are looking for Θ pos thresholds that could be used to assess the annotation similarity of two treebanks that differ in genre.

Inter-Genre Similarity
The Polish-LFG treebank in UDv2.5 (rated 4 stars) contains data from different genres, 10 the counts of which are shown in Table 4a. Table 4b shows the genres in the UDv2.5 Finnish-TDT treebank (rated 3.5 stars). In this case, the data labeled europarl and uni_articles (university articles) is kept separate and not used in the estimation of the variability of θ pos across genres. For each genre source, the dataset is downsampled to 900 sentences, and the results are presented on the individual folds resulting from 2-fold cross-validation on the downsampled data. As can be seen from Table 5, the different genres in Finnish-TDT are internally consistent in their annotation as per the constraint in Equation 4. A similar analysis for the genres in Polish-LFG is omitted here because the social genre does not have enough data.

Experimental Setup
We compare the different genres in the Polish-LFG and Finnish-TDT treebanks by presenting the θ pos scores for each pair of genres (as per Table 4). Each genre is downsampled to the number of sentences listed in Table 6 so that the heuristic constraints for dataset comparison are satisfied.

Experimental Scores and Inference
Tables 7 and 8 list the θ pos scores for data from Polish-LFG and Finnish-TDT, respectively. It is worth noting that for most genre pairs, the Θ pos constraint as set in Equation 4 is not sufficient, as θ pos frequently surpasses the imposed limit of 0.5.

Table 8: θ pos scores (± standard deviation) averaged over 100 runs for the inter-genre analysis in downsampled UDv2.5 Finnish-TDT data. Each run results in a different downsample.

Table 6: For each genre (X), the number of sentences it is downsampled to.
As expected, we need a higher threshold when comparing datasets whose genres do not match. While a threshold of 1.6 would accommodate the data in Polish-LFG and Finnish-TDT, we again allow some room to reduce false alarms about inconsistent pairs of treebanks, and frame the empirical upper bound on θ pos between genre x in dataset A (written A_x) and genre y in dataset B (B_y) as in Equation 5:

θ_pos(A_x, B_y) ≤ Θ_pos(A_x, B_y) = 2.0, for x ≠ y    (5)

Combination of Genres
We denote the set of genres in treebank X as G_X. Given two treebanks with at least one differing genre, the genres in the two treebanks can interact in one of the three cases shown in Figure 2. To see how the θ pos scores are affected in each of these cases, we experiment with the data from UDv2.5 Polish-LFG.

Experimental Setup
We start by downsampling the data from the fiction and news genres to 2000 sentences each.
Using 2-fold cross-validation, the downsampled data is then split into two halves, termed the base and test sets for the genre. In addition, we downsample the data from the spoken genre to 1000 sentences and use it as a test set (without a corresponding base set). We then study the variability of θ pos in the scenarios depicted in Figure 2. A dataset formed by combining genres is identified by a concatenated name: a trailing base in the name marks that it is composed of data from the base sets of the genre(s), and datasets using the test sets of genre(s) are similarly identified by a trailing test.

Experimental Scores and Inference
We present the calculated scores for the different cases in Table 9. The decomposition of a treebank into its constituent genres forms the basis for studying the variance of θ pos scores under a combination of different genres. Upon closer inspection, we found that when multiple genres are present in a treebank, the θ pos score is dominated by the POS trigrams that are typical of the language, and the genre-specific POS trigrams become increasingly obscured.
Once the individual genres have been identified and checked for the inter-genre θ pos scores, the overall measure score is less than the average of the measure scores calculated for the individual pairs of genres in the treebank(s). Formally, assuming treebanks A and B can be split into their constituent genres such that G_A = {A_1, A_2, ..., A_i} and G_B = {B_1, B_2, ..., B_j}, the overall limit on the θ pos (A, B) score can be specified as in Equation 6:

θ_pos(A, B) ≤ (1 / (i · j)) · Σ_{x=1}^{i} Σ_{y=1}^{j} Θ_pos(A_x, B_y)    (6)
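Averaging the per-pair limits can be sketched as follows, using the limits of 0.5 for a matching genre pair and 2.0 for a differing pair; the function name is illustrative.

```python
THETA_SAME = 0.5  # limit for a matching genre pair (Equation 4)
THETA_DIFF = 2.0  # limit for a differing genre pair (Equation 5)

def overall_limit(genres_a, genres_b):
    """Average of the per-genre-pair limits over all genre pairs."""
    pairs = [(x, y) for x in genres_a for y in genres_b]
    limits = [THETA_SAME if x == y else THETA_DIFF for x, y in pairs]
    return sum(limits) / len(limits)
```

For example, comparing a news-only treebank with a news-plus-fiction treebank yields the pairs (news, news) and (fiction, news), and hence a limit of (0.5 + 2.0) / 2 = 1.25.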

Adulterant Genres
In our analysis so far, we have restricted ourselves to instances where the data in the different genres could be reliably compared. We define a genre in the dataset as adulterant if the number of sentences in the genre does not satisfy either or both the constraints pertaining to dataset comparison. In this subsection, we take a look at how the presence of adulterant genres affects the θ pos scores.

Experimental Setup
To study the effect of adulterant genres, we first downsample data from the fiction, news and spoken genres in Polish-LFG to 500, 500 and 600 sentences respectively. For adulterant genres, we work with the data from the academic, blog and legal genres. The data from all the adulterant genres is concatenated to form a dataset labeled others. Non-adulterant genres are then combined with adulterant genres to form a dataset identified as X-Y, where X contains data from news, fiction, or a combination of the two genres, and Y is either an individual adulterant genre or the combination of all adulterant genres (others). All the datasets created from the downsampled data are compared with the downsampled data from spoken.

Experimental Scores and Inference
The calculated θ pos scores for each pair, averaged over 100 runs, are reported in Table 10. We observe that a low number of adulterant genres in the data does not affect the θ pos scores heavily. However, the presence of multiple adulterant genres pushes the θ pos scores up by almost 1.5 compared to when no adulterants are present. Taking into account also the standard deviation and the high annotation quality of the treebank, we can add a headroom of +2.0 when adulterant genres are present.
Formally, assuming treebanks A and B can be split into their constituent genres such that G_A = {A_1, ..., A_i} and G_B = {B_1, ..., B_j}, and at least one of the genres is adulterant, the limit from Equation 6 is raised by the headroom:

θ_pos(A, B) ≤ (1 / (i · j)) · Σ_{x=1}^{i} Σ_{y=1}^{j} Θ_pos(A_x, B_y) + 2.0

Framing the Overall θ pos Limit
In cases when the data from the individual genres is not annotated consistently, the overall θ pos score might still fall within the averaged limits for the individual genres, wrongly marking the pair as consistent. To avoid this, we calculate the idealistic Θ′ pos as the average of the Θ pos values for the genre pairs.
Θ′_pos(A, B) = (1 / (i · j)) · Σ_{x=1}^{i} Σ_{y=1}^{j} Θ_pos(A_x, B_y)    (7)

where Θ_pos(A_x, B_x) = 0.5 and Θ_pos(A_x, B_y) = 2.0 as per Equations 4 and 5, respectively. For the overall calculation of Θ pos for treebanks with multiple genres, the computation can be given by:

(1 / (i · j)) · Σ_{x=1}^{i} Σ_{y=1}^{j} θ_pos(A_x, B_y) ≤ Θ′_pos(A, B)    (8)

where θ_pos(A_x, B_y) refers to the θ pos score calculated between genre x present in treebank A and genre y present in treebank B.
Regardless of the genre composition of the treebanks under consideration, treebanks with θ pos ≤ 0.5 are termed consistent in their POS annotation. Similarly, treebanks with θ pos ≥ 4.0 are termed inconsistent in their POS annotation. When multiple genres are present in either treebank, Equation 9 can be employed if just the percentage composition of the different genres in the treebanks is known, regardless of whether it is possible to split the treebanks into their constituent genres. For a fine-tuned estimation, however, it is imperative to be able to split the treebank into its constituent genres.
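When only the percentage composition of the genres is known, one plausible reading of the composition-based limit weights each genre-pair limit by the product of the two genre proportions; the following sketch rests on that assumption, and the function name is illustrative.

```python
def weighted_limit(comp_a, comp_b, same=0.5, diff=2.0):
    """comp_x maps genre name -> proportion of the treebank (sums to 1.0).

    Each genre-pair limit (0.5 if the genres match, 2.0 otherwise) is
    weighted by the product of the two proportions.
    """
    return sum(pa * pb * (same if x == y else diff)
               for x, pa in comp_a.items()
               for y, pb in comp_b.items())
```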
For treebanks with adulterant genres, the higher Θ pos limit on the θ pos scores can be problematic. If possible, the adulterant genres should be isolated, and the annotation consistency of the treebank should be checked without the presence of any adulterant genre(s).

Using θ pos to Localise Inconsistency
While the θ pos measure is primarily meant to identify whether two given treebanks are consistent in their POS annotation, the measure can also be employed to localise points of inconsistency, if required.
Consider a pair of treebanks whose overall θ pos score exceeds the limit because of a single genre. Concentrating on the instances from this genre should then be enough to bring the overall θ pos score between the two treebanks under the Θ pos limit.

Split into Constituent Genres as a Requirement
The estimation of Θ pos is primarily based on the requirement that the genre composition of the treebanks is known. While the limit is best estimated when the genres can be isolated and the adulterant genres identified, a crude estimate is also possible: one can assign the genre pairs common to both treebanks a limit of 0.5 and the differing genre pairs a limit of 2.0, and average these estimates to obtain a rough Θ pos limit that does not account for adulterant genres. Data with multi-genre classification can be handled in a similar manner.

Conclusion
We proposed a numeric measure based on the KL cpos 3 measure (Rosa and Žabokrtský, 2015) to assess the POS annotation consistency across treebanks of the same language that allegedly follow the same guidelines. Through the use of the measure, we sought to answer how the different treebanks of a language, with variable sizes and genre distributions but following the same annotation guidelines, can be compared against each other. We also defined a reliable threshold on the proposed measure that informs the annotators if the treebanks being compared are not consistent with each other. In addition, the measure can be used to localise the genre(s) that cause the inconsistency with another treebank. We also evaluated the treebanks in UDv2.5 (Zeman et al., 2019) and identified the consistent and inconsistent treebank pairs based on the proposed measure. To the best of our knowledge, this is the first such measure that compares treebanks directly, without the added variable of tagger performance. At present, the measure does not allow checking the consistency of syntactic annotation. Perhaps similar ideas might lead to a syntactic version of the measure in the future.
A Appendix A: θ pos Scores for UDv2.5 Treebanks, Annotated to Mark Consistent and Inconsistent Treebanks
This appendix lists the θ pos scores for treebank pairs in the UDv2.5 data (Zeman et al., 2019), annotated to mark consistent and inconsistent pairs (Table 13). Table 14 gives the Θ pos limit for the treebanks that were marked as inconsistent in Table 13. We omit the Θ pos limit for the Ancient_Greek treebanks, since the reported θ pos score for the treebanks in that language exceeds the hard limit of 4.0.

There are a few important points that need to be specified here:

1. The genre affiliation of individual sentences in any given treebank is optional and not standardized. If the README.md file associated with a treebank does not specify how to split the treebank into its constituent genres, the information can be requested from the data providers of the treebank. Turkish-IMST could not be assessed for annotation consistency with the other Turkish treebank, as the information on its genre split could not be obtained from either source.