Supervised Treebank Conversion: Data and Approaches



Abstract
Treebank conversion is a straightforward and effective way to exploit various heterogeneous treebanks for boosting parsing accuracy. However, previous work mainly focuses on unsupervised treebank conversion and makes little progress due to the lack of manually labeled data where each sentence has two syntactic trees complying with two different guidelines at the same time, referred to as bi-tree aligned data.
In this work, we for the first time propose the task of supervised treebank conversion. First, we manually construct a bi-tree aligned dataset containing over ten thousand sentences. Then, we propose two simple yet effective treebank conversion approaches (pattern embedding and treeLSTM) based on the state-of-the-art deep biaffine parser. Experimental results show that 1) the two approaches achieve comparable conversion accuracy, and 2) treebank conversion is superior to the widely used multi-task learning framework in multi-treebank exploitation and leads to significantly higher parsing accuracy.

Introduction
During the past few years, neural-network-based dependency parsing has achieved significant progress and outperformed traditional discrete-feature-based parsing (Chen and Manning, 2014; Dyer et al., 2015; Zhou).

* The first two (student) authors make equal contributions to this work. Zhenghua is the corresponding author.
Meanwhile, motivated by different syntactic theories and practices, major languages in the world often possess multiple large-scale heterogeneous treebanks, e.g., the Tiger (Brants et al., 2002) and TüBa-D/Z (Telljohann et al., 2004) treebanks for German, the Talbanken (Einarsson, 1976) and Syntag (Järborg, 1986) treebanks for Swedish, the ISST (Montemagni et al., 2003) and TUT treebanks for Italian, etc. Table 1 lists several large-scale Chinese treebanks. In this work, we take HIT-CDT as a case study. Our next-step plan is to annotate bi-tree aligned data for PKU-CDT and then convert PKU-CDT to our guideline. For non-dependency treebanks, the straightforward choice is to convert such treebanks to dependency treebanks based on heuristic head-finding rules. The second choice is to directly extend our proposed approaches by adapting the patterns and treeLSTMs for non-dependency structures, which should be straightforward as well.
Considering the high cost of treebank construction, it has always been an interesting and attractive research direction to exploit various heterogeneous treebanks for boosting parsing performance. Though under different linguistic theories or annotation guidelines, the treebanks are painstakingly developed to capture the syntactic structures of the same language, thereby having a great deal of common grounds.
Previous researchers have proposed two approaches for multi-treebank exploitation. On the one hand, the guiding-feature method projects the knowledge of the source-side treebank into the target-side treebank, and utilizes extra pattern-based features as guidance for the target-side parsing, mainly for traditional discrete-feature-based parsing. On the other hand, the multi-task learning method simultaneously trains two parsers on two treebanks and uses shared neural network parameters to represent common-ground syntactic knowledge (Guo et al., 2016). Despite their effectiveness, the guiding-feature method fails to directly use the source-side treebank as extra training data, while the multi-task learning method is incapable of explicitly capturing the structural correspondences between the two guidelines. In this sense, we consider both of them indirect exploitation approaches.
Compared with the indirect approaches, treebank conversion aims to directly convert a source-side treebank into the target-side guideline, and uses the converted treebank as extra labeled data for training the target-side model. Taking the example in Figure 1, the goal of this work is to convert the lower tree, which follows the HIT-CDT guideline, into the upper one, which follows our new guideline. However, due to the lack of bi-tree aligned data, in which each sentence has two syntactic trees following the source-side and target-side guidelines respectively, most previous studies are based on unsupervised treebank conversion (Niu et al., 2009) or pseudo bi-tree aligned data (Zhu et al., 2011; Li et al., 2013), making very limited progress.
In this work, we for the first time propose the task of supervised treebank conversion.
The key motivation is to better utilize a large-scale source-side treebank by constructing a small-scale bi-tree aligned dataset. In summary, we make the following contributions.
(1) We have manually annotated a high-quality bi-tree aligned dataset containing over ten thousand sentences, by re-annotating the HIT-CDT treebank according to a new guideline.
(2) We propose a pattern embedding conversion approach by retrofitting the indirect guiding-feature method to the direct conversion scenario, with several substantial extensions.
(3) We propose a treeLSTM conversion approach that encodes the source-side tree at a deeper level than the shallow pattern embedding approach.
Experimental results show that 1) the two conversion approaches achieve nearly the same conversion accuracy, and 2) direct treebank conversion is superior to indirect multi-task learning in multi-treebank exploitation, in both methodological simplicity and performance, though at the cost of manual annotation. We release the annotation guideline and the newly annotated data at http://hlt.suda.edu.cn/index.php/SUCDT.

Annotation of Bi-tree Aligned Data
The key issue for treebank conversion is that sentences in the source-side and target-side treebanks are non-overlapping.
In other words, there is no bi-tree aligned dataset in which each sentence has two syntactic trees complying with the two guidelines, as shown in Figure 1. Consequently, we cannot train a supervised conversion model to directly learn the structural correspondences between the two guidelines. To overcome this obstacle, we construct a bi-tree aligned dataset of over ten thousand sentences by re-annotating the publicly available dependency-structure HIT-CDT treebank according to a new annotation guideline.

Data Annotation
Annotation guideline.
Unlike phrase-structure treebank construction with very detailed and systematic guidelines (Xue et al., 2005; Zhou, 2004), previous work on Chinese dependency-structure annotation only briefly describes each relation label with a few concrete examples. For example, the HIT-CDT guideline contains 14 relation labels and illustrates them in a 14-page document.
The UD (Universal Dependencies) project (http://universaldependencies.org) releases a more detailed language-generic guideline to facilitate cross-linguistically consistent annotation, containing 37 relation labels.
However, after in-depth study, we find that the UD guideline, though very useful and comprehensive, may not be compact enough for realistic annotation of Chinese-specific syntax. After many months' investigation and trial, we have developed a systematic and detailed annotation guideline for Chinese dependency treebank construction. Our 60-page guideline employs 20 relation labels and gives detailed illustrations for annotation, in order to improve consistency and quality.
Please refer to Guo et al. (2018) for the details of our guideline, including detailed discussions on the correspondences and differences between the UD guideline and ours.
Partial annotation. To save annotation effort, we adopt the idea of Li et al. (2016) and only annotate the most uncertain (difficult) words in a sentence. For simplicity, we directly use their released parser and produce the uncertainty results for all HIT-CDT sentences via two-fold jack-knifing. First, we select the 2,000 most difficult sentences of lengths [5, 10] for full annotation. Then, we select the 3,000 most difficult sentences of lengths [10, 20] from the remaining data for 50% annotation. Finally, we select the 6,000 most difficult sentences of lengths [5, 25] for 20% annotation from the remaining data. The difficulty of a sentence is computed as the averaged difficulty of its selected words.
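The difficulty-based selection described above can be sketched as follows. The function names, the pool format, and the uncertainty scores are hypothetical stand-ins for the parser-produced uncertainties; only the selection logic (score each sentence by the average uncertainty of its most uncertain words, then take the top n within a length band) follows the text:

```python
def sentence_difficulty(uncertainties, ratio):
    """Average uncertainty of the `ratio` most uncertain words in a sentence."""
    k = max(1, round(len(uncertainties) * ratio))
    top = sorted(uncertainties, reverse=True)[:k]
    return sum(top) / k

def select_for_annotation(pool, n, min_len, max_len, ratio):
    """Pick the n most difficult sentences whose lengths fall in [min_len, max_len].
    `pool` maps sentence id -> per-word uncertainty list (hypothetical format)."""
    candidates = [(sentence_difficulty(u, ratio), sid)
                  for sid, u in pool.items() if min_len <= len(u) <= max_len]
    candidates.sort(reverse=True)
    return [sid for _, sid in candidates[:n]]
```

With ratio = 1.0 this corresponds to the full-annotation batch, and with ratio = 0.5 or 0.2 to the partially annotated batches.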
Annotation platform. To guarantee annotation consistency and data quality, we build an online annotation platform to support strict double annotation and subsequent inconsistency handling. Each sentence is distributed to two random annotators. If the two submissions are not the same (inconsistent dependency or relation label), a third expert annotator compares them and decides on a single answer.
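The double-annotation workflow amounts to a simple merge rule, sketched below. The per-word annotation format (head, label) and the `expert_decide` callback are hypothetical simplifications of the platform's actual behavior:

```python
def merge_annotations(ann1, ann2, expert_decide):
    """Strict double annotation: accept a word's annotation when the two
    independent annotators agree exactly (same head and same relation label);
    otherwise defer to a third expert annotator."""
    merged = []
    for a1, a2 in zip(ann1, ann2):
        merged.append(a1 if a1 == a2 else expert_decide(a1, a2))
    return merged
```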
Annotation process. We employ about 20 students in our university as part-time annotators. Before real annotation, we first give a detailed talk on the guideline for about two hours. Then, the annotators spend several days systematically studying our guideline. Finally, they are required to annotate 50 testing sentences on the platform. If a submission differs from the correct answer, the annotator receives instant feedback for self-improvement. Based on their performance, about 10 capable annotators are chosen as experts to deal with inconsistent submissions.

Statistics and Analysis
Consistency statistics. Compared with the final answers, the overall accuracy of all annotators is 87.6%. Although the overall inter-annotator dependency-wise consistency rate is 76.5%, the sentence-wise consistency rate is only 43.7%. In other words, 56.3% (100 − 43.7) of the sentences are further checked by a third expert annotator. This shows how difficult it is to annotate syntactic structures and how important it is to employ strict double annotation to guarantee data quality.

Annotation time analysis. As shown in Table 2, the average sentence length is 15.4 words in our annotated data, among which 4.7 words (30%) are partially annotated with their heads. According to the records of our annotation platform, each sentence requires about 3 minutes on average, including the annotation time spent by the two annotators and possibly an expert. The total cost of our data annotation is about 550 person-hours, which could be completed by 20 full-time annotators within 4 days. Most of the cost is spent on quality control via independent double annotation and inconsistency handling by experts, in order to obtain very high-quality data. The cost would be reduced to about 150 person-hours without such strict quality control.
Heterogeneity analysis. In order to understand the heterogeneity between our guideline and the HIT-CDT guideline, we analyze the 36,348 words with both-side heads in the train data, as shown in Table 2. The consistency ratio of the two guidelines is 81.69% (UAS), without considering relation labels. By mapping each relation label in HIT-CDT (14 in total) to a single label of our guideline (20 in total), the maximum consistency ratio is 73.79% (LAS). The statistics are similar for the dev/test data.
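The two consistency ratios can be computed from bi-tree aligned word pairs as sketched below. The pair format is a hypothetical simplification of the real data; the labeled ratio is the maximum achievable under a many-to-one mapping from source labels to target labels, as described above:

```python
from collections import Counter

def consistency_ratios(pairs):
    """pairs: per-word tuples ((head_src, label_src), (head_tgt, label_tgt)).
    Returns the unlabeled consistency ratio (UAS-style) and the maximum labeled
    consistency ratio achievable by mapping each source label to one target label."""
    uas_hits = 0
    cooc = {}  # source label -> Counter of target labels, on head-consistent words
    for (h_src, l_src), (h_tgt, l_tgt) in pairs:
        if h_src == h_tgt:
            uas_hits += 1
            cooc.setdefault(l_src, Counter())[l_tgt] += 1
    # Each source label maps to its most frequent co-occurring target label
    las_hits = sum(max(c.values()) for c in cooc.values())
    n = len(pairs)
    return uas_hits / n, las_hits / n
```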

Indirect Multi-task Learning
Basic parser. In this work, we build all the approaches over the state-of-the-art deep biaffine parser proposed by Dozat and Manning (2017). As a graph-based dependency parser, it employs a deep biaffine neural network to compute the scores of all dependencies, and uses Viterbi decoding to find the highest-scoring tree. Figure 2 shows how to score a dependency i ← j. First, the biaffine parser applies multi-layer bidirectional sequential LSTMs (biSeqLSTM) to encode the input sentence. The word and tag embeddings e^w_k and e^t_k are concatenated as the input vector at w_k.
Two separate MLPs then transform the top-layer biSeqLSTM output h^seq_k into two lower-dimensional representation vectors:

    r^H_k = MLP^H(h^seq_k),    r^D_k = MLP^D(h^seq_k)

where r^H_k is the representation vector of w_k as a head word, and r^D_k as a dependent. Finally, the score of the dependency i ← j is computed via a biaffine operation:

    score(i ← j) = [r^D_i; 1]^T W^b r^H_j
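As a concrete illustration, the biaffine scoring step can be sketched in a few lines of numpy. The appended bias dimension follows the [r^D_i; 1] formulation above; the toy weight values here are only for shape checking, since the real matrix is learned:

```python
import numpy as np

def biaffine_scores(r_D, r_H, W):
    """Score all candidate dependencies at once.
    r_D: (n, d) dependent representations; r_H: (n, d) head representations;
    W: (d + 1, d) biaffine weights (the extra row absorbs the bias term).
    Returns scores where scores[i, j] = score(i <- j), i.e. w_j heading w_i."""
    n = r_D.shape[0]
    r_D1 = np.concatenate([r_D, np.ones((n, 1))], axis=1)  # append 1 for the bias
    return r_D1 @ W @ r_H.T

# Minimal check with toy values (identity-like weights, zero bias row)
scores = biaffine_scores(np.eye(2), np.array([[1., 2.], [3., 4.]]),
                         np.vstack([np.eye(2), np.zeros((1, 2))]))
```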
During training, the original biaffine parser uses the local softmax loss. For each w_i and its head w_j, the loss is defined as −log ( e^score(i←j) / Σ_k e^score(i←k) ). Since our training data is partially annotated, we follow Li et al. (2016) and employ the global CRF loss (Ma and Hovy, 2017) for better utilization of the data, leading to consistent accuracy gains.
Multi-task learning aims to incorporate labeled data of multiple related tasks to improve performance (Collobert and Weston, 2008). Guo et al. (2016) apply multi-task learning to multi-treebank exploitation based on the neural transition-based parser of Dyer et al. (2015), and achieve a larger improvement than the guiding-feature approach.
Based on the state-of-the-art biaffine parser, this work makes a straightforward extension to realize multi-task learning. We treat the source-side and target-side parsing as two individual tasks. The two tasks use shared parameters for the word/tag embeddings and multi-layer biSeqLSTMs to learn common-ground syntactic knowledge, and separate parameters for the MLP and biaffine layers to learn task-specific information.
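The parameter layout of this multi-task extension can be sketched as follows. The parameter names and the placeholder values are illustrative, not the paper's actual configuration; the point is which groups are shared and which are task-specific:

```python
def build_multitask_params(tasks=("src", "tgt")):
    """Shared encoder parameters plus per-task scoring parameters (sketch)."""
    shared = {"word_emb": "...", "tag_emb": "...", "biSeqLSTM": "..."}
    per_task = {t: {"MLP_H": "...", "MLP_D": "...", "biaffine": "..."}
                for t in tasks}
    return shared, per_task

def params_for_batch(task, shared, per_task):
    """A batch from either treebank updates the shared encoder and only
    that task's MLP/biaffine layers."""
    return {**{f"shared/{k}": v for k, v in shared.items()},
            **{f"{task}/{k}": v for k, v in per_task[task].items()}}
```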

Direct Treebank Conversion
Task definition. As shown in Figure 1, given an input sentence x, treebank conversion aims to convert the lower source-side tree d_src into the upper target-side tree d_tgt. Therefore, the main challenge is how to make full use of the given d_src to guide the construction of d_tgt. Specifically, under the biaffine parser framework, the key is to utilize d_src as guidance for better scoring an arbitrary target-side dependency i ← j.
In this paper, we try to encode the structural information of i and j in d_src as a dense vector at two representation levels, leading to two approaches, i.e., the shallow pattern embedding approach and the deep treeLSTM approach. The dense vectors are then used as extra inputs to the MLP layer to obtain better word representations, as shown in Figure 2.

The Pattern Embedding Approach
In this subsection, we propose the pattern embedding conversion approach by retrofitting the indirect guiding-feature method to the direct conversion scenario, with several substantial extensions.
The basic idea is to use extra guiding features produced by the source-side parser. First, they train the source parser Parser_src on the source-side treebank. Then, they use Parser_src to parse the target-side treebank, leading to pseudo bi-tree aligned data. Finally, they use the predictions of Parser_src as extra pattern-based guiding features to build a better target-side parser Parser_tgt.
The original method was proposed for traditional discrete-feature-based parsing, and does not consider the relation labels in d_src. In this work, we make a few useful extensions for more effective utilization of d_src.
• We further subdivide their "else" pattern into four cases according to the length of the path from w_i to w_j in d_src. The left part of Figure 2 shows all 9 patterns.
• We use the labels of w_i and w_j in d_src, denoted as l_i and l_j.
• Inspired by the treeLSTM approach, we also consider the label of w_a, the lowest common ancestor (LCA) of w_i and w_j, denoted as l_a.
Our pattern embedding approach works as follows. Given i ← j, we first decide its pattern type according to the structural relationship between w_i and w_j in d_src, denoted as p_{i←j}. For example, if w_i and w_j are both children of a third word w_k in d_src, then p_{i←j} = "sibling". Figure 2 shows all 9 patterns.
Then, we embed p_{i←j} into a dense vector e^p_{i←j} through a lookup operation in order to fit into the biaffine parser. Similarly, the three labels are also embedded into three dense vectors, i.e., e^l_i, e^l_j, e^l_a.
The four embeddings are concatenated as

    r^pat_{i←j} = e^p_{i←j} ⊕ e^l_i ⊕ e^l_j ⊕ e^l_a        (4)

to represent the structural information of w_i and w_j in d_src.
Finally, the representation vector r^pat_{i←j} and the top-layer biSeqLSTM outputs are concatenated as the inputs to the MLP layer.
Through r^pat_{i←j}, the extended word representations, i.e., r^D_{i,i←j} and r^H_{j,i←j}, now contain the structural information of w_i and w_j in d_src.
The remaining parts of the biaffine parser are unchanged. The extended r^D_{i,i←j} and r^H_{j,i←j} are fed into the biaffine layer to compute a more reliable score of the dependency i ← j, with the guidance of d_src.
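The pattern-type decision can be sketched as a small function over the source-tree head array. Only "sibling" is named explicitly in the text; the other pattern names and the exact subdivision of "else" by path length are illustrative assumptions, since the full 9-pattern inventory is given only in Figure 2:

```python
def tree_distance(i, j, heads):
    """Length of the path between w_i and w_j in the source tree.
    heads[k] is the head index of w_k, with 0 denoting the pseudo root."""
    chain_i = [i]
    k = i
    while k != 0:
        k = heads[k]
        chain_i.append(k)
    seen = set(chain_i)
    k, steps = j, 0
    while k not in seen:       # climb from j until hitting i's ancestor chain
        k = heads[k]
        steps += 1
    return chain_i.index(k) + steps

def pattern(i, j, heads):
    """Classify the structural relation of (dependent w_i, candidate head w_j)
    in d_src. Pattern names other than "sibling" are illustrative."""
    if heads[i] == j:
        return "consistent"            # the dependency also exists in d_src
    if heads[j] == i:
        return "reverse"
    if heads[i] == heads[j]:
        return "sibling"               # both children of the same word
    if heads[i] != 0 and heads[heads[i]] == j:
        return "grandparent"
    # Subdivide the remaining "else" cases by the path length (assumption)
    return f"else-dist{min(tree_distance(i, j, heads), 5)}"
```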

The TreeLSTM Approach
Compared with the pattern embedding approach, our second conversion approach employs a treeLSTM to obtain a deeper representation of i ← j in the source-side tree d_src. Tai et al. (2015) first propose the treeLSTM as a generalization of the seqLSTM for encoding tree-structured inputs, and show that the treeLSTM is more effective than the seqLSTM on semantic relatedness and sentiment classification tasks. Miwa and Bansal (2016) compare three treeLSTM variants on the relation extraction task and show that the SP-tree (shortest path) treeLSTM is superior to the full-tree and subtree treeLSTMs.
In this work, we employ the SP-tree treeLSTM of Miwa and Bansal (2016) for our treebank conversion task. Our preliminary experiments also show that the SP-tree treeLSTM outperforms the full-tree treeLSTM, which is consistent with Miwa and Bansal (2016). We did not implement the in-between subtree treeLSTM.

Figure 2: Computation of score(i ← j) in our proposed conversion approaches. Without the source-side tree d_src, the baseline uses the basic r^D_i and r^H_j (instead of r^D_{i,i←j} and r^H_{j,i←j}).
Given w_i and w_j and their LCA w_a, the SP-tree is composed of two paths, i.e., the path from w_a to w_i and the path from w_a to w_j, as shown in the right part of Figure 2.
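The SP-tree can be extracted from the source-tree head array with a standard LCA computation; a minimal sketch, where the head-array format is a hypothetical simplification:

```python
def sp_tree(i, j, heads):
    """Return (lca, path_i, path_j): the LCA w_a of w_i and w_j in d_src, and
    the two paths from w_a down to w_i and to w_j.
    heads[k] is w_k's head index; 0 denotes the pseudo root."""
    def chain_to_root(k):
        c = [k]
        while k != 0:
            k = heads[k]
            c.append(k)
        return c
    chain_i, chain_j = chain_to_root(i), chain_to_root(j)
    on_j_path = set(chain_j)
    lca = next(node for node in chain_i if node in on_j_path)
    path_i = chain_i[:chain_i.index(lca) + 1][::-1]  # lca -> ... -> w_i
    path_j = chain_j[:chain_j.index(lca) + 1][::-1]  # lca -> ... -> w_j
    return lca, path_i, path_j
```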
Different from the shallow pattern embedding approach, the treeLSTM approach runs a bidirectional treeLSTM over the SP-tree, in order to encode the structural information of w_i and w_j in d_src at a deeper level. The top-down treeLSTM starts from w_a and accumulates information down to w_i and w_j, whereas the bottom-up treeLSTM propagates information in the opposite direction.
Following Miwa and Bansal (2016), we stack our treeLSTM on top of the biSeqLSTM layer of the basic biaffine parser, instead of directly using word/tag embeddings as inputs. For example, the input vector for w_k in the treeLSTM is x_k = h^seq_k ⊕ e^l_k, where h^seq_k is the top-level biSeqLSTM output vector at w_k, l_k is the label between w_k and its head word in d_src, and e^l_k is the label embedding.
In the bottom-up treeLSTM, an LSTM node computes a hidden vector based on the combination of the input vector and the hidden vectors of its children in the SP-tree. The right part of Figure 2 and Eq. (5) illustrate the computation at w_a, following the standard child-sum formulation:

    h̃_a = Σ_{k∈C(a)} h_k
    i_a = σ(W^(i) x_a + U^(i) h̃_a + b^(i))
    f_{a,k} = σ(W^(f) x_a + U^(f) h_k + b^(f))
    o_a = σ(W^(o) x_a + U^(o) h̃_a + b^(o))
    u_a = tanh(W^(u) x_a + U^(u) h̃_a + b^(u))
    c_a = i_a ⊙ u_a + Σ_{k∈C(a)} f_{a,k} ⊙ c_k
    h_a = o_a ⊙ tanh(c_a)                                   (5)

where C(a) denotes the children of w_a in the SP-tree, and f_{a,k} is the forget vector for w_a's child w_k. The top-down treeLSTM sends information from the root w_a to the leaves w_i and w_j. An LSTM node computes a hidden vector based on the combination of its input vector and the hidden vector of its single preceding (father) node in the SP-tree.
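A minimal numpy sketch of one bottom-up node update, assuming the child-sum formulation above (the parameter-dict layout and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_step(x, child_h, child_c, P):
    """One bottom-up treeLSTM node: combine the input vector x with the hidden
    (child_h) and cell (child_c) vectors of the node's children in the SP-tree."""
    d = P["Ui"].shape[1]
    h_tilde = sum(child_h) if child_h else np.zeros(d)      # summed child states
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_tilde + P["bi"])  # input gate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_tilde + P["bo"])  # output gate
    u = np.tanh(P["Wu"] @ x + P["Uu"] @ h_tilde + P["bu"])  # candidate update
    # One forget gate per child, conditioned on that child's own hidden state
    f = [sigmoid(P["Wf"] @ x + P["Uf"] @ hk + P["bf"]) for hk in child_h]
    c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
    h = o * np.tanh(c)
    return h, c
```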
After performing the biTreeLSTM, we follow Miwa and Bansal (2016) and use the combination of three output vectors to represent the structural information of w_i and w_j in d_src, i.e., the output vectors of w_i and w_j in the top-down treeLSTM, and the output vector of w_a in the bottom-up treeLSTM.
Similar to Eq. (4) for the pattern embedding approach, we concatenate r^tree_{i←j} with the output vectors of the top-layer biSeqLSTM, and feed them into MLP^H/D.

Experiment Settings
Data. We randomly select 1,000/2,000 sentences from our newly annotated data as the dev/test datasets, and use the remaining sentences as train. Table 2 shows the data statistics after removing some broken sentences (ungrammatical or wrongly segmented) discovered during annotation. The "#Tok (our)" column shows the number of tokens annotated according to our guideline. Train-HIT contains all sentences in HIT-CDT except those in dev/test; most of its sentences only have the HIT-CDT annotations.
Evaluation. We use the standard labeled attachment score (LAS, UAS for unlabeled) to measure the parsing and conversion accuracy.
Implementation. In order to more flexibly realize our ideas, we re-implement the baseline biaffine parser in C++ based on a lightweight neural network library. On the Chinese CoNLL-2009 data, our parser achieves 85.80% LAS, whereas the original tensorflow-based parser (https://github.com/tdozat/Parser-v1) achieves 85.54% (85.38% reported in their paper) under the same parameter settings and external word embeddings.
Hyper-parameters. We follow most parameter settings of Dozat and Manning (2017). The external word embedding dictionary is trained on Chinese Gigaword (LDC2003T09) with GloVe (Pennington et al., 2014). For efficiency, we use two biSeqLSTM layers instead of three, and reduce the biSeqLSTM output dimension to 300 and the MLP output dimension to 200. For the conversion approaches, the source-side pattern/label embedding dimensions are 50 (thus |r^pat_{i←j}| = 200), and the treeLSTM output dimension is 100 (thus |r^tree_{i←j}| = 300). During training, we use 200 sentences as a data batch, and evaluate the model on the dev data every 50 batches (as an epoch). Training stops when the peak LAS on dev has not increased for 50 consecutive epochs.
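The stopping criterion can be sketched as a simple patience loop; `evaluate_dev` is a hypothetical stand-in for training 50 batches and evaluating on the dev data:

```python
def train_with_patience(evaluate_dev, max_epochs=100000, patience=50):
    """Stop when the peak dev LAS has not improved for `patience` epochs."""
    best_las, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        las = evaluate_dev(epoch)       # one "epoch" = 50 batches + dev eval
        if las > best_las:
            best_las, best_epoch = las, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_las, best_epoch
```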
For the multi-task learning approach, we randomly sample 100 train sentences and 100 train-HIT sentences to compose a data batch, for the purpose of corpus weighting.
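This batching scheme amounts to the following sampler; the sentence representation and sampling details are illustrative, and only the fixed 100/100 mix is from the text:

```python
import random

def mixed_batch(train, train_hit, rng, n_per_side=100):
    """One multi-task batch: a fixed mix of target-side (train) and source-side
    (train-HIT) sentences, a simple form of corpus weighting."""
    return rng.sample(train, n_per_side) + rng.sample(train_hit, n_per_side)
```

Because train-HIT is much larger than train, drawing equal numbers from both sides up-weights the small target-side treebank relative to uniform sampling over the union.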
To fully utilize train-HIT for the conversion task, the conversion models are built upon multi-task learning, and directly reuse the embeddings and biSeqLSTMs of the multi-task trained model without fine-tuning.

Results: Treebank Conversion

Table 3 shows the conversion accuracy on the test data. As a strong baseline for the conversion task, the multi-task trained target-side parser ("multi-task") does not use d_src during either training or evaluation. In contrast, the conversion approaches use both the sentence x and d_src as inputs.

Compared with "multi-task", the two proposed conversion approaches achieve nearly the same accuracy as each other, and both dramatically improve accuracy thanks to the extra guidance of d_src. The gain is 7.58 (82.09 − 74.51) LAS points for the treeLSTM approach.
It is straightforward to combine the two conversion approaches: we simply concatenate h^seq_{i/j} with both r^pat_{i←j} and r^tree_{i←j} before feeding into MLP^H/D. However, the "combined" model leads to no further improvement. This indicates that although the two approaches try to encode the structural information of w_i and w_j in d_src from different perspectives, the resulting representations are actually overlapping rather than complementary, which is contrary to our intuition that the treeLSTM approach should give better and deeper representations than the shallow pattern embedding approach.
We have also tried several straightforward modifications to the standard treeLSTM in Eq. (5), but found no further improvement. We leave further exploration of better treeLSTMs and model combination approaches for future work. Feature ablation results are presented in Table 4 to gain more insight into the two proposed conversion approaches.
In each experiment, we remove a single component from the full model to learn its individual contribution.
For the pattern embedding approach, all proposed extensions to the basic pattern-based approach are useful. Among the three labels, the embedding of l_i is the most useful, and its removal leads to the largest LAS drop of 0.88 (82.03 − 81.15). This is reasonable considering that 81.69% of dependencies are consistent between the two guidelines, as discussed in the heterogeneity analysis of Section 2.2. Removing all three labels decreases UAS by 0.73 (86.66 − 85.93) and LAS by 1.95 (82.03 − 80.08), demonstrating that the source-side labels are highly correlated with the target-side labels, and therefore very helpful for improving LAS.
For the treeLSTM approach, the source-side labels in d_src are also very useful, improving UAS by 0.49 (86.69 − 86.20) and LAS by 1.53 (82.09 − 80.56).

Results: Utilizing Converted Data
Another important question is whether treebank conversion can lead to higher parsing accuracy than multi-task learning. In terms of model simplicity, treebank conversion is better because the target-side parser is eventually trained directly on an enlarged homogeneous treebank, unlike the multi-task learning approach, which needs to simultaneously train two parsers on two heterogeneous treebanks. Table 5 shows the empirical results. Please note that the parsing accuracy looks very low because the test data is partially annotated: only about 30% of the words (the most uncertain ones) are manually labeled with their heads according to our guideline, as discussed in Section 2.1.
In the first row, "single" is the baseline target-side parser trained on the train data.
In the second row, "single (hetero)" refers to the source-side heterogeneous parser trained on train-HIT and evaluated on the target-side test data. Since the similarity between the two guidelines is high, as discussed in Section 2.2, the source-side parser achieves even higher UAS, by 0.21 (76.20 − 75.99), than the baseline target-side parser trained on the small-scale train data. The LAS is obtained by mapping the HIT-CDT labels to ours (Section 2.2).
In the third row, "multi-task" is the target-side parser trained on train & train-HIT with the multi-task learning approach. It significantly outperforms the baseline parser by 4.30 (74.51 − 70.21) LAS points. This shows that the multi-task learning approach can effectively utilize the large-scale train-HIT to help the target-side parsing.
In the fourth row, "single (large)" is the basic parser trained on the large-scale converted train-HIT (homogeneous). We employ the treeLSTM approach to convert all sentences in train-HIT into our guideline. We can see that the single parser trained on the converted data significantly outperforms the parser in the multi-task learning approach by 1.32 (75.83 − 74.51) LAS points.

Table 5: Parsing accuracy on the test data. The LAS difference between any two systems is statistically significant (p < 0.005) according to Dan Bikel's randomized parsing evaluation comparer (Noreen, 1989).
In summary, we can conclude that treebank conversion is superior to multi-task learning in multi-treebank exploitation for its simplicity and better performance.

Results on fully annotated data
We randomly divided the newly annotated data into train/dev/test, so the test set is a mix of 100%, 50%, and 20% annotated sentences. To gain a rough estimate of the performance of the different approaches on fully annotated data, we give the results in Table 6. We can see that all the models achieve much higher accuracy on the fully annotated portion than on the whole test data shown in Tables 3 and 5, since for the partially annotated portion the dependencies to be evaluated are the most difficult ones in a sentence. Moreover, the conversion model achieves over 90% LAS thanks to the guidance of the source-side HIT-CDT tree. Please also note that there may still be a slight bias, because the fully annotated sentences were chosen as the most difficult ones according to the parsing model, but are also very short ([5, 10]).

Conclusions and Future Work
In this work, we for the first time propose the task of supervised treebank conversion, and construct a bi-tree aligned dataset of over ten thousand sentences. We design two simple yet effective conversion approaches based on the state-of-the-art deep biaffine parser. Results show that 1) the two approaches achieve nearly the same conversion accuracy; 2) relation labels in the source-side tree are very helpful for both approaches; 3) treebank conversion is more effective than multi-task learning in multi-treebank exploitation, and achieves significantly higher parsing accuracy.
In the future, we would like to advance this work in two directions: 1) proposing more effective conversion approaches, especially by exploring the potential of treeLSTMs; 2) constructing bi-tree aligned data for other treebanks, and exploiting all available single-tree and bi-tree labeled data for better conversion.